Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-04-24 Thread Michael Paquier
On Thu, Mar 30, 2023 at 12:07:58AM -0500, Justin Pryzby wrote:
> What more points?  There's nothing that's been raised here.  In fact,
> your message last week is the first comment since last June ..

When I wrote this message, I felt like this may still be missing
something in the area of dump/restore.  Perhaps my feeling on the
matter is wrong, so consider this as a self-reminder not to be taken
seriously until I can have a closer look at what's proposed here for
v17.  :p
--
Michael




Re: Autogenerate some wait events code and documentation

2023-04-24 Thread Michael Paquier
On Mon, Apr 24, 2023 at 09:03:53AM +0200, Drouvot, Bertrand wrote:
> Oh right, fixed.

I may tweak a few things if I put my hands on it, but that looks
pretty solid seen from here.  I have spotted a few extra issues, though.

One thing I have noticed with v4 is that the order of the tables
generated in wait_event_types.h and the SGML docs is inconsistent with
previous versions; they are no longer in alphabetical order.  HEAD
orders them as Activity, BufferPin, Client, Extension, IO, IPC, Lock,
LWLock and Timeout.  This patch changes that order, though note that
the first table, which describes the wait event type classes
themselves, gets it right.

It seems to me that you should apply an extra ordering in
generate-waiteventnames.pl to make sure that the tables are printed in 
the same order as previously, around here:
+# Generate the output files
+foreach $waitclass (keys %hashwe) {

(The table describing all the wait event types could be removed from
the SGML docs as well, at the cost of having their description in the
new .txt file.  However, as these are long, it would make the .txt
file much messier, so not doing this extra part is OK for me.)

- * Use this category when a process is waiting because it has no work to do,
- * unless the "Client" or "Timeout" category describes the situation better.
- * Typically, this should only be used for background processes

wait_event.h includes a set of comments describing each category, which
this patch removes.  Rather than dropping this information, which is
helpful to have around, why not make it a set of comments in
waiteventnames.txt instead?  Losing this information would be sad.

+#   src/backend/utils/activity/pgstat_wait_event.c
+#  c functions to get the wait event name based on the enum
Nit.  'c' should be upper-case.

}
+
if (IsNewer(
'src/include/storage/lwlocknames.h',
Not wrong, but this is an unrelated diff.

+if %DIST%==1 if exist src\backend\utils\activity\pgstat_wait_event.c del /q 
src\backend\utils\activity\pgstat_wait_event.c
 if %DIST%==1 if exist src\backend\storage\lmgr\lwlocknames.h del /q 
src\backend\storage\lmgr\lwlocknames.h
+if %DIST%==1 if exist src\backend\utils\activity\wait_event_types.h del /q 
src\backend\utils\activity\wait_event_types.h
The order here is off a bit.  I missed that previously.

perltidy had better be run on generate-waiteventnames.pl (I can do
that myself, no worries), as the formatting of a couple of lines
doesn't seem quite right.
--
Michael




Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Amit Kapila
On Mon, Apr 24, 2023 at 5:38 PM Drouvot, Bertrand
 wrote:
>
> On 4/24/23 11:45 AM, Amit Kapila wrote:
> > On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila  
> > wrote:
> >>
> >> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
> >>  wrote:
> >>>
> >>
> >> Few comments:
> >> 
> >>
> >
> > +# We cannot test whether the WAL file still exists immediately.
> > +# We need to give the standby some time to actually "remove" it.
> > +my $i = 0;
> > +while (1)
> > +{
> > + last if !-f $standby_walfile;
> > + if ($i++ == 10 * $default_timeout)
> > + {
> > + die
> > +   "could not determine if WAL file has been retained or not, can't 
> > continue";
> > + }
> > + usleep(100_000);
> > +}
> >
> > Is this adhoc wait required because we can't guarantee that the
> > checkpoint is complete on standby even after using wait_for_catchup?
>
> Yes, the restart point on the standby is not necessarily completed even
> after wait_for_catchup is done.
>
> > Is there a guarantee that it can never fail on some slower machines?
> >
>
> We are waiting here for at most 10 * $default_timeout (i.e., 3 minutes)
> before we time out. Would you prefer to wait more than 3 minutes at a
> maximum?
>

No, because I don't know what a suitable timeout would be here. At
this stage, I don't have a good idea of how to implement this test in
a better way. Can we split this into a separate patch? The first
test is fairly straightforward, so we can push that one and then
brainstorm on whether there is a better way to test this functionality.

-- 
With Regards,
Amit Kapila.




Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Jeff Davis
On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
> I think I might like this idea, except for one thing: you're
> imagining
> that the locale doesn't control anything except string comparisons.
> What about to_upper/to_lower, character classifications in regexes,
> etc?

If provider='libc' and LC_CTYPE='C', str_toupper/str_tolower are
handled with asc_toupper/asc_tolower. The regex character
classification is done with pg_char_properties. In these cases neither
ICU nor libc is used; it's just code in postgres.

libc is special in that you can set LC_COLLATE and LC_CTYPE separately,
so that different locales are used for sorting and character
classification. That's potentially useful: you can set LC_COLLATE to C
for performance reasons while setting LC_CTYPE to a useful locale. We
don't allow ICU to set collation and ctype separately (it would be
possible to allow it, but I don't think there's a huge demand, and it's
arguably inconsistent to set them differently).
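
For example, a minimal sketch of that libc split (the collation and
table names are invented, and it assumes the OS provides an en_US.utf8
locale):

    -- Sort with the cheap "C" ordering, but keep real character
    -- semantics for upper()/lower() and regexes.
    CREATE COLLATION fast_c_sort (
        provider   = libc,
        lc_collate = 'C',
        lc_ctype   = 'en_US.utf8'
    );

    CREATE TABLE t (name text COLLATE fast_c_sort);
    SELECT upper(name) FROM t ORDER BY name;  -- C sort, en_US ctype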

> (I'm not sure whether those operations can get redirected to ICU
> today
> or whether they still always go to libc, but we'll surely want to fix
> it eventually if the latter is still true.)

Those operations do get redirected to ICU today. There are extensions
that call locale-sensitive libc functions directly, and obviously those
won't use ICU.


> Aside from the user-surprise issues discussed up to now, pg_dump
> scripts
> emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
> clauses in CREATE DATABASE, and people are going to be very unhappy
> if that means they suddenly get totally different locale semantics
> after restoring into a new DB.

Agreed.

>   I think we need some plan for mapping
> libc-style locale specs into ICU locales so that we can make that
> more nearly transparent.

ICU does a reasonable job mapping libc-like locale names to ICU
locales, e.g. en_US to en-US, etc. The ordering semantics aren't
guaranteed to be the same, of course (because the libc-locales are
platform-dependent), but it's at least conceptually the same locale.
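
For example (the collation name is invented, and the exact canonical
form stored may vary across ICU/PostgreSQL versions):

    CREATE COLLATION en_us_icu (provider = icu, locale = 'en_US');

    -- The libc-style name is canonicalized to an ICU language tag:
    SELECT collname, colliculocale
    FROM pg_collation WHERE collname = 'en_us_icu';
    --  collname  | colliculocale
    -- -----------+---------------
    --  en_us_icu | en-US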

> 
> Maybe this means we are not ready to do ICU-by-default in v16.
> It certainly feels like there might be more here than we want to
> start designing post-feature-freeze.

This thread is already on the Open Items list. As long as it's not too
disruptive to others I'll leave it as-is for now to see how this sorts
out. Right now it's not clear to me how much of this is a v15 issue vs
a v16 issue.

Regards,
Jeff Davis





Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Amit Kapila
On Mon, Apr 24, 2023 at 3:36 PM Drouvot, Bertrand
 wrote:
>
> On 4/24/23 8:24 AM, Amit Kapila wrote:
>
> > 2.
> > +# Speed up the subscription creation
> > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> > +
> > +# Explicitly shut down psql instance gracefully - to avoid hangs
> > +# or worse on windows
> > +$psql_subscriber{subscriber_stdin} .= "\\q\n";
> > +$psql_subscriber{run}->finish;
> > +
> > +# Insert some rows on the primary
> > +$node_primary->safe_psql('postgres',
> > + qq[INSERT INTO tab_rep select generate_series(1,10);]);
> > +
> > +$node_primary->wait_for_replay_catchup($node_standby);
> > +
> > +# To speed up the wait_for_subscription_sync
> > +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> > +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
> >
> > It is not clear to me why you need to do pg_log_standby_snapshot() twice.
>
> That's because there are 2 logical slot creations that have to be done
> on the standby.
>
> The one for the subscription:
>
> "
> CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
> "
>
> And the one for the data sync:
>
> "
> CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL 
> pgoutput (SNAPSHOT 'use')
> "
>
> Without the second "pg_log_standby_snapshot()", wait_for_subscription_sync()
> would be waiting for some time on the poll for "SELECT count(1) = 0 FROM
> pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"
>
> Adding a comment in V3 to explain the need for the second 
> pg_log_standby_snapshot().
>

Won't this still be unpredictable, because it is possible that the
tablesync worker takes more time to get launched or to create a
replication slot? If that happens after your second
pg_log_standby_snapshot(), then wait_for_subscription_sync() will hang.
Wouldn't it be better to create the subscription with (copy_data =
false), as sketched below, to make it predictable? Then we won't need
pg_log_standby_snapshot() to be performed twice.

If you agree with the above suggestion, then you probably need to move
wait_for_subscription_sync() before the Insert.
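
A sketch of what I mean (the connection string and publication name are
invented for illustration):

    CREATE SUBSCRIPTION tap_sub
        CONNECTION 'host=standby dbname=postgres'
        PUBLICATION tap_pub
        WITH (copy_data = false);

With copy_data = false, no tablesync slot is created on the standby, so
only the slot for tap_sub itself is needed, and a single
pg_log_standby_snapshot() call should suffice.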

-- 
With Regards,
Amit Kapila.




Re: seemingly useless #include recently added

2023-04-24 Thread Thomas Munro
On Tue, Apr 25, 2023 at 3:12 PM Tom Lane  wrote:
> Kyotaro Horiguchi  writes:
> > While working on a patch, I noticed that a recent commit (d4e71df6d75)
> > added an apparently unnecessary inclusion of guc.h in smgr.h.
>
> Yes, that seems quite awful, and I also wonder why it changed fd.h.
> Adding #include's to header files is generally not the first choice.

Agreed for smgr.h.  Will push when I'm back at a real computer soon,
or +1 from me if someone else wants to.  It must have been left over
from an earlier version that had a different arrangement with multiple
GUCs in different places and might have needed GUC-related types to
declare the check functions or something like that; sorry.  As for
fd.h, the reason it now includes <fcntl.h> is that fd.h tests whether
O_DIRECT is defined, so in fact that was an omission from 2dbe8905
which moved the #if defined(O_DIRECT) stuff from xlogdefs.h to fd.h
but failed to move the #include with it; I will check if something
needs to be back-patched there.




Re: seemingly useless #include recently added

2023-04-24 Thread Tom Lane
Kyotaro Horiguchi  writes:
> While working on a patch, I noticed that a recent commit (d4e71df6d75)
> added an apparently unnecessary inclusion of guc.h in smgr.h.

Yes, that seems quite awful, and I also wonder why it changed fd.h.
Adding #include's to header files is generally not the first choice.

regards, tom lane




seemingly useless #include recently added

2023-04-24 Thread Kyotaro Horiguchi
Hello.

While working on a patch, I noticed that a recent commit (d4e71df6d75)
added an apparently unnecessary inclusion of guc.h in smgr.h.

The only change the commit made to the file is the added #include
directive, which doesn't seem to serve any purpose, and the build
actually succeeds without it.  Moreover, it brings in some
server-related stuff when I include smgr.h in storage_xlog.h, causing
compilation issues for pg_rewind.

Should we remove it? Please find the attached patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 75d35c25d4cfaeb6fb40e2d451381beac1d5475c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 25 Apr 2023 11:43:32 +0900
Subject: [PATCH] Remove unnecessary include

The recently added inclusion of guc.h in smgr.h is not necessary and
introduces more server-related stuff. Removing the directive helps
avoid potential issues with including smgr.h in frontends.
---
 src/include/storage/smgr.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 17fba6f91a..a9a179aaba 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,7 +17,6 @@
 #include "lib/ilist.h"
 #include "storage/block.h"
 #include "storage/relfilelocator.h"
-#include "utils/guc.h"
 
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
-- 
2.31.1



Re: buffer refcount leak in foreign batch insert code

2023-04-24 Thread Michael Paquier
On Mon, Apr 24, 2023 at 09:57:10AM +0900, Michael Paquier wrote:
> The attached is what I am finishing with, with a minimal regression
> test added to postgres_fdw.  Two partitions are enough.

Well, I have gone through that again this morning, and applied the fix
down to 14.  The buildfarm is digesting it fine.
--
Michael




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Melanie Plageman
On Mon, Apr 24, 2023 at 7:02 PM Melanie Plageman 
wrote:

> On Mon, Apr 24, 2023 at 03:56:54PM -0700, Andres Freund wrote:
> >
> > I was thinking we'd track writeback separately from the write, rather
> than
> > attributing the writeback to "write".  Otherwise it looks good, based on
> a
> > quick skim.
>
> Like you want a separate IOOp IOOP_WRITEBACK? Interesting. Okay.
>


Okay, attached v2 does this (adds IOOP_WRITEBACK).

With my patch applied and the same pgbench setup as you (for -T30):

After pgbench:

    backend_type    |  object  | context | writes | write_time | writebacks | writeback_time | fsyncs | fsync_time
--------------------+----------+---------+--------+------------+------------+----------------+--------+------------
 background writer  | relation | normal  |   5581 |     23.416 |       5568 |          32.33 |      0 |          0
 checkpointer       | relation | normal  |  89116 |    295.576 |      89106 |          416.5 |     84 |   5242.764


and then after a stats reset followed by an explicit checkpoint:


 backend_type |  object  | context | writes | write_time | writebacks | writeback_time | fsyncs | fsync_time
--------------+----------+---------+--------+------------+------------+----------------+--------+------------
 checkpointer | relation | normal  | 229807 | 457.436004 |     229817 |         532.84 |     52 |    378.652
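
The output above is from a query of roughly this shape (the writebacks
and writeback_time columns exist only with the patch applied):

    SELECT backend_type, object, context,
           writes, write_time, writebacks, writeback_time,
           fsyncs, fsync_time
    FROM pg_stat_io
    WHERE backend_type IN ('background writer', 'checkpointer');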


I've yet to cook up a client backend test case (e.g. with COPY). I've taken
that as a todo.

I have a few outstanding questions:

1) Does it make sense for writebacks to count the number of blocks for
which writeback was requested, the number of calls to smgrwriteback(), or
the number of actual syscalls made? We don't actually know from outside
of mdwriteback() how many FileWriteback() calls we will make.

2) I'm a little nervous about not including IOObject in the writeback
context. Technically, there is nothing stopping local buffer code from
calling IssuePendingWritebacks(). Right now, local buffer code doesn't
do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
hardcode IOOBJECT_RELATION when there is nothing wrong with
requesting writeback of local buffers (AFAIK). What do you think?

3) Should any restrictions be added to pgstat_tracks_io_object() or
pgstat_tracks_io_op()? I couldn't think of any backend types or IO
contexts which would not do writeback as a rule. Also, though we don't
do writeback for temp tables now, it isn't nonsensical to do so. In
this version, I didn't add any restrictions.

Docs need work. I added a placeholder for the new columns. I'll update it
once we decide what writebacks should actually count. And, I don't think
we can do any kind of ongoing test.

- Melanie
From b20765526e5c4bb5a82734ad9977552ebf9dad2f Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v2] Add writeback to pg_stat_io

---
 doc/src/sgml/monitoring.sgml  | 24 
 src/backend/catalog/system_views.sql  |  2 ++
 src/backend/postmaster/bgwriter.c |  4 ++--
 src/backend/storage/buffer/buf_init.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 12 ++--
 src/backend/utils/adt/pgstatfuncs.c   |  5 +
 src/include/catalog/pg_proc.dat   |  6 +++---
 src/include/pgstat.h  |  3 ++-
 src/include/storage/buf_internals.h   |  4 +++-
 src/test/regress/expected/rules.out   |  4 +++-
 10 files changed, 55 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..62572327cf 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,30 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>writebacks</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of writeback operations, each of the size specified in
+       <varname>op_bytes</varname>.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>writeback_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Time spent in writeback operations in milliseconds (if
+       <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+      </para></entry>
+     </row>
+
  
   

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
b.read_time,
b.writes,
b.write_time,
+   b.writebacks,
+   b.writeback_time,
b.extends,
b.extend_time,
b.op_bytes,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..4c3540d6e1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,7 @@ BackgroundWriterMain(void)
 			 

Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2023-04-24 Thread Justin Pryzby
On Thu, Mar 30, 2023 at 12:07:58AM -0500, Justin Pryzby wrote:
> On Mon, Mar 27, 2023 at 11:34:36PM -0500, Justin Pryzby wrote:
> > On Tue, Mar 28, 2023 at 09:13:10AM +0900, Michael Paquier wrote:
> > > On Mon, Mar 20, 2023 at 09:30:50AM +0900, Michael Paquier wrote:
> > > > Did you check dump and restore flows with partition
> > > > trees and --no-table-access-method?  Perhaps there should be
> > > > some regression tests with partitioned tables?
> > > 
> > > I was looking at the patch, and as I suspected the dumps generated
> > > are forgetting to apply the AM to the partitioned tables.
> > 
> > The patch said:
> > 
> > +   else if (RELKIND_HAS_TABLE_AM(relkind) || relkind == 
> > RELKIND_PARTITIONED_TABLE)
> > 
> > pg_dump was missing a similar change that's conditional on
> > RELKIND_HAS_TABLE_AM().  It looks like those are the only two places
> > that need to be specially handled.
> > 
> > I dug up my latest patch from 2021 and incorporated the changes from the
> > 0001 patch here, and added a test case.
> > 
> > I realized that one difference with tablespaces is that, as written,
> > partitioned tables will *always* have an AM specified, and partitions
> > will never use default_table_access_method.  Is that what's intended?
> > 
> > Or do we need logic similar to tablespaces, such that the relam of a
> > partitioned table is set only if it differs from default_table_access_method?
> 
> Actually .. I think it'd be a mistake if the relam needed to be
> interpreted differently for partitioned tables than for other relkinds.
> 
> I updated the patch to allow intermediate partitioned tables to inherit
> relam from their parent.
> 
> Michael wrote:
> > .. and there are quite more points to consider.
> 
> What more points?  There's nothing that's been raised here.  In fact,
> your message last week is the first comment since last June ..

Michael?
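
For anyone catching up on the thread, here is a sketch of the behavior
under discussion (table names invented; this requires the patch
applied, as stock ALTER TABLE rejects it on partitioned tables):

    CREATE TABLE measurement (ts timestamptz, v int)
        PARTITION BY RANGE (ts);
    ALTER TABLE measurement SET ACCESS METHOD heap;

    -- New partitions then inherit the parent's relam rather than
    -- falling back to default_table_access_method:
    CREATE TABLE measurement_y2023 PARTITION OF measurement
        FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');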




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 15:32:25 -0700, Andres Freund wrote:
> On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:
> > I'm often seeing PG16 builds erroring out in the pgbench tests:
> > I don't think the disk is full since it's always hitting that same
> > spot, on some of the builds:
> 
> Yea, the EINTR pretty clearly indicates that it's not really out-of-space.

FWIW, I tried to reproduce this, without success - not too surprising, I
assume it's rather timing dependent.


> We obviously can add a retry loop to FileFallocate(), similar to what's
> already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
> further, and do it for all the fd.c routines where it's remotely plausible
> EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
> the functions.
> 
> 
> The following are documented to potentially return EINTR, without fd.c having
> code to retry:
> 
> - FileWriteback() / pg_flush_data()
> - FileSync() / pg_fsync()
> - FileFallocate()
> - FileTruncate()
> 
> With the first two there's the added complication that it's not entirely
> obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
> latter is a bit more sensible?

A prototype of that approach is attached. I pushed the retry handling into the
pg_* routines where applicable.  I guess we could add pg_* routines for
FileFallocate(), FilePrefetch() etc. as well, but I didn't do that here.

Christoph, could you verify this fixes your issue?

Greetings,

Andres Freund
From d8f5b0d4765044a0f35d01054c9d2720c6045b4c Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 24 Apr 2023 16:53:52 -0700
Subject: [PATCH v1] fd.c: Retry after EINTR in more places

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/storage/file/fd.c | 71 +--
 1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 277a28fc137..2b232a80489 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -415,10 +415,18 @@ pg_fsync(int fd)
 int
 pg_fsync_no_writethrough(int fd)
 {
-	if (enableFsync)
-		return fsync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fsync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -448,10 +456,18 @@ pg_fsync_writethrough(int fd)
 int
 pg_fdatasync(int fd)
 {
-	if (enableFsync)
-		return fdatasync(fd);
-	else
+	int		rc;
+
+	if (!enableFsync)
 		return 0;
+
+retry:
+	rc = fdatasync(fd);
+
+	if (rc == -1 && errno == EINTR)
+		goto retry;
+
+	return rc;
 }
 
 /*
@@ -483,6 +499,7 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		if (not_implemented_by_kernel)
 			return;
 
+retry:
 		/*
 		 * sync_file_range(SYNC_FILE_RANGE_WRITE), currently linux specific,
 		 * tells the OS that writeback for the specified blocks should be
@@ -498,6 +515,9 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 		{
 			int			elevel;
 
+			if (rc == EINTR)
+				goto retry;
+
 			/*
 			 * For systems that don't have an implementation of
 			 * sync_file_range() such as Windows WSL, generate only one
@@ -629,32 +649,51 @@ pg_flush_data(int fd, off_t offset, off_t nbytes)
 #endif
 }
 
+static int
+pg_ftruncate(int fd, off_t length)
+{
+	int			ret;
+
+retry:
+	ret = ftruncate(fd, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+
+	return ret;
+}
+
 /*
  * Truncate a file to a given length by name.
  */
 int
 pg_truncate(const char *path, off_t length)
 {
+	int			ret;
 #ifdef WIN32
 	int			save_errno;
-	int			ret;
 	int			fd;
 
 	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
 	if (fd >= 0)
 	{
-		ret = ftruncate(fd, length);
+		ret = pg_ftruncate(fd, length);
 		save_errno = errno;
 		CloseTransientFile(fd);
 		errno = save_errno;
 	}
 	else
 		ret = -1;
+#else
+
+retry:
+	ret = truncate(path, length);
+
+	if (ret == -1 && errno == EINTR)
+		goto retry;
+#endif
 
 	return ret;
-#else
-	return truncate(path, length);
-#endif
 }
 
 /*
@@ -2001,11 +2040,15 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return returnCode;
 
+retry:
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fadvise(VfdCache[file].fd, offset, amount,
 			   POSIX_FADV_WILLNEED);
 	pgstat_report_wait_end();
 
+	if (returnCode == EINTR)
+		goto retry;
+
 	return returnCode;
 #else
 	Assert(FileIsValid(file));
@@ -2281,12 +2324,16 @@ FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
 	if (returnCode < 0)
 		return -1;
 
+retry:
+
 	pgstat_report_wait_start(wait_event_info);
 	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
 	pgstat_report_wait_end();
 
 	if (returnCode == 0)
 		return 0;
+	else if (returnCode == EINTR)
+		goto retry;
 
 	/* for compatibility with %m printing etc */
 	errno = returnCode;
@@ -2334,7 +2381,7 @@ FileTruncate(File file, off_t offset, uint32 wait_event_info)
 		return 

Re: base backup vs. concurrent truncation

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-21 09:42:57 -0400, Robert Haas wrote:
> Apologies if this has already been discussed someplace, but I couldn't
> find a previous discussion. It seems to me that base backups are
> broken in the face of a concurrent truncation that reduces the number
> of segments in a relation.

I think
https://www.postgresql.org/message-id/20230223010147.32oir7sb66slq...@awork3.anarazel.de
and the commits + discussions referenced from there is relevant for the topic.


> Suppose we have a relation that is 1.5GB in size, so that we have two
> files 23456, which is 1GB, and 23456.1, which is 0.5GB. We'll back
> those files up in whichever order the directory scan finds them.
> Suppose we back up 23456.1 first. Then the relation is truncated to
> 0.5GB, so 23456.1 is removed and 23456 gets a lot shorter. Next we
> back up the file 23456. Now our backup contains files 23456 and
> 23456.1, each 0.5GB. But this breaks the invariant in md.c:


>  *  On disk, a relation must consist of consecutively numbered segment
>  *  files in the pattern
>  *      -- Zero or more full segments of exactly RELSEG_SIZE blocks each
>  *      -- Exactly one partial segment of size 0 <= size < RELSEG_SIZE blocks
>  *      -- Optionally, any number of inactive segments of size 0 blocks.
> 
> basebackup.c's theory about relation truncation is that it doesn't
> really matter because WAL replay will fix things up. But in this case,
> I don't think it will, because WAL replay relies on the above
> invariant holding. As mdnblocks says:
> 
> /*
>  * If segment is exactly RELSEG_SIZE, advance to next one.
>  */
> segno++;
> 
> So I think what's going to happen is we're not going to notice 23456.1
> when we recover the backup. It will just sit there as an orphaned file
> forever, unless we extend 23456 back to a full 1GB, at which point we
> might abruptly start considering that file part of the relation again.

One important point is that I think 23456.1 at that point can contain
data, which can lead to deleted rows reappearing, vacuum failing due to
rows from before relfrozenxid existing, etc.


> Assuming I'm not wrong about all of this, the question arises: whose
> fault is this, and what to do about it? It seems to me that it's a bit
> hard to blame basebackup.c, because if you used pg_backup_start() and
> pg_backup_stop() and copied the directory yourself, you'd have exactly
> the same situation, and while we could (and perhaps should) teach
> basebackup.c to do something smarter, it doesn't seem realistic to
> impose complex constraints on the user's choice of file copy tool.

Agreed.


> So I think the problem is with md.c assuming that its invariant must
> hold on a cluster that's not guaranteed to be in a consistent state.
> But mdnblocks() clearly can't try to open every segment up to whatever
> the maximum theoretical possible segment number is every time it's
> invoked, because that would be wicked expensive. An idea that occurs
> to me is to remove all segment files following the first partial
> segment during startup, before we begin WAL replay. If that state
> occurs at startup, then either we have a scenario involving
> truncation, like those above, or a scenario involving relation
> extension, where we added a new segment and that made it to disk but
> the prior extension of the previous last segment file to maximum
> length did not. But in that case, WAL replay should, I think, fix
> things up. However, I'm not completely sure that there isn't some hole
> in this theory, and this way forward also doesn't sound particularly
> cheap. Nonetheless I don't have another idea right now.

What we've discussed somewhere in the past is to always truncate N+1 when
creating the first page in N. I.e. if we extend into 23456.1, we truncate
23456.2 to 0 blocks.  As far as I can tell, that'd solve this issue?

Greetings,

Andres Freund




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Jonathan S. Katz




On 4/24/23 6:14 PM, Andres Freund wrote:

Hi,

On 2023-04-24 10:52:15 +0530, Amit Kapila wrote:

On Sun, Apr 23, 2023 at 12:55 AM Jonathan S. Katz  wrote:

I wonder if it's
worth doing so for 16? It'd give a more complete picture that way. The
counter-argument I see is that we didn't track the time for it in existing
stats either, and that nobody complained - but I suspect that's mostly because
nobody knew to look.


[RMT hat]

(sorry for slow reply on this, I've been out for a few days).

It does sound generally helpful to track writeback to ensure anyone
building around pg_stat_io can see the more granular picture. How big
of an effort is this?



Right, I think this is the key factor to decide whether we can get
this in PG16 or not. If this is just adding a new column and a few
existing stats update calls then it should be okay to get in but if
this requires some more complex work then we can probably update the
docs.


I suspect it should really just be adding a few stats calls. The only possible
complication that I can see is that we might need to pass a bit more context
down in a place or two.


OK. So far it sounds reasonable to include. I think we should add this
as an open item. I don't know if we need to set a deadline just yet, but
we should try to make the go/no-go decision earlier in the beta cycle.


Thanks,

Jonathan




Re: Request for comment on setting binary format output per session

2023-04-24 Thread Merlin Moncure
On Thu, Apr 20, 2023 at 2:52 PM Dave Cramer  wrote:

>
> As promised here is a patch with defines for all of the protocol messages.
>
> I created a protocol.h file and put it in src/include.
> I'm fairly sure that some of the names I used may need to be changed but
> the grunt work of finding and replacing everything is done.
>

In many cases, converting the inline character to a macro eliminates the
need for the inline comment, e.g.:
+ case SIMPLE_QUERY: /* simple query */

...that's more work obviously; do you agree, and if so, would you like
some help going through that?

merlin



Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Melanie Plageman
On Mon, Apr 24, 2023 at 03:56:54PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-04-24 18:36:24 -0400, Melanie Plageman wrote:
> > On Mon, Apr 24, 2023 at 6:13 PM Andres Freund  wrote:
> > > > Also, it seems like this (given the current code) is only reachable for
> > > > permanent relations (i.e. not for IO object temp relation). If other
> > > > backend types than checkpointer may call smgrwriteback(), we likely
> > > > have to consider the IO context.
> > >
> > > I think we should take it into account - it'd e.g. be interesting to see
> > > that a COPY is bottlenecked on smgrwriteback() rather than just writing
> > > the data.
> > >
> > 
> > With the quick and dirty attached patch and using your example but with a
> > pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
> > a difference.
> 
> Quite a difference between what?

With and without the patch. Meaning: clearly tracking writeback is a good idea.

> 
> What scale of pgbench did you use?

1000, as you did

> 
> -T200 is likely not a good idea, because a timed checkpoint might "interfere",
> unless you use a non-default checkpoint_timeout. A timed checkpoint won't show
> the issue as easily, because checkpointer spends most of the time sleeping.

Ah, I see. I did not use a non-default checkpoint timeout.

> > Patch needs cleanup/comments and a bit more work, but I could do with
> > a sanity check review on the approach.
> 
> I was thinking we'd track writeback separately from the write, rather than
> attributing the writeback to "write".  Otherwise it looks good, based on a
> quick skim.

Like you want a separate IOOp IOOP_WRITEBACK? Interesting. Okay.

- Melanie




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 18:36:24 -0400, Melanie Plageman wrote:
> On Mon, Apr 24, 2023 at 6:13 PM Andres Freund  wrote:
> > > Also, it seems like this (given the current code) is only reachable for
> > > permanent relations (i.e. not for IO object temp relation). If other
> > > backend types than checkpointer may call smgrwriteback(), we likely
> > > have to consider the IO context.
> >
> > I think we should take it into account - it'd e.g. be interesting to see
> > that a COPY is bottlenecked on smgrwriteback() rather than just writing
> > the data.
> >
> 
> With the quick and dirty attached patch and using your example but with a
> pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
> a difference.

Quite a difference between what?

What scale of pgbench did you use?

-T200 is likely not a good idea, because a timed checkpoint might "interfere",
unless you use a non-default checkpoint_timeout. A timed checkpoint won't show
the issue as easily, because checkpointer spends most of the time sleeping.


> This is with a stats reset before the checkpoint.
> 
> unpatched:
> 
>    backend_type    |  object  | context | writes | write_time | fsyncs | fsync_time
> -------------------+----------+---------+--------+------------+--------+------------
>  background writer | relation | normal  |    443 |      1.408 |      0 |          0
>  checkpointer      | relation | normal  | 187804 |    396.829 |     47 |    254.226
> 
> patched:
> 
>    backend_type    |  object  | context | writes | write_time | fsyncs | fsync_time
> -------------------+----------+---------+--------+------------+--------+------------
>  background writer | relation | normal  |    917 |     4.4675 |      0 |          0
>  checkpointer      | relation | normal  | 375798 |    977.354 |     48 |    202.514
> 
> I did compare client backend stats before and after pgbench and it made
> basically no difference. I'll do a COPY example like you mentioned.


> Patch needs cleanup/comments and a bit more work, but I could do with
> a sanity check review on the approach.

I was thinking we'd track writeback separately from the write, rather than
attributing the writeback to "write".  Otherwise it looks good, based on a
quick skim.

Greetings,

Andres Freund




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Melanie Plageman
On Mon, Apr 24, 2023 at 6:13 PM Andres Freund  wrote:

> Hi,
>
> On 2023-04-24 17:37:48 -0400, Melanie Plageman wrote:
> > On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:
> > > It starts blocking once "enough" IO is in flight. For things like an
> > > immediate checkpoint, that can happen fairly quickly, unless you have
> > > a very fast IO subsystem. So often it'll not matter whether we track
> > > smgrwriteback(), but when it matters, it can matter a lot.
> >
> > I see. So, it sounds like this is most likely to happen for checkpointer
> > and not likely to happen for other backends who call
> > ScheduleBufferTagForWriteback().
>
> It's more likely, but once the IO subsystem is busy, it'll also happen for
> other users of ScheduleBufferTagForWriteback().
>
>
> > Also, it seems like this (given the current code) is only reachable for
> > permanent relations (i.e. not for IO object temp relation). If other
> > backend types than checkpointer may call smgrwriteback(), we likely
> > have to consider the IO context.
>
> I think we should take it into account - it'd e.g. be interesting to see
> that a COPY is bottlenecked on smgrwriteback() rather than just writing
> the data.
>

With the quick and dirty attached patch and using your example but with a
pgbench -T200 on my rather fast local NVMe SSD, you can still see quite
a difference.
This is with a stats reset before the checkpoint.

unpatched:

    backend_type    |  object  | context | writes | write_time | fsyncs | fsync_time
--------------------+----------+---------+--------+------------+--------+------------
 background writer  | relation | normal  |    443 |      1.408 |      0 |          0
 checkpointer       | relation | normal  | 187804 |    396.829 |     47 |    254.226

patched:

    backend_type    |  object  | context | writes | write_time | fsyncs | fsync_time
--------------------+----------+---------+--------+------------+--------+------------
 background writer  | relation | normal  |    917 |     4.4675 |      0 |          0
 checkpointer       | relation | normal  | 375798 |    977.354 |     48 |    202.514

I did compare client backend stats before and after pgbench and it made
basically no difference. I'll do a COPY example like you mentioned.

Patch needs cleanup/comments and a bit more work, but I could do with
a sanity check review on the approach.

- Melanie
From 6c5661ee5f0efbda9d246184f40cd799c21b5d2a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v1] Add writeback to pg_stat_io writes.

---
 src/backend/postmaster/bgwriter.c |  4 ++--
 src/backend/storage/buffer/buf_init.c |  2 +-
 src/backend/storage/buffer/bufmgr.c   | 12 ++--
 src/include/storage/buf_internals.h   |  4 +++-
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..4c3540d6e1 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,7 @@ BackgroundWriterMain(void)
 			 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +185,7 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..1d80532f3b 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -146,7 +146,7 @@ InitBufferPool(void)
 	StrategyInitialize(!foundDescs);
 
 	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
-						 &backend_flush_after);
+	WritebackContextInit(&BackendWritebackContext, IOCONTEXT_NORMAL,
+						 &backend_flush_after);
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..774dbadd08 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,7 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2556,7 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;	/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	WritebackContextInit(&wb_context, IOCONTEXT_NORMAL, &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5434,11 @@ 

Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 10:53:35 +0200, Christoph Berg wrote:
> I'm often seeing PG16 builds erroring out in the pgbench tests:

Interesting!


> I don't think the disk is full since it's always hitting that same
> spot, on some of the builds:

Yea, the EINTR pretty clearly indicates that it's not really out-of-space.


> https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/
> 
> This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
> that test works though, and the FS seems to support posix_fallocate:

I guess it requires a bunch of memory (?) pressure for this to happen
(triggering blocking during fallocate, opening the window for a signal to
arrive), which likely only happens when running things concurrently.


We obviously can add a retry loop to FileFallocate(), similar to what's
already present e.g. in FileRead(). But I wonder if we shouldn't go a bit
further, and do it for all the fd.c routines where it's remotely plausible
EINTR could be returned? It's a bit silly to add EINTR retries one-by-one to
the functions.


The following are documented to potentially return EINTR, without fd.c having
code to retry:

- FileWriteback() / pg_flush_data()
- FileSync() / pg_fsync()
- FileFallocate()
- FileTruncate()

With the first two there's the added complication that it's not entirely
obvious whether it'd be better to handle this in File* or pg_*. I'd argue the
latter is a bit more sensible?

Greetings,

Andres Freund




Re: Remove io prefix from pg_stat_io columns

2023-04-24 Thread Andres Freund
On 2023-04-21 07:38:01 +0900, Michael Paquier wrote:
> On Thu, Apr 20, 2023 at 11:38:42AM +0900, Michael Paquier wrote:
> > No worries, committers should take care of that.
> 
> Done as of 0ecb87e, as I can keep an eye on the buildfarm today, with
> a catversion bump.

Thanks!




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 10:52:15 +0530, Amit Kapila wrote:
> On Sun, Apr 23, 2023 at 12:55 AM Jonathan S. Katz  
> wrote:
> > > I wonder if it's
> > > worth doing so for 16? It'd give a more complete picture that way. The
> > > counter-argument I see is that we didn't track the time for it in existing
> > > stats either, and that nobody complained - but I suspect that's mostly 
> > > because
> > > nobody knew to look.
> >
> > [RMT hat]
> >
> > (sorry for slow reply on this, I've been out for a few days).
> >
> > It does sound generally helpful to track writeback to ensure anyone
> > building around pg_stat_io can see the more granular picture. How big
> > of an effort is this?
> >
> 
> Right, I think this is the key factor to decide whether we can get
> this in PG16 or not. If this is just adding a new column and a few
> existing stats update calls then it should be okay to get in but if
> this requires some more complex work then we can probably update the
> docs.

I suspect it should really just be adding a few stats calls. The only possible
complication that I can see is that we might need to pass a bit more context
down in a place or two.

Greetings,

Andres Freund




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 17:37:48 -0400, Melanie Plageman wrote:
> On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:
> > It starts blocking once "enough" IO is in flight. For things like an 
> > immediate
> > checkpoint, that can happen fairly quickly, unless you have a very fast IO
> > subsystem. So often it'll not matter whether we track smgrwriteback(), but
> > when it matters, it can matter a lot.
> 
> I see. So, it sounds like this is most likely to happen for checkpointer
> and not likely to happen for other backends who call
> ScheduleBufferTagForWriteback().

It's more likely, but once the IO subsystem is busy, it'll also happen for
other users of ScheduleBufferTagForWriteback().


> Also, it seems like this (given the current code) is only reachable for
> permanent relations (i.e. not for IO object temp relation). If other backend
> types than checkpointer may call smgrwriteback(), we likely have to consider
> the IO context.

I think we should take it into account - it'd e.g. be interesting to see
that a COPY is bottlenecked on smgrwriteback() rather than just writing the data.


> I would imagine that we want to smgrwriteback() timing to writes/write time
> for the relevant IO context and backend type.

Yes.

Greetings,

Andres Freund




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Melanie Plageman
On Mon, Apr 24, 2023 at 02:14:32PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:
> > On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > I noticed that the numbers in pg_stat_io don't quite add up to what I
> > > expected in write heavy workloads. Particularly for checkpointer, the 
> > > numbers
> > > for "write" in log_checkpoints output are larger than what is visible in
> > > pg_stat_io.
> > > 
> > > That partially is because log_checkpoints' "write" covers way too many 
> > > things,
> > > but there's an issue with pg_stat_io as well:
> > > 
> > > Checkpoints, and some other sources of writes, will often end up doing a 
> > > lot
> > > of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> > > pre-existing forms of IO statistics.
> > > 
> > > It seems pretty clear that we should track writeback as well. I wonder if 
> > > it's
> > > worth doing so for 16? It'd give a more complete picture that way. The
> > > counter-argument I see is that we didn't track the time for it in existing
> > > stats either, and that nobody complained - but I suspect that's mostly 
> > > because
> > > nobody knew to look.
> > 
> > Not complaining about making pg_stat_io more accurate, but what exactly
> > would we be tracking for smgrwriteback()? I assume you are talking about
> > IO timing. AFAICT, on Linux, it does sync_file_range() with
> > SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
> > tracking the system call overhead time?
> 
> It starts blocking once "enough" IO is in flight. For things like an immediate
> checkpoint, that can happen fairly quickly, unless you have a very fast IO
> subsystem. So often it'll not matter whether we track smgrwriteback(), but
when it matters, it can matter a lot.

I see. So, it sounds like this is most likely to happen for checkpointer
and not likely to happen for other backends who call
ScheduleBufferTagForWriteback(). Also, it seems like this (given the
current code) is only reachable for permanent relations (i.e. not for IO
object temp relation). If other backend types than checkpointer may call
smgrwriteback(), we likely have to consider the IO context. I would
imagine that we want to add smgrwriteback() timing to writes/write time for
the relevant IO context and backend type.

- Melanie




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Tomas Vondra
On 4/24/23 17:36, Alvaro Herrera wrote:
> On 2023-Apr-23, Tomas Vondra wrote:
> 
>> here's an updated version of the patch, including a backport version. I
>> ended up making the code yet a bit closer to master by introducing
>> add_values_to_range(). The current pre-14 code has the loop adding data
>> to the BRIN tuple in two places, but with the "fixed" logic handling
>> NULLs and the empty_range flag the amount of duplicated code got too
>> high, so this seem reasonable.
> 
> In backbranches, the new field to BrinMemTuple needs to be at the end of
> the struct, to avoid ABI breakage.
> 

Good point.

> There's a comment in add_values_to_range with a typo "If the range was had".
> Also, "So we should not see empty range that was not modified" should
> perhaps be "should not see an empty range".
> 

OK, I'll check the wording of the comments.

> (As for your FIXME comment in brin_memtuple_initialize, I think you're
> correct: we definitely need to reset bt_placeholder.  Otherwise, we risk
> places that call it in a loop using a BrinMemTuple with one range with
> the flag set, in a range where that doesn't hold.  It might be
> impossible for this to happen, given how narrow the conditions are on
> which bt_placeholder is used; but it seems safer to reset it anyway.)
> 

Yeah. But isn't that a separate preexisting issue, strictly speaking?

> Some pgindent noise would be induced by this patch.  I think it's worth
> cleaning up ahead of time.
> 

True. Will do.

> I did a quick experiment of turning the patches over -- applying test
> first, code fix after (requires some conflict fixing).  With that I
> verified that the behavior of 'hasnulls' indeed changes with the code
> fix.
> 

Thanks. Could you do some testing of the union_tuples stuff too? It's a
bit tricky - the behavior is timing sensitive, so testing it requires
gdb breakpoints or something like that. I think it's correct, but it'd
be nice to check that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 16:39:36 -0400, Melanie Plageman wrote:
> On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > I noticed that the numbers in pg_stat_io don't quite add up to what I
> > expected in write heavy workloads. Particularly for checkpointer, the 
> > numbers
> > for "write" in log_checkpoints output are larger than what is visible in
> > pg_stat_io.
> > 
> > That partially is because log_checkpoints' "write" covers way too many 
> > things,
> > but there's an issue with pg_stat_io as well:
> > 
> > Checkpoints, and some other sources of writes, will often end up doing a lot
> > of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> > pre-existing forms of IO statistics.
> > 
> > It seems pretty clear that we should track writeback as well. I wonder if 
> > it's
> > worth doing so for 16? It'd give a more complete picture that way. The
> > counter-argument I see is that we didn't track the time for it in existing
> > stats either, and that nobody complained - but I suspect that's mostly 
> > because
> > nobody knew to look.
> 
> Not complaining about making pg_stat_io more accurate, but what exactly
> would we be tracking for smgrwriteback()? I assume you are talking about
> IO timing. AFAICT, on Linux, it does sync_file_range() with
> SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
> tracking the system call overhead time?

It starts blocking once "enough" IO is in flight. For things like an immediate
checkpoint, that can happen fairly quickly, unless you have a very fast IO
subsystem. So often it'll not matter whether we track smgrwriteback(), but
when it matters, it can matter a lot.

As an example, I init'ed a pgbench database with scale 1000, on a decent
but not great NVMe SSD, and created dirty data with:

  c=96; /srv/dev/build/m-opt/src/bin/pgbench/pgbench --random-seed=0 -n -M prepared -c$c -j$c -T30 -P1

and then measured the checkpoint:

  perf trace -s -p $pid_of_checkpointer psql -x -c "SELECT pg_stat_reset_shared('io')" -c "checkpoint"

 postgres (367444), 1891280 events, 100.0%

   syscall           calls  errors     total      min      avg       max   stddev
                                      (msec)   (msec)   (msec)    (msec)      (%)
   ---------------  ------  ------  --------  -------  -------  --------  -------
   sync_file_range  359176       0  4560.670    0.002    0.013   238.955   10.36%
   pwrite64         582964       0  2874.634    0.003    0.005     0.156    0.06%
   fsync               242       0   251.631    0.001    1.040    42.166   18.81%
   openat              317      65     2.171    0.002    0.007     0.068    5.69%
   rename               69       0     1.973    0.012    0.029     0.084    5.81%
   fdatasync             1       0     1.543    1.543    1.543     1.543    0.00%
   unlink              150     137     1.278    0.002    0.009     0.062   10.69%
   close               250       0     0.694    0.001    0.003     0.007    3.14%
   newfstatat          140      68     0.667    0.001    0.005     0.022    7.26%
   write                 5       0     0.067    0.005    0.013     0.025   24.55%
   lseek                14       0     0.050    0.001    0.004     0.018   33.87%
   getdents64            8       0     0.047    0.002    0.006     0.022   39.51%
   kill                  3       0     0.029    0.008    0.010     0.011   10.18%
   epoll_wait            2       1     0.006    0.000    0.003     0.006  100.00%
   read                  1       0     0.004    0.004    0.004     0.004    0.00%

Log output:

2023-04-24 14:11:59.234 PDT [367444][checkpointer][:0][] LOG:  checkpoint 
starting: immediate force wait
2023-04-24 14:12:09.236 PDT [367444][checkpointer][:0][] LOG:  checkpoint 
complete: wrote 595974 buffers (28.4%); 0 WAL file(s) added, 0 removed, 68 
recycled; write=9.740 s, sync=0.057 s, total=10.002 s; sync files=27, 
longest=0.043 s, average=0.003 s; distance=4467386 kB, estimate=4467386 kB; 
lsn=6/E5D33F98, redo lsn=6/E5D33F28


# SELECT writes, write_time, fsyncs, fsync_time FROM pg_stat_io WHERE 
backend_type = 'checkpointer';
┌────────┬────────────┬────────┬────────────┐
│ writes │ write_time │ fsyncs │ fsync_time │
├────────┼────────────┼────────┼────────────┤
│ 595914 │ 4002.17302 │     24 │     46.359 │
└────────┴────────────┴────────┴────────────┘


Greetings,

Andres Freund




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Tomas Vondra



On 4/24/23 23:05, Tomas Vondra wrote:
> 
> 
> On 4/24/23 17:46, Tom Lane wrote:
>> Alvaro Herrera  writes:
>>> On 2023-Apr-23, Tomas Vondra wrote:
 Both cases have a patch extending pageinspect to show the new flag, but
 obviously we should commit that only in master. In backbranches it's
 meant only to make testing easier.
>>
>>> In backbranches, I think it should be reasonably easy to add a
>>> --1.7--1.7.1.sql file and set the default version to 1.7.1; that then
>>> enables us to have the functionality (and the tests) in older branches
>>> too.  If you just add a --1.X.1--1.12.sql version to each branch that's
>>> identical to that branch's current pageinspect version upgrade script,
>>> that would let us have working upgrade paths for all branches.  This is
>>> a bit laborious but straightforward enough.
>>
>> "A bit laborious"?  That seems enormously out of proportion to the
>> benefit of putting this test case into back branches.  It will have
>> costs for end users too, not only us.  As an example, it would
>> unnecessarily block some upgrade paths, if the upgraded-to installation
>> is slightly older and lacks the necessary --1.X.1--1.12 script.
>>
> 
> Why would that block the upgrade? Presumably we'd add two upgrade
> scripts in the master, to allow upgrade both from 1.X and 1.X.1.
> 

D'oh! I just realized I misunderstood the issue. Yes, if the target
version is missing the new script, that won't work. I'm not sure how
likely that is - in my experience people refresh versions at the same
time - but it's certainly an assumption we shouldn't make, I guess.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Tom Lane
Tomas Vondra  writes:
> On 4/24/23 17:46, Tom Lane wrote:
>> "A bit laborious"?  That seems enormously out of proportion to the
>> benefit of putting this test case into back branches.  It will have
>> costs for end users too, not only us.  As an example, it would
>> unnecessarily block some upgrade paths, if the upgraded-to installation
>> is slightly older and lacks the necessary --1.X.1--1.12 script.

> Why would that block the upgrade? Presumably we'd add two upgrade
> scripts in the master, to allow upgrade both from 1.X and 1.X.1.

It would for example block updating from 14.8 or later to 15.2, since
a 15.2 installation would not have the script to update from 1.X.1.

Yeah, people could work around that by only installing the latest
version, but there are plenty of real-world scenarios where you'd be
creating friction, or at least confusion.  I do not think that this
test case is worth it.

regards, tom lane




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Tomas Vondra



On 4/24/23 17:46, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2023-Apr-23, Tomas Vondra wrote:
>>> Both cases have a patch extending pageinspect to show the new flag, but
>>> obviously we should commit that only in master. In backbranches it's
>>> meant only to make testing easier.
> 
>> In backbranches, I think it should be reasonably easy to add a
>> --1.7--1.7.1.sql file and set the default version to 1.7.1; that then
>> enables us to have the functionality (and the tests) in older branches
>> too.  If you just add a --1.X.1--1.12.sql version to each branch that's
>> identical to that branch's current pageinspect version upgrade script,
>> that would let us have working upgrade paths for all branches.  This is
>> a bit laborious but straightforward enough.
> 
> "A bit laborious"?  That seems enormously out of proportion to the
> benefit of putting this test case into back branches.  It will have
> costs for end users too, not only us.  As an example, it would
> unnecessarily block some upgrade paths, if the upgraded-to installation
> is slightly older and lacks the necessary --1.X.1--1.12 script.
> 

Why would that block the upgrade? Presumably we'd add two upgrade
scripts in the master, to allow upgrade both from 1.X and 1.X.1.

>> If you don't feel like adding that, I volunteer to add the necessary
>> scripts to all branches after you commit the bugfix, and ensure that all
>> upgrade paths work correctly.
> 
> I do not think this should happen at all, whether you're willing to
> do the work or not.

FWIW I'm fine with not doing that. As mentioned, I only included this
patch to make testing the patch easier (otherwise the flag is impossible
to inspect directly).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2023-04-24 Thread Melanie Plageman
On Mon, Apr 10, 2023 at 3:41 AM Pavel Luzanov  wrote:
>
> On 05.04.2023 03:41, Melanie Plageman wrote:
> > On Tue, Apr 4, 2023 at 4:35 PM Pavel Luzanov  
> > wrote:
> >
> >> After a little thought... I'm not sure about the term 'bootstrap
> >> process'. I can't find this term in the documentation.
> > There are various mentions of "bootstrap" peppered throughout the docs
> > but no concise summary of what it is. For example, initdb docs mention
> > the "bootstrap backend" [1].
> >
> > Interestingly, 910cab820d0 added "Bootstrap superuser" in November. This
> > doesn't really cover what bootstrapping is itself, but I wonder if that
> > is useful? If so, you could propose a glossary entry for it?
> > (preferably in a new thread)
>
> I'm not sure if this is the reason for adding a new entry in the glossary.
>
> >> Do I understand correctly that this is a postmaster? If so, then the
> >> postmaster process is not shown in pg_stat_activity.
> > No, bootstrap process is for initializing the template database. You
> > will not be able to see pg_stat_activity when it is running.
>
> Oh, it's clear to me now. Thank you for the explanation.
>
> > You can query pg_stat_activity from single user mode, so it is relevant
> > to pg_stat_activity also. I take your point that bootstrap mode isn't
> > relevant for pg_stat_activity, but I am hesitant to add that distinction
> > to the pg_stat_io docs since the reason you won't see it in
> > pg_stat_activity is because it is ephemeral and before a user can access
> > the database and not because stats are not tracked for it.
> >
> > Can you think of a way to convey this?
>
> See my attempt attached.
> I'm not sure about the wording. But I think we can avoid the term
> 'bootstrap process'
> by replacing it with "database cluster initialization", which should be
> clear to everyone.

I like that idea.

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f33a1c56c..45e20efbfb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -991,6 +991,9 @@ postgres   27093  0.0  0.0  30096  2752 ?
Ss   11:34   0:00 postgres: ser
archiver,
startup, walreceiver,
walsender and walwriter.
+   The special type standalone backend is used

I think referring to it as a "special type" is a bit confusing. I think
you can just start the sentence with "standalone backend". You could
even include it in the main list of backend_types since it is possible
to see it in pg_stat_activity when in single user mode.

+   when initializing a database cluster by initdb
+   and when running in single-user mode.
In addition, background workers registered by extensions may have
additional types.
   

I like the rest of this.

I copied the committer who most recently touched pg_stat_io (Michael
Paquier) to see if we could get someone interested in committing this
docs update.

- Melanie




Re: pg_stat_io not tracking smgrwriteback() is confusing

2023-04-24 Thread Melanie Plageman
On Wed, Apr 19, 2023 at 10:23:26AM -0700, Andres Freund wrote:
> Hi,
> 
> I noticed that the numbers in pg_stat_io don't quite add up to what I
> expected in write heavy workloads. Particularly for checkpointer, the numbers
> for "write" in log_checkpoints output are larger than what is visible in
> pg_stat_io.
> 
> That partially is because log_checkpoints' "write" covers way too many things,
> but there's an issue with pg_stat_io as well:
> 
> Checkpoints, and some other sources of writes, will often end up doing a lot
> of smgrwriteback() calls - which pg_stat_io doesn't track. Nor do any
> pre-existing forms of IO statistics.
> 
> It seems pretty clear that we should track writeback as well. I wonder if it's
> worth doing so for 16? It'd give a more complete picture that way. The
> counter-argument I see is that we didn't track the time for it in existing
> stats either, and that nobody complained - but I suspect that's mostly because
> nobody knew to look.

Not complaining about making pg_stat_io more accurate, but what exactly
would we be tracking for smgrwriteback()? I assume you are talking about
IO timing. AFAICT, on Linux, it does sync_file_range() with
SYNC_FILE_RANGE_WRITE, which is asynchronous. Wouldn't we just be
tracking the system call overhead time?
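
To make the question concrete, here is roughly what such timing would
capture (a plain C sketch with illustrative names, not PostgreSQL code):

#define _GNU_SOURCE             /* for sync_file_range() */
#include <fcntl.h>
#include <time.h>

/* Time one asynchronous writeback request; returns elapsed seconds. */
static double
timed_writeback(int fd, off_t offset, off_t nbytes)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) +
        (end.tv_nsec - start.tv_nsec) / 1e9;
}

Presumably the measured interval is mostly submission overhead, unless the
kernel's writeback queue is congested and the call blocks.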

- Melanie




Re: Should we remove vacuum_defer_cleanup_age?

2023-04-24 Thread Andres Freund
Hi,

On 2023-04-24 14:36:36 +0200, Alvaro Herrera wrote:
> On 2023-Apr-22, Andres Freund wrote:
> > I'm afraid we'll need TransactionIdRetreatSafely() again, when we convert 
> > more
> > things to 64bit xids (lest they end up with the same bug as fixed by
> > be504a3e974), so it's perhaps worth thinking about how to make it less
> > confusing.
> 
> The one thing that IMO makes it less confusing is to have it return the
> value rather than modifying it in place.

Partially I made it that way because you otherwise end up repeating long
variable names multiple times within one statement, yielding long repetitive
lines...  Not sure that's a good enough reason, but ...



> > >
> > > Replication slots overcome these disadvantages by retaining only the 
> > > number
> > > of segments known to be needed.
> > > On the other hand, replication slots can retain so
> > > many WAL segments that they fill up the space allocated
> > > for pg_wal;
> > >  limits the size of WAL 
> > > files
> > > retained by replication slots.
> > >
> > 
> > It seems a bit confusing now, because "by retaining only the number of
> > segments ..." now also should cover hs_feedback (due to merging), but 
> > doesn't.
> 
> Hah, ok.
> 

> > I think I'll push the version I had. Then we can separately word-smith the
> > section? Unless somebody protests I'm gonna do that soon.
> 
> No objection.

Cool. Pushed now.




RE: Should we remove vacuum_defer_cleanup_age?

2023-04-24 Thread Phil Florent


Hi,
Not very convenient, but if autovacuum is enabled, isn't vacuum_defer_cleanup_age
the way to make extensions like pg_dirtyread more effective for temporal
queries, to quickly correct human DML mistakes without the need for a complete
restore, even if no standby is in use? vacuum_defer_cleanup_age + pg_dirtyread
give PostgreSQL something like "flashback query" in Oracle.
Best regards,
Phil


De : Andres Freund 
Envoyé : dimanche 23 avril 2023 00:47
À : Alvaro Herrera 
Cc : Justin Pryzby ; pgsql-hack...@postgresql.org 
; Amit Kapila 
Objet : Re: Should we remove vacuum_defer_cleanup_age?

Hi,

On 2023-04-13 13:18:38 +0200, Alvaro Herrera wrote:
> On 2023-Apr-11, Andres Freund wrote:
>
> > Updated patch attached. I think we should either apply something like that
> > > patch, or at least add a warning to the docs.
>
> I gave this patch a look.  The only code change is that
> ComputeXidHorizons() and GetSnapshotData() no longer handle the case
> where vacuum_defer_cleanup_age is different from zero.  It looks good.
> The TransactionIdRetreatSafely() code being removed is pretty weird (I
> spent a good dozen minutes writing a complaint that your rewrite was
> faulty, but it turns out I had misunderstood the function), so I'm glad
> it's being retired.

My rewrite of what? The creation of TransactionIdRetreatSafely() in
be504a3e974?

I'm afraid we'll need TransactionIdRetreatSafely() again, when we convert more
things to 64bit xids (lest they end up with the same bug as fixed by
be504a3e974), so it's perhaps worth thinking about how to make it less
confusing.


> > 
> > -Similarly, hot_standby_feedback
> > -and vacuum_defer_cleanup_age provide protection 
> > against
> > -relevant rows being removed by vacuum, but the former provides no
> > -protection during any time period when the standby is not connected,
> > -and the latter often needs to be set to a high value to provide 
> > adequate
> > -protection.  Replication slots overcome these disadvantages.
> > +Similarly, hot_standby_feedback on its own, 
> > without
> > +also using a replication slot, provides protection against relevant 
> > rows
> > +being removed by vacuum, but provides no protection during any time 
> > period
> > +when the standby is not connected.  Replication slots overcome these
> > +disadvantages.
>
> I think it made sense to have this paragraph be separate from the
> previous one when it was talking about two separate variables, but now
> that it's just one, it looks a bit isolated.  I would merge it with the
> one above, which is talking about pretty much the same thing, and
> reorder the whole thing approximately like this
>
>
> In lieu of using replication slots, it is possible to prevent the removal
> of old WAL segments using wal_keep_size, or by
> storing the segments in an archive using
> archive_command or archive_library.
> However, these methods often result in retaining more WAL segments than
> required.
> Similarly, hot_standby_feedback without
> a replication slot provides protection against relevant rows
> being removed by vacuum, but provides no protection during any time period
> when the standby is not connected.
>
>
> Replication slots overcome these disadvantages by retaining only the 
> number
> of segments known to be needed.
> On the other hand, replication slots can retain so
> many WAL segments that they fill up the space allocated
> for pg_wal;
> max_slot_wal_keep_size limits the size of WAL files
> retained by replication slots.
>

It seems a bit confusing now, because "by retaining only the number of
segments ..." now also should cover hs_feedback (due to merging), but doesn't.


> Though the "However," looks a poor fit; I would do this:

I agree, I don't like the however.


I think I'll push the version I had. Then we can separately word-smith the
section? Unless somebody protests I'm gonna do that soon.

Greetings,

Andres Freund




RE: [PATCH] Support % wildcard in extension upgrade filenames

2023-04-24 Thread Regina Obe
> This change also makes it easier for extensions that use versioned .so files 
> (by that I mean uses extension-.so rather than extension.so). 
> Because such extensions can't really use the chaining property of the 
> existing upgrade system and so need to write a direct X--Y.sql migration file 
> for 
> every prior version X. I know the system wasn't designed for this, but in 
> reality a lot of extensions do this. Especially the more complex ones.

> Hopefully this is helpful,
> Mat

Thanks for the feedback.  Yes, this idea of versioned .so files is also one of
the reasons why we always have to run a replace on all functions.

For example, on the PostGIS side we have the option of a full minor version (so
the lib could be postgis-3.3.so or postgis-3.so).

For development I always include the minor version, and the Windows interim
builds I build against our master branch have the minor version too.
But the idea is that someone can upgrade to a stable version, so the script
always needs to have the .so version in it to account for this shifting of
options.

Thanks,
Regina  





Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-04-24 Thread Yurii Rashkovskii
You're absolutely right. Here's v3.


On Mon, Apr 24, 2023 at 6:30 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi,
>
> > This is a very good catch and a suggestion. I've slightly reworked the
> patch, and I also made this static assertion to have less indirection and,
> therefore, make it easier to understand the premise.
> > Version 2 is attached.
>
> ```
> sizeof((BackgroundWorker *)NULL)->bgw_library_name
> ```
>
> I'm pretty confident something went wrong with the parentheses in v2.
>
>
> --
> Best regards,
> Aleksander Alekseev
>


-- 
Y.


v3-0001-Extend-the-length-of-BackgroundWorker.bgw_library_na.patch
Description: Binary data


Re: [PATCH] Support % wildcard in extension upgrade filenames

2023-04-24 Thread Mat Arye
Hi All,

I've done upgrade maintenance for multiple extensions now (TimescaleDB
core, Promscale) and I think the original suggestion (wildcard filenames
with control-file switch to enable) here is a good one.

Having one big upgrade file, at its core, enables you to move the
logic for "what parts of the script to run for an upgrade from version V1
to V2" from a script to generate .sql files to Plpgsql in the form of
something like:

DO $$
BEGIN
  IF <guard: has this step already run?> THEN

    <non-idempotent upgrade steps>

  END IF;
END
$$;

Where the guard can check postgres catalogs, internal extension catalogs,
or anything else the extension wants. What we have done in the past is
record executions of non-idempotent migration scripts in an
extension catalog table and simply check if that row exists in the guard.
This allows for much easier management of backporting patches, for
instance. Having the full power of Plpgsql makes all of this much easier (and
more flexible) than a build script that compiles various V1--V2.sql
upgrade files.

When we did this in Promscale, we also added V1--V2.sql upgrade
files as symlinks to "the one big file". Not the end of the world, but I
think a patch like the one suggested in this thread will make things a lot
nicer.

As for Tom's concern about downgrades, I think it's valid but it's a case
that is easy to test for in Plpgsql and either handle or error. For
example, we use semver so testing for a downgrade at the top of the upgrade
script is trivial. I think this concern is a good reason to have this
functionality enabled in the control file. That way, this issue can be
documented and then only extensions that test for valid upgrades in the
script can enable this. Such an "opt-in" prevents people who haven't
thought of the need to validate the version changes from using this by
accident.

I'll add two more related points:
1) When the extension framework was first implemented I don't believe DO
blocks existed, so handling upgrade logic in the script itself just wasn't
possible. That's why I think rethinking some of the design now makes sense.
2) This change also makes it easier for extensions that use versioned .so
files (by that I mean uses extension-.so rather than
extension.so). Because such extensions can't really use the chaining
property of the existing upgrade system and so need to write a
direct X--Y.sql migration file for every prior version X. I know the system
wasn't designed for this, but in reality a lot of extensions do this.
Especially the more complex ones.

Hopefully this is helpful,
Mat


Re: Memory leak in CachememoryContext

2023-04-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Apr-24, Tom Lane wrote:
>> I guess we could have the back branches continue to create a
>> shared_cast_context and just not use it in core.  Seems rather
>> expensive for a very hypothetical compatibility measure, though.

> I think a session-long memory leak is not so bad, compared to a possible
> crash.  However, after looking at the code again, as well as pldebugger
> and plpgsql_check, I agree that there's no point in doing anything other
> than keeping the field there.

Yeah, I can't see any plausible reason for outside code to be using
that field (and I don't see any evidence in Debian Code Search that
anyone is).  I'll push it like this then.  Thanks for looking!

regards, tom lane




Re: Move un-parenthesized syntax docs to "compatibility" for few SQL commands

2023-04-24 Thread Nathan Bossart
On Sat, Apr 22, 2023 at 10:38:58AM -0400, Melanie Plageman wrote:
> If you are planning to wait and commit the change to support CLUSTER
> (VERBOSE) until July, then you can consolidate the two and say before
> v17. If you plan to commit it before then (making it available in v16),
> I would move CLUSTER [VERBOSE] down to Compatibility and say that both
> of the un-parenthesized syntaxes were used before v16. Since all of the
> syntaxes still work, I think it is probably more confusing to split them
> apart so granularly. The parenthesized syntax was effectively not
> "feature complete" without your patch to support CLUSTER (VERBOSE).

I think this can wait for v17, but if there's a strong argument for doing
some of this sooner, we can reevaluate.

> Also, isn't this:
>   CLUSTER [VERBOSE] [ table_name [ USING index_name ] ]
> inclusive of this:
>   CLUSTER [VERBOSE]
> So, it would have been correct for them to be consolidated in the
> existing documentation?

Yes.  This appears to go pretty far back.  I traced it to 8bc717c (2002).
It looks like the VACUUM syntaxes have been consolidated since 37d2f76
(1998).  So AFAICT this small inconsistency has been around for a while.

>> The documentation for the CLUSTER syntax has also been consolidated
>> in anticipation of a follow-up change that will move the
>> unparenthesized syntax to the "Compatibility" section.
> 
> I suppose we should decide if unparenthesized is a word or if we are
> sticking with the hyphen.

The existing uses in the docs all omit the hyphen, but I see it both ways in
some web searches.  Other than keeping the Postgres docs consistent, I
don't have a terribly strong opinion here.

>> +| CLUSTER opt_verbose
>> +{
>> +ClusterStmt *n = makeNode(ClusterStmt);
>>  
>> +n->relation = NULL;
>> +n->indexname = NULL;
>> +n->params = NIL;
>> +if ($2)
>> +n->params = lappend(n->params, 
>> makeDefElem("verbose", NULL, @2));
>> +$$ = (Node *) n;
>> +}
> 
> Maybe it is worth moving the un-parenthesized options all to the end and
> specifying what version they were needed for.

Good idea.

>>  | CLUSTER '(' utility_option_list ')' qualified_name 
>> cluster_index_specification
>>  {
>>  ClusterStmt *n = makeNode(ClusterStmt);
>> @@ -11603,15 +11612,13 @@ ClusterStmt:
>>  n->params = $3;
>>  $$ = (Node *) n;
>>  }
>> -| CLUSTER opt_verbose
>> +| CLUSTER '(' utility_option_list ')'
> 
> It is too bad we can't do this the way VACUUM and ANALYZE do -- but
> since qualified_name is required if USING is included, I suppose we
> can't.

It might be possible to extract the name and index part to a separate
optional rule.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-04-24 Thread Aleksander Alekseev
Hi,

> This is a very good catch and a suggestion. I've slightly reworked the patch, 
> and I also made this static assertion to have less indirection and, 
> therefore, make it easier to understand the premise.
> Version 2 is attached.

```
sizeof((BackgroundWorker *)NULL)->bgw_library_name
```

I'm pretty confident something went wrong with the parentheses in v2.
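
For reference, a corrected sketch of what the assertion is presumably after
(the exact form in the patch may differ; StaticAssertDecl and MAXPGPATH are
existing PostgreSQL facilities, the message text here is made up):

StaticAssertDecl(sizeof(((BackgroundWorker *) NULL)->bgw_library_name) >= MAXPGPATH,
                 "bgw_library_name must be able to hold MAXPGPATH bytes");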


-- 
Best regards,
Aleksander Alekseev




Re: Correct the documentation for work_mem

2023-04-24 Thread Imseih (AWS), Sami
Based on the feedback, here is a v1 of the suggested doc changes.

I modified Gurjeet's suggestion slightly to make it clear that a specific
query execution could have several operations simultaneously using up to
work_mem each.

I also added the small hash table memory limit clarification.


Regards,

Sami Imseih
Amazon Web Services (AWS)







v1-0001-Fix-documentation-for-work_mem.patch
Description: v1-0001-Fix-documentation-for-work_mem.patch


Re: Should we remove vacuum_defer_cleanup_age?

2023-04-24 Thread Robert Haas
On Mon, Apr 24, 2023 at 8:36 AM Alvaro Herrera  wrote:
> The one thing that IMO makes it less confusing is to have it return the
> value rather than modifying it in place.

Yeah, I don't understand why we have these functions that modify the
value in place. Those are probably convenient here and there, but
overall they seem to make things more confusing.
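
To make the stylistic point concrete, the two shapes are (signatures
paraphrased, not the exact ones in the tree):

/* modify-in-place: the caller's variable changes underneath it */
void        TransactionIdRetreatSafely(TransactionId *xid, int retreat_by,
                                       FullTransactionId rel);

/*
 * value-returning: the data flow is visible at the call site, e.g.
 * horizon = TransactionIdRetreatedBy(horizon, offset, rel);
 */
TransactionId TransactionIdRetreatedBy(TransactionId xid, int retreat_by,
                                       FullTransactionId rel);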

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Memory leak in CachememoryContext

2023-04-24 Thread Alvaro Herrera
On 2023-Apr-24, Tom Lane wrote:

> I wrote:
> > Alvaro Herrera  writes:
> >> Hmm, we can leave it unused in our code, but it still needs to be
> >> initialized to some valid memory context anyway; otherwise hypothetical
> >> code that uses it would still crash.
> 
> > I think we want that to happen, actually, because it's impossible
> > to guess what such hypothetical code needs the context to be.
> 
> I guess we could have the back branches continue to create a
> shared_cast_context and just not use it in core.  Seems rather
> expensive for a very hypothetical compatibility measure, though.

I think a session-long memory leak is not so bad, compared to a possible
crash.  However, after looking at the code again, as well as pldebugger
and plpgsql_check, I agree that there's no point in doing anything other
than keeping the field there.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Hay dos momentos en la vida de un hombre en los que no debería
especular: cuando puede permitírselo y cuando no puede" (Mark Twain)




Re: could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-04-24 Thread Melanie Plageman
On Mon, Apr 24, 2023 at 10:53:35AM +0200, Christoph Berg wrote:
> Re: Andres Freund
> > Add smgrzeroextend(), FileZero(), FileFallocate()
> 
> Hi,
> 
> I'm often seeing PG16 builds erroring out in the pgbench tests:
> 
> 00:33:12 make[2]: Entering directory '/<>/build/src/bin/pgbench'
> 00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf 
> '/<>/build/src/bin/pgbench'/tmp_check && /bin/mkdir -p 
> '/<>/build/src/bin/pgbench'/tmp_check && cd 
> /<>/build/../src/bin/pgbench && 
> TESTLOGDIR='/<>/build/src/bin/pgbench/tmp_check/log' 
> TESTDATADIR='/<>/build/src/bin/pgbench/tmp_check' 
> PATH="/<>/build/tmp_install/usr/lib/postgresql/16/bin:/<>/build/src/bin/pgbench:$PATH"
>  
> LD_LIBRARY_PATH="/<>/build/tmp_install/usr/lib/aarch64-linux-gnu"
>   PGPORT='65432' 
> top_builddir='/<>/build/src/bin/pgbench/../../..' 
> PG_REGRESS='/<>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'
>  /usr/bin/prove -I /<>/build/../src/test/perl/ -I 
> /<>/build/../src/bin/pgbench --verbose t/*.pl
> 00:33:12 # +++ tap check in src/bin/pgbench +++
> 00:33:14 #   Failed test 'concurrent OID generation status (got 2 vs expected 
> 0)'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #   Failed test 'concurrent OID generation stdout /(?^:processed: 
> 125/125)/'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #   'pgbench (16devel (Debian 
> 16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
> 00:33:14 # transaction type: 
> /<>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
> 00:33:14 # scaling factor: 1
> 00:33:14 # query mode: prepared
> 00:33:14 # number of clients: 5
> 00:33:14 # number of threads: 1
> 00:33:14 # maximum number of tries: 1
> 00:33:14 # number of transactions per client: 25
> 00:33:14 # number of transactions actually processed: 118/125
> 00:33:14 # number of failed transactions: 0 (0.000%)
> 00:33:14 # latency average = 26.470 ms
> 00:33:14 # initial connection time = 66.583 ms
> 00:33:14 # tps = 188.889760 (without initial connection time)
> 00:33:14 # '
> 00:33:14 # doesn't match '(?^:processed: 125/125)'
> 00:33:14 #   Failed test 'concurrent OID generation stderr /(?^:^$)/'
> 00:33:14 #   at t/001_pgbench_with_server.pl line 31.
> 00:33:14 #   'pgbench: error: client 2 script 0 aborted in 
> command 0 query 0: ERROR:  could not extend file "base/5/3501" with 
> FileFallocate(): Interrupted system call
> 00:33:14 # HINT:  Check free disk space.
> 00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
> 00:33:14 # '
> 00:33:14 # doesn't match '(?^:^$)'
> 00:33:26 # Looks like you failed 3 tests of 428.
> 00:33:26 t/001_pgbench_with_server.pl ..
> 00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)
> 
> I don't think the disk is full since it's always hitting that same
> spot, on some of the builds:
> 
> https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/
> 
> This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
> that test works though, and the FS seems to support posix_fallocate:
> 
> #include 
> #include 
> 
> int main ()
> {
> int f;
> int err;
> 
> if (!(f = open("moo", O_CREAT | O_RDWR, 0666)))
> perror("open");
> 
> err = posix_fallocate(f, 0, 10);
> perror("posix_fallocate");
> 
> return 0;
> }
> 
> $ ./a.out
> posix_fallocate: Success
> 
> The problem has been there for some weeks - I didn't report it earlier
> as I was on vacation, in the free time trying to bootstrap s390x
> support for apt.pg.o, and there was this other direct IO problem
> making all the builds fail for some time.

I noticed that dsm_impl_posix_resize() does a do while rc==EINTR and
FileFallocate() doesn't. From what the comment says in
dsm_impl_posix_resize() and some cursory googling, posix_fallocate()
doesn't restart automatically on most systems, so a do while() rc==EINTR
is often used. Is there a reason it isn't used in FileFallocate() I
wonder?
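
For reference, the retry pattern being discussed looks roughly like this (a
sketch, not the actual dsm_impl_posix_resize() or FileFallocate() code; note
that posix_fallocate() reports errors through its return value, not errno):

#include <errno.h>
#include <fcntl.h>

static int
fallocate_retry(int fd, off_t offset, off_t len)
{
    int         rc;

    /* posix_fallocate() is not restarted automatically on most systems */
    do
    {
        rc = posix_fallocate(fd, offset, len);
    } while (rc == EINTR);

    return rc;
}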

- Melanie




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Apr-23, Tomas Vondra wrote:
>> Both cases have a patch extending pageinspect to show the new flag, but
>> obviously we should commit that only in master. In backbranches it's
>> meant only to make testing easier.

> In backbranches, I think it should be reasonably easy to add a
> --1.7--1.7.1.sql file and set the default version to 1.7.1; that then
> enables us to have the functionality (and the tests) in older branches
> too.  If you just add a --1.X.1--1.12.sql version to each branch that's
> identical to that branch's current pageinspect version upgrade script,
> that would let us have working upgrade paths for all branches.  This is
> a bit laborious but straightforward enough.

"A bit laborious"?  That seems enormously out of proportion to the
benefit of putting this test case into back branches.  It will have
costs for end users too, not only us.  As an example, it would
unnecessarily block some upgrade paths, if the upgraded-to installation
is slightly older and lacks the necessary --1.X.1--1.12 script.

> If you don't feel like adding that, I volunteer to add the necessary
> scripts to all branches after you commit the bugfix, and ensure that all
> upgrade paths work correctly.

I do not think this should happen at all, whether you're willing to
do the work or not.

regards, tom lane




Re: base backup vs. concurrent truncation

2023-04-24 Thread Robert Haas
On Fri, Apr 21, 2023 at 5:19 PM Aleksander Alekseev
 wrote:
> Naturally the database is consistent too. So it looks like the
> recovery protocol worked as expected after all, at least in the
> upcoming PG16 and this specific scenario.
>
> Did I miss anything? If not, perhaps we should just update the comments a bit?

I was confused about what was happening here but after trying to
reproduce I think I figured it out. In my test, I saw the .1 file grow
to a full 1GB and then the truncate happened afterward.
Unsurprisingly, that works. I believe that what's happening here is
that the DELETE statement triggers FPIs for all of the blocks it
touches. Therefore, when the DELETE records are replayed, those blocks
get written back to the relation, extending it. That gets it up to the
required 1GB size after which the .2 file is visible to the smgr
again, so the truncate works fine. I think that to reproduce the
scenario, you want the truncate to happen in its own checkpoint cycle.

I also found out from Andres that he's complained about pretty much
the same problem just a couple of months ago:

https://www.postgresql.org/message-id/20230223010147.32oir7sb66slq...@awork3.anarazel.de

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Missing update of all_hasnulls in BRIN opclasses

2023-04-24 Thread Alvaro Herrera
On 2023-Apr-23, Tomas Vondra wrote:

> here's an updated version of the patch, including a backport version. I
> ended up making the code yet a bit closer to master by introducing
> add_values_to_range(). The current pre-14 code has the loop adding data
> to the BRIN tuple in two places, but with the "fixed" logic handling
> NULLs and the empty_range flag the amount of duplicated code got too
> high, so this seem reasonable.

In backbranches, the new field in BrinMemTuple needs to be at the end of
the struct, to avoid ABI breakage.
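
Schematically, the constraint is that already-compiled extensions have the
old field offsets baked in, so a back-branch patch may only append (field
names below are placeholders, not the real BrinMemTuple layout):

typedef struct ExampleMemTuple
{
    bool        existing_flag;  /* existing fields keep their offsets */
    BlockNumber existing_blkno;
    /* ... */
    bool        new_flag;       /* back-branch addition goes here, last */
} ExampleMemTuple;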

There's a comment in add_values_to_range with a typo "If the range was had".
Also, "So we should not see empty range that was not modified" should
perhaps be "should not see an empty range".

(As for your FIXME comment in brin_memtuple_initialize, I think you're
correct: we definitely need to reset bt_placeholder.  Otherwise, we risk
places that call it in a loop using a BrinMemTuple with one range with
the flag set, in a range where that doesn't hold.  It might be
impossible for this to happen, given how narrow the conditions are on
which bt_placeholder is used; but it seems safer to reset it anyway.)

Some pgindent noise would be induced by this patch.  I think it's worth
cleaning up ahead of time.

I did a quick experiment of turning the patches over -- applying test
first, code fix after (requires some conflict fixing).  With that I
verified that the behavior of 'hasnulls' indeed changes with the code
fix.

> Both cases have a patch extending pageinspect to show the new flag, but
> obviously we should commit that only in master. In backbranches it's
> meant only to make testing easier.

In backbranches, I think it should be reasonably easy to add a
--1.7--1.7.1.sql file and set the default version to 1.7.1; that then
enables us to have the functionality (and the tests) in older branches
too.  If you just add a --1.X.1--1.12.sql version to each branch that's
identical to that branch's current pageinspect version upgrade script,
that would let us have working upgrade paths for all branches.  This is
a bit laborious but straightforward enough.

If you don't feel like adding that, I volunteer to add the necessary
scripts to all branches after you commit the bugfix, and ensure that all
upgrade paths work correctly.

> I plan to do a bit more testing, I'd welcome some feedback - it's a
> long-standing bug, and it'd be good to finally get this fixed. I don't
> think the patch can be made any simpler.

The approach looks good to me.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)




Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Peter Eisentraut

On 22.04.23 01:00, Jeff Davis wrote:

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:

And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix"


I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.


This makes sense to me.  This way, if someone specifies 'C' locale 
together with ICU provider they get an error.  They can then choose to 
use the libc provider, to get the performance path, or stick with ICU by 
using the native spelling of the locale.






Re: Order changes in PG16 since ICU introduction

2023-04-24 Thread Peter Eisentraut

On 21.04.23 19:14, Peter Eisentraut wrote:

On 21.04.23 19:09, Sandro Santilli wrote:

On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:

"Regina Obe"  writes:


https://trac.osgeo.org/postgis/ticket/5375


If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.


Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :

   =# select version();
   PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit

   =# show lc_collate;
   C
   =# select '+' < '-';
   f


If the database is created with locale provider ICU, then lc_collate 
does not apply here, so the result might be correct (depending on what 
locale you have set).


The GUC settings lc_collate and lc_ctype are from a time when those 
locale settings were cluster-global.  When we made those locale settings 
per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can 
use ICU as the per-database locale provider, so what is being attempted 
in the above example is already meaningless before PG 16, since you need 
to look into pg_database to find out what is really happening.


I think we should just remove the GUC parameters lc_collate and lc_ctype.





Re: Memory leak in CachememoryContext

2023-04-24 Thread Tom Lane
I wrote:
> Alvaro Herrera  writes:
>> Hmm, we can leave it unused in our code, but it still needs to be
>> initialized to some valid memory context anyway; otherwise hypothetical
>> code that uses it would still crash.

> I think we want that to happen, actually, because it's impossible
> to guess what such hypothetical code needs the context to be.

I guess we could have the back branches continue to create a
shared_cast_context and just not use it in core.  Seems rather
expensive for a very hypothetical compatibility measure, though.

regards, tom lane




Re: Minor code de-duplication in fe-connect.c

2023-04-24 Thread Gurjeet Singh
On Mon, Apr 24, 2023 at 5:14 AM Daniel Gustafsson  wrote:

> > On 21 Apr 2023, at 18:38, Gurjeet Singh  wrote:
> >
> > On Fri, Apr 21, 2023 at 7:47 AM Robert Haas 
> wrote:
> >>
> >> On Fri, Apr 21, 2023 at 8:25 AM Daniel Gustafsson 
> wrote:
> >>> The reason I left it like this when reviewing and committing is that I
> think it
> >>> makes for more readable code.  The amount of lines saved is pretty
> small, and
> >>> "shuffle" isn't an exact term so by reading the code it isn't
> immediate clear
> >>> what such a function would do.  By having the shuffle algorithm where
> it's used
> >>> it's clear what the code does and what the outcome is.  If others
> disagree I
> >>> can go ahead and refactor of course, but I personally would not deem
> it a net
> >>> win in code quality.
> >>
> >> I think we should avoid nitpicking stuff like this. I likely would
> >> have used a subroutine if I'd done it myself, but I definitely
> >> wouldn't have submitted a patch to change whatever the last person did
> >> without some tangible reason for so doing.
> >
> > Code duplication, and the possibility that a change in one location
> > (bugfix or otherwise) does not get applied to the other locations, is
> > a good enough reason to submit a patch, IMHO.
> >
> >> It's not a good use of
> >> reviewer and committer time to litigate things like this.
> >
> > Postgres has a very high bar for code quality, and for documentation.
> > It is a major reason that attracts people to, and keeps them in the
> > Postgres ecosystem, as users and contributors. If anything, we should
> > encourage folks to point out such inconsistencies in code and docs
> > that keep the quality high.
> >
> > This is not an attack on any one commit, or author/committer of the
> > commit; sorry if it appeared that way. I was merely reviewing the
> > commit that introduced a nice libpq feature. This patch is merely a
> > minor improvement to the code that I think deserves a consideration.
> > It's not a litigation, by any means.
>
> I didn't actually read the patch earlier, but since Robert gave a +1 to the
> idea of refactoring I had a look. I have a few comments:
>
> +static void
> +shuffleAddresses(PGConn *conn)
> This fails to compile since the struct is typedefed PGconn.
>
>
> -   /*
> -* This is the "inside-out" variant of the Fisher-Yates shuffle
> -* algorithm. Notionally, we append each new value to the array
> -* and then swap it with a randomly-chosen array element (possibly
> -* including itself, else we fail to generate permutations with
> -* the last integer last).  The swap step can be optimized by
> -* combining it with the insertion.
> -*
>  * We don't need to initialize conn->prng_state here, because that
>  * already happened in connectOptions2.
>  */
> This also fails to compile since it removes the starting /* marker of the
> comment resulting in a comment starting on *.
>
>
> -   for (i = 1; i < conn->nconnhost; i++)
> -   for (int i = 1; i < conn->naddr; i++)
> You are replacing these loops for shuffling an array with a function that
> does
> this:
> +   for (int i = 1; i < conn->naddr; i++)
> This is not the same thing, one is shuffling the hosts and the other is
> shuffling the addresses resolved for the hosts (which may be a n:m
> relationship).  If you had compiled and run the tests you would have seen
> that
> the libpq tests now fail with this applied, as would be expected.
>
>
> Shuffling the arrays can of course be made into a subroutine, but what to
> shuffle and where needs to be passed in to it and at that point I doubt the
> code readability is much improved over the current coding.
>
>
Sorry about the errors. This seems to be an older version of the patch
that I had generated before fixing these mistakes. I do remember
encountering the compiler errors and revising the code to fix them,
especially the upper vs. lower case and the partial removal of the comment.
I am away from my keyboard, so please expect the newer patch some time later.

Best regards,
Gurjeet
http://Gurje.et


Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-24 Thread Julien Rouhaud
Hi,

On Mon, Apr 24, 2023 at 12:03:05PM +, Hayato Kuroda (Fujitsu) wrote:
>
> > I think that this test should be different when just checking for the
> > prerequirements (live_check / --check) compared to actually doing the 
> > upgrade,
> > as it's almost guaranteed that the slots won't have sent everything when the
> > source server is up and running.
>
> Hmm, you assumed that the user application is still running and data is coming
> continuously when doing --check, right? Personally I have thought that the
> --check operation is executed just before the actual upgrade, therefore I'm not
> sure your assumption is a real problem.

The checks are always executed before doing the upgrade, to prevent it if
something isn't right.  But you can also just do those check on a live
instance, so you can get a somewhat strong guarantee that the upgrade operation
will succeed before needing to stop all services and shut down postgres.  It's
basically free to run those checks and can avoid an unnecessary service
interruption so I'm pretty sure people use it quite often.

> And I could not find any checks whose
> behavior changes based on the --check option.

Yes, because other checks are things that you can actually fix when the
instance is running, like getting rid of tables with oids.  The only semi
exception if for 2pc which can be continuously prepared and committed, but if
you hit that problem at least you know you have to stop cleanly your XA-like
application and make sure there are no 2pc left.

> Yeah, if we support the case, checking pg_replication_slots.active may be
> sufficient.
> Actually this cannot handle the case where pg_create_logical_replication_slot()
> is executed just before upgrading, but I'm not sure it should be handled.

It shouldn't, same for any of the other checks.  The live check can't predict
the future, it just tells you if there's anything that would prevent the
upgrade *at the moment it's executed*.




Re: [PATCH] Allow Postgres to pick an unused port to listen

2023-04-24 Thread Peter Eisentraut

On 19.04.23 06:21, Stephen Frost wrote:

I don't think involving pg_ctl is necessary or desirable, since it would
make any future changes like that even more complicated.

I'm a bit confused by this- if pg_ctl is invoked then we have
more-or-less full control over parsing and reporting out the answer, so
while it might be a bit more complicated for us, it seems surely simpler
for the end user.  Or maybe you're referring to something here that I'm
not thinking of?


Getting pg_ctl involved just requires a lot more work.  We need to write 
actual code, documentation, tests, help output, translations, etc.  If 
we ever change anything, then we need to transition the command-line 
arguments somehow, add more documentation, etc.


A file is a much simpler interface: You just write to it, write two 
sentences of documentation, that's all.


Or to put it another way, if we don't think a file is an appropriate 
interface, then why is a PID file appropriate?



Independent of the above though ... this hand-wringing about what we
might do in the relative near-term when we haven't done much in the past
many-many years regarding listen_addresses or port strikes me as
unlikely to be necessary.  Let's pick something and get it done and
accept that we may have to change it at some point in the future, but
that's kinda what major releases are for, imv anyway.


Right.  I'm perfectly content with just allowing port number 0 and 
leaving it at that.
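
For anyone following along, "port 0" works because the kernel assigns a free
port at bind time, which the server can then read back and report (e.g.
through a file).  A minimal standalone illustration, unrelated to the actual
PostgreSQL code:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    int         fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    socklen_t   len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);   /* let the kernel choose */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
        getsockname(fd, (struct sockaddr *) &addr, &len) == 0)
        printf("listening port: %d\n", ntohs(addr.sin_port));

    close(fd);
    return 0;
}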






Re: run pgindent on a regular basis / scripted manner

2023-04-24 Thread Tom Lane
Peter Eisentraut  writes:
> Does anyone find perltidy useful?  To me, it functions more like a 
> JavaScript compiler in that once you process the source code, it is no 
> longer useful for manual editing.  If we are going to have the buildfarm 
> check indentation and that is going to be extended to Perl code, I have 
> some concerns about that.

I certainly don't like its current behavior where adding/changing one
line can have side-effects on nearby lines.  But we have a proposal
to clean that up, and I'm cautiously optimistic that it'll be better
in future.  Did you have other specific concerns?

regards, tom lane




Re: Add missing copyright for pg_upgrade/t/* files

2023-04-24 Thread Andrew Dunstan


On 2023-04-24 Mo 03:08, Hayato Kuroda (Fujitsu) wrote:

Dear David,


It is great to make sure each file has the Copyright and I see this
patch has already been committed.

Thanks!
While checking more, I was surprised to find many files which do not have a
Copyright notice, via the "grep -Lr Copyright --exclude-dir .git ..." command.
I'm not sure whether this is expected, but all sql files in src/test/regress/sql
and many files in contrib do not have one. Do you know anything about it?


Just curious, is there a rule to add Copyright to Postgres?

Sorry, I'm not sure about it. Before submitting the patch I checked the
"PostgreSQL Coding Conventions" chapter of the manual, but I could not find any rule.


For example,
if I run a command `grep -rn Copyright --include="*.pl" | awk -F ':'
{'print $2, $1'} | sort -nr` inside postgres/src/bin, it seems most
Copyright were added to the second line, but these two were added to the
very beginning (of course, there are three other files following this
pattern as well).

There seems to be a tendency for recently added files to have the Copyright at
the very beginning, but I suspect from the result that there are no specific
rules about it.

```
$ grep -rn Copyright --include="*.pl" | awk -F ':' {'print $2'} | sort -nr | 
uniq -c
   1 753
   1 752
   1 717
...
  22 3
 158 2
  24 1
```



I suspect many of those came from the last time I did this, at commit 
8fa6e6919c.


IIRC I added "\nCopyright...\n\n" at line 1 unless that was a "#!" line, 
in which case I added it after line 1 (it was done via a sed script IIRC)


I think since then perltidy has dissolved some of the extra blank lines 
added at the end.


I don't think we actually have a rule about it, but the pattern I 
described doesn't seem unreasonable.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: Memory leak in CachememoryContext

2023-04-24 Thread Tom Lane
Alvaro Herrera  writes:
>> (Note to self: we can't remove the cast_hash_context field in
>> back branches for fear of causing an ABI break for pldebugger.
>> But we can leave it unused, I think.)

> Hmm, we can leave it unused in our code, but it still needs to be
> initialized to some valid memory context anyway; otherwise hypothetical
> code that uses it would still crash.

I think we want that to happen, actually, because it's impossible
to guess what such hypothetical code needs the context to be.
As things stand now, that field points to a long-lived context
in some cases and a short-lived one in others.  We risk either
data structure corruption or a session-lifespan memory leak
if we guess about such usage ... which really shouldn't exist
anyway.

regards, tom lane




Re: run pgindent on a regular basis / scripted manner

2023-04-24 Thread Peter Eisentraut

On 23.04.23 17:29, Jelte Fennema wrote:

On Sun, 23 Apr 2023 at 17:16, Tom Lane  wrote:

I think we could go ahead and commit the perltidyrc and README changes
now.  But the ensuing reformatting should happen as part of the mass
pgindent run, probably next month sometime.


I think it's better to make the changes close together, not with a
month in between. Otherwise no-one will be able to run perltidy on
their patches, because the config and the files are even more out of
sync than they are now. So I'd propose to commit the perltidyrc
changes right before the pgindent run.


Does anyone find perltidy useful?  To me, it functions more like a 
JavaScript compiler in that once you process the source code, it is no 
longer useful for manual editing.  If we are going to have the buildfarm 
check indentation and that is going to be extended to Perl code, I have 
some concerns about that.






Re: Memory leak in CachememoryContext

2023-04-24 Thread Alvaro Herrera
On 2023-Apr-21, Tom Lane wrote:

> (Note to self: we can't remove the cast_hash_context field in
> back branches for fear of causing an ABI break for pldebugger.
> But we can leave it unused, I think.)

Hmm, we can leave it unused in our code, but it still needs to be
initialized to some valid memory context anyway; otherwise hypothetical
code that uses it would still crash.  This seems halfway obvious, but
since the submitted patch doesn't have this part, I thought better to
point it out.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Request for comment on setting binary format output per session

2023-04-24 Thread Dave Cramer
>
>
>
>
> Backing up a level, the purpose of the protocol extension mechanism is
> to help us agree on the communication protocol -- that is, the set of
> messages that we can send and receive on a certain connection. The
> question for the protocol extension mechanism isn't "which types
> should always be sent in binary format?" but "would it be ok if I
> wanted you to always send certain types in binary format?", with the
> idea that if the answer is yes it will still be necessary for the
> client to let the server know which ones, but that's easy to do if
> we've agreed on the concept that it's OK for me to ask the server for
> that. And if it's OK for me to ask that once, it should also be OK for
> me to later ask for something different.
>
> This could, perhaps, be made even more general yet. We could define a
> concept of "protocol session parameters" and make "which types are
> always sent in binary format?" one of those parameters. So then the
> conversation could go like this:
>
> C: Hello! Do you know about protocol session parameters?
> S: Why yes, actually I do.
> C: Cool. I would like to set the protocol session parameter
> types_always_in_binary_format=timestamptz. Does that work for you?
> S: Sure thing! (or alternatively: Sadly, I've not heard of that
> particular protocol session parameter, sorry to disappoint.)
>
> The reason why I suggest this is that I feel like there could be a
> bunch of things like this. The set of things to be sent in binary
> format feels like a property of the wire protocol, not something
> SQL-level that should be configured via SET.  Clients, drivers, and
> connection poolers aren't going to want to have to worry about some
> user screwing up the session by changing that property inside of a
> function or procedure or whatever. But there could also be a bunch of
> different things like this that we want to support. For example, one
> that would be really useful for connection poolers is the session
> user. The pooler would like to change the session user whenever the
> connection is changed to talk to a different client, and it would like
> that to happen in a way that can't be reversed by issuing any SQL
> command. I expect in time we may find a bunch of others.
>
>

Ok, this looks like the way to go. I have some questions about
implementation.

The client sends _pq_.format_binary;
the server doesn't object, so now the client implicitly knows that it can send
a new protocol message.
At this point the client sends some new message, 'F' for example, with the OIDs
the client wants in binary for the remainder of the session.

Ideally, I'd like to avoid this second message. Is the above correct?

Dave


Re: Should we remove vacuum_defer_cleanup_age?

2023-04-24 Thread Alvaro Herrera
On 2023-Apr-22, Andres Freund wrote:

> On 2023-04-13 13:18:38 +0200, Alvaro Herrera wrote:
> > 
> > > Updated patch attached. I think we should either apply something like that
> > > patch, or at least add a warning to the docs.
> > 
> > I gave this patch a look.  The only code change is that
> > ComputeXidHorizons() and GetSnapshotData() no longer handle the case
> > where vacuum_defer_cleanup_age is different from zero.  It looks good.
> > The TransactionIdRetreatSafely() code being removed is pretty weird (I
> > spent a good dozen minutes writing a complaint that your rewrite was
> > faulty, but it turns out I had misunderstood the function), so I'm glad
> > it's being retired.
> 
> My rewrite of what? The creation of TransactionIdRetreatSafely() in
> be504a3e974?

I meant the code that used to call TransactionIdRetreatSafely() and that
you're changing in the proposed patch.

> I'm afraid we'll need TransactionIdRetreatSafely() again, when we convert more
> things to 64bit xids (lest they end up with the same bug as fixed by
> be504a3e974), so it's perhaps worth thinking about how to make it less
> confusing.

The one thing that IMO makes it less confusing is to have it return the
value rather than modifying it in place.

> >
> > Replication slots overcome these disadvantages by retaining only the 
> > number
> > of segments known to be needed.
> > On the other hand, replication slots can retain so
> > many WAL segments that they fill up the space allocated
> > for pg_wal;
> > max_slot_wal_keep_size limits the size of WAL
> > files
> > retained by replication slots.
> >
> 
> It seems a bit confusing now, because "by retaining only the number of
> segments ..." now also should cover hs_feedback (due to merging), but doesn't.

Hah, ok.

> I think I'll push the version I had. Then we can separately word-smith the
> section? Unless somebody protests I'm gonna do that soon.

No objection.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Minor code de-duplication in fe-connect.c

2023-04-24 Thread Daniel Gustafsson
> On 21 Apr 2023, at 18:38, Gurjeet Singh  wrote:
> 
> On Fri, Apr 21, 2023 at 7:47 AM Robert Haas  wrote:
>> 
>> On Fri, Apr 21, 2023 at 8:25 AM Daniel Gustafsson  wrote:
>>> The reason I left it like this when reviewing and committing is that I 
>>> think it
>>> makes for more readable code.  The amount of lines saved is pretty small, 
>>> and
>>> "shuffle" isn't an exact term so by reading the code it isn't immediate 
>>> clear
>>> what such a function would do.  By having the shuffle algorithm where it's 
>>> used
>>> it's clear what the code does and what the outcome is.  If others disagree I
>>> can go ahead and refactor of course, but I personally would not deem it a 
>>> net
>>> win in code quality.
>> 
>> I think we should avoid nitpicking stuff like this. I likely would
>> have used a subroutine if I'd done it myself, but I definitely
>> wouldn't have submitted a patch to change whatever the last person did
>> without some tangible reason for so doing.
> 
> Code duplication, and the possibility that a change in one location
> (bugfix or otherwise) does not get applied to the other locations, is
> a good enough reason to submit a patch, IMHO.
> 
>> It's not a good use of
>> reviewer and committer time to litigate things like this.
> 
> Postgres has a very high bar for code quality, and for documentation.
> It is a major reason that attracts people to, and keeps them in the
> Postgres ecosystem, as users and contributors. If anything, we should
> encourage folks to point out such inconsistencies in code and docs
> that keep the quality high.
> 
> This is not an attack on any one commit, or author/committer of the
> commit; sorry if it appeared that way. I was merely reviewing the
> commit that introduced a nice libpq feature. This patch is merely a
> minor improvement to the code that I think deserves a consideration.
> It's not a litigation, by any means.

I didn't actually read the patch earlier, but since Robert gave a +1 to the
idea of refactoring I had a look. I have a few comments:

+static void
+shuffleAddresses(PGConn *conn)
This fails to compile since the struct is typedefed PGconn.


-   /*
-* This is the "inside-out" variant of the Fisher-Yates shuffle
-* algorithm. Notionally, we append each new value to the array
-* and then swap it with a randomly-chosen array element (possibly
-* including itself, else we fail to generate permutations with
-* the last integer last).  The swap step can be optimized by
-* combining it with the insertion.
-*
 * We don't need to initialize conn->prng_state here, because that
 * already happened in connectOptions2.
 */
This also fails to compile since it removes the starting /* marker of the
comment resulting in a comment starting on *.


-   for (i = 1; i < conn->nconnhost; i++)
-   for (int i = 1; i < conn->naddr; i++)
You are replacing these loops for shuffling an array with a function that does
this:
+   for (int i = 1; i < conn->naddr; i++)
This is not the same thing, one is shuffling the hosts and the other is
shuffling the addresses resolved for the hosts (which may be a n:m
relationship).  If you had compiled and run the tests you would have seen that
the libpq tests now fail with this applied, as would be expected.


Shuffling the arrays can of course be made into a subroutine, but what to
shuffle and where needs to be passed in to it and at that point I doubt the
code readability is much improved over the current coding.
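
For what it's worth, a generic version of such a subroutine could look like
the sketch below (illustrative only; the real code shuffles the connhost and
addr entries in place using the connection's prng_state, which is exactly
the extra context that would have to be passed in):

#include "common/pg_prng.h"

/*
 * Shuffle a[0..n-1] in place.  This is the "inside-out" Fisher-Yates
 * variant the removed comment described: element i is notionally appended
 * and then swapped with a randomly chosen slot in 0..i (possibly itself,
 * else permutations keeping the last element last are never generated).
 */
static void
shuffle_array(int *a, int n, pg_prng_state *state)
{
    for (int i = 1; i < n; i++)
    {
        int         j = (int) pg_prng_uint64_range(state, 0, i);
        int         tmp = a[j];

        a[j] = a[i];
        a[i] = tmp;
    }
}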

--
Daniel Gustafsson





Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Drouvot, Bertrand

Hi,

On 4/24/23 11:45 AM, Amit Kapila wrote:

On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila  wrote:


On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
 wrote:




Few comments:




+# We cannot test whether the WAL file still exists immediately.
+# We need to give the standby some time to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this adhoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?


Yes, the restart point on the standby is not necessarily completed even after
wait_for_catchup is done.


Is there a guarantee that it can never fail on some slower machines?



We are waiting here for at most 10 * $default_timeout (meaning 3 minutes) before
we time out. Would you prefer to wait more than 3 minutes at a maximum?


BTW, for the second test is it necessary that we first ensure that the
WAL file has not been retained on the primary?



I was not sure it was worth it either. The idea was more that it's useless to
verify it is removed on the standby if we are not 100% sure it has been removed
on the primary first. But yeah, we can get rid of this test if you prefer.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-24 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thank you for giving comments. New patchset can be available in [1].

> Thanks for the updated patch.
> A Few comments:
> 1) if the verbose option is enabled, we should print the new slot
> information, we could add a function print_slot_infos similar to
> print_rel_infos which could print slot name and two_phase is enabled
> or not.
> +   end_progress_output();
> +   check_ok();
> +
> +   /* update new_cluster info now that we have objects in the databases 
> */
> +   get_db_and_rel_infos(_cluster);
> +}

I was not sure we should add the printing, because other objects like
publications and subscriptions do not seem to be printed, but I have added it.
While implementing it, I thought that calling get_db_and_rel_infos() again
was not efficient, because free_db_and_rel_infos() would be called at that
time, so I added get_logical_slot_infos() instead.
Additionally, I added a check to check_new_cluster_is_empty() to make sure
that there are no logical slots on the new node.

> 2) Since we will be using this option with pg_upgrade, should we use
> this along with the --binary-upgrade option only?
> +   if (dopt.logical_slots_only && dopt.dataOnly)
> +   pg_fatal("options --logical-replication-slots-only and
> -a/--data-only cannot be used together");
> +   if (dopt.logical_slots_only && dopt.schemaOnly)
> +   pg_fatal("options --logical-replication-slots-only and
> -s/--schema-only cannot be used together");

Right, I added the check.

> 3) Since it two_phase is boolean, can we use bool data type instead of string:
> +   slotinfo[i].dobj.objType = DO_LOGICAL_REPLICATION_SLOT;
> +
> +   slotinfo[i].dobj.catId.tableoid = InvalidOid;
> +   slotinfo[i].dobj.catId.oid = InvalidOid;
> +   AssignDumpId([i].dobj);
> +
> +   slotinfo[i].dobj.name = pg_strdup(PQgetvalue(res, i,
> i_slotname));
> +
> +   slotinfo[i].plugin = pg_strdup(PQgetvalue(res, i, i_plugin));
> +   slotinfo[i].twophase = pg_strdup(PQgetvalue(res, i,
> i_twophase));
> 
> We can change it to something like:
> if (strcmp(PQgetvalue(res, i, i_twophase), "t") == 0)
> slotinfo[i].twophase = true;
> else
> slotinfo[i].twophase = false;

Seems right, fixed.

> 4) The comments are inconsistent, some have termination characters and
> some don't. We can keep it consistent:
> +# Can be changed to test the other modes.
> +my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> +
> +# Initialize old publisher node
> +my $old_publisher = PostgreSQL::Test::Cluster->new('old_publisher');
> +$old_publisher->init(allows_streaming => 'logical');
> +$old_publisher->start;
> +
> +# Initialize subscriber node
> +my $subscriber = PostgreSQL::Test::Cluster->new('subscriber');
> +$subscriber->init(allows_streaming => 'logical');
> +$subscriber->start;
> +
> +# Schema setup
> +$old_publisher->safe_psql('postgres',
> +   "CREATE TABLE tbl AS SELECT generate_series(1,10) AS a");
> +$subscriber->safe_psql('postgres', "CREATE TABLE tbl (a int)");
> +
> +# Initialize new publisher node

Removed all termination characters.

> 5) should we use free instead of pfree as used in other function like
> dumpForeignServer:
> +   appendPQExpBuffer(query, ");");
> +
> +   ArchiveEntry(fout, slotinfo->dobj.catId, 
> slotinfo->dobj.dumpId,
> +ARCHIVE_OPTS(.tag = slotname,
> +
> .description = "REPLICATION SLOT",
> +
> .section = SECTION_POST_DATA,
> +
> .createStmt = query->data));
> +
> +   pfree(slotname);
> +   destroyPQExpBuffer(query);
> +   }
> +}

Actually it works, because on the client side pfree() is just a wrapper around
pg_free(), but I agree that it should be fixed, so I did.

[1]: 
https://www.postgresql.org/message-id/TYAPR01MB58669413A5A2E3E50BD0B7E7F5679%40TYAPR01MB5866.jpnprd01.prod.outlook.com


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: [PoC] pg_upgrade: allow to upgrade publisher node

2023-04-24 Thread Hayato Kuroda (Fujitsu)
Dear Julien,

Thank you for the comments! PSA the new version.

> I think that this test should be different when just checking for the
> prerequirements (live_check / --check) compared to actually doing the upgrade,
> as it's almost guaranteed that the slots won't have sent everything when the
> source server is up and running.

Hmm, you assumed that the user application is still running and data is coming
in continuously when doing --check, right? Personally I had thought that the
--check operation is executed just before the actual upgrade, so I'm not sure
your assumption is a real problem. And I could not find any existing checks
whose behavior changes based on the --check option.

Anyway, I included your suggestion in the 0004 patch. We can ask other
reviewers about its necessity.

> Maybe simply check that all logical slots are currently active when running 
> the
> live check,

Yeah, if we support that case, checking pg_replication_slots.active may be
sufficient. Note that this cannot handle the case where
pg_create_logical_replication_slot() is executed just before upgrading, but
I'm not sure that case needs to be handled.

> so at least there's a good chance that they will still be at
> shutdown, and will therefore send all the data to the subscribers? Having a
> regression tests for that scenario would also be a good idea.  Having an
> uncommitted write transaction should be enough to cover it.

I think background_psql() can be used for that purpose. Before doing pg_upgrade
--check, a transaction is opened and kept open. This means that confirmed_flush
has not yet reached the current WAL position, but the check says OK because all
slots are active.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v9-0001-pg_upgrade-Add-include-logical-replication-slots-.patch
Description:  v9-0001-pg_upgrade-Add-include-logical-replication-slots-.patch


v9-0002-Always-persist-to-disk-logical-slots-during-a-shu.patch
Description:  v9-0002-Always-persist-to-disk-logical-slots-during-a-shu.patch


v9-0003-pg_upgrade-Add-check-function-for-include-logical.patch
Description:  v9-0003-pg_upgrade-Add-check-function-for-include-logical.patch


v9-0004-Change-the-method-used-to-check-logical-replicati.patch
Description:  v9-0004-Change-the-method-used-to-check-logical-replicati.patch


Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-04-24 Thread Yurii Rashkovskii
Aleksander,

On Mon, Apr 24, 2023 at 1:01 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> > The patch is backwards-compatible and ensures that bgw_library_name
> stays *at least* as long as BGW_MAXLEN. Existing external code that uses
> BGW_MAXLEN as a length boundary (for example, in `strncpy`) will continue
> to work as expected.
>
> There is a mistake in the comment though:
>
> ```
> +/*
> + * Ensure bgw_function_name's size is backwards-compatible and sensible
> + */
> +StaticAssertDecl(MAXPGPATH >= BGW_MAXLEN, "MAXPGPATH must be at least equal to BGW_MAXLEN");
> ```
>
> library_name, not function_name. Also I think the comment should be
> more detailed, something like "prior to PG17 we used ... but since
> PG17 ... which may cause problems if ...".
>

This is a very good catch and a suggestion. I've slightly reworked the
patch, and I also made this static assertion to have less indirection and,
therefore, make it easier to understand the premise.
Version 2 is attached.

-- 
Y.


v2-0001-Extend-the-length-of-BackgroundWorker.bgw_library_na.patch
Description: Binary data


Re: Memory leak in CachememoryContext

2023-04-24 Thread Ajit Awekar
Tom, thanks a lot for your patch. I applied the changes and confirmed
there is no memory leak with the v2 patch.
We are not using the MemoryContext variables "cast_hash_context" and
"shared_cast_context".

Thanks & Best Regards,
Ajit

On Sat, Apr 22, 2023 at 4:49 AM Tom Lane  wrote:

> Ajit Awekar  writes:
> > I have implemented the approach by splitting the hash table into two
> parts.
> > Please find the attached patch for the same.
>
> I found a few things not to like about this:
>
> * You didn't update the comment describing these hash tables.
>
> * I wasn't really thrilled about renaming the plpgsql_CastHashEntry
> typedef, as that seemed to just create uninteresting diff noise.
> Also, "SessionCastHashEntry" versus "PrivateCastHashEntry" seems a
> very misleading choice of names, since one of the "PrivateCastHashEntry"
> hash tables is in fact session-lifespan.  After some thought I left
> the "upper" hash table entry type as plpgsql_CastHashEntry so that
> code outside the immediate area needn't be affected, and named the
> "lower" table cast_expr_hash, with entry type plpgsql_CastExprHashEntry.
> I'm not wedded to those names though, if you have a better idea.
>
> (BTW, it's completely reasonable to rename the type as an intermediate
> step in making a patch like this, since it ensures you'll examine
> every existing usage to choose the right thing to change it to.  But
> I generally rename things back afterwards.)
>
> * I didn't like having to do two hashtable lookups during every
> call even after we've fully cached the info.  That's easy to avoid
> by keeping a link to the associated "lower" hashtable entry in the
> "upper" ones.
>
> * You removed the reset of cast_exprstate etc from the code path where
> we've just reconstructed the cast_expr.  That's a mistake since it
> might allow us to skip rebuilding the derived expression state after
> a DDL change.
>
>
> Also, while looking at this I noticed that we are no longer making
> any use of estate->cast_hash_context.  That's not the fault of
> your patch; it's another oversight in the one that added the
> CachedExpression mechanism.  The compiled expressions used to be
> stored in that context, but now the plancache is responsible for
> them and we are never putting anything in the cast_hash_context.
> So we might as well get rid of that and save 8K of wasted memory.
> This allows some simplification in the hashtable setup code too.
>
> In short, I think we need something more like the attached.
>
> (Note to self: we can't remove the cast_hash_context field in
> back branches for fear of causing an ABI break for pldebugger.
> But we can leave it unused, I think.)
>
> regards, tom lane
>
>


Re: [PATCH] Extend the length of BackgroundWorker.bgw_library_name

2023-04-24 Thread Aleksander Alekseev
Hi,

> The trade-off of this patch is that the `BackgroundWorker` structure becomes 
> larger. From my perspective, this is a reasonable cost (less than a kilobyte 
> of extra space per worker).

Agree.

> The patch is backwards-compatible and ensures that bgw_library_name stays *at 
> least* as long as BGW_MAXLEN. Existing external code that uses BGW_MAXLEN as
> a length boundary (for example, in `strncpy`) will continue to work as 
> expected.

There is a mistake in the comment though:

```
+/*
+ * Ensure bgw_function_name's size is backwards-compatible and sensible
+ */
+StaticAssertDecl(MAXPGPATH >= BGW_MAXLEN, "MAXPGPATH must be at least equal to BGW_MAXLEN");
```

library_name, not function_name. Also I think the comment should be
more detailed, something like "prior to PG17 we used ... but since
PG17 ... which may cause problems if ...".
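
Something along these lines, perhaps (only a sketch of the wording, to be
adapted to whatever the final patch does):

```
/*
 * Prior to PG17, bgw_library_name was BGW_MAXLEN bytes; it is now
 * MAXPGPATH bytes so that longer library paths fit.  Make sure it
 * cannot become shorter than before, since existing external code
 * still uses BGW_MAXLEN as a length boundary when filling it.
 */
StaticAssertDecl(MAXPGPATH >= BGW_MAXLEN,
                 "MAXPGPATH must be at least equal to BGW_MAXLEN");
```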

-- 
Best regards,
Aleksander Alekseev




Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-04-24 Thread Etsuro Fujita
On Mon, Apr 24, 2023 at 3:31 PM Nishant Sharma
 wrote:
> Any updates? -- did you get a chance to look into this?

Sorry, I have not looked into this yet because I have been busy with
some other work recently.  I plan to do so early next week.

Best regards,
Etsuro Fujita




Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Drouvot, Bertrand

Hi,

On 4/24/23 8:24 AM, Amit Kapila wrote:

> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
>  wrote:
>>
>
> Few comments:



Thanks for looking at it!


> 1.
> +$node_subscriber->init(allows_streaming => 'logical');
> +$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');
>
> Why do we need slots on the subscriber?



Good point, it's not needed. I guess it was missed during my initial patch
cleanup.

Fixed in V3 attached.


> 2.
> +# Speed up the subscription creation
> +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> +
> +# Explicitly shut down psql instance gracefully - to avoid hangs
> +# or worse on windows
> +$psql_subscriber{subscriber_stdin} .= "\\q\n";
> +$psql_subscriber{run}->finish;
> +
> +# Insert some rows on the primary
> +$node_primary->safe_psql('postgres',
> + qq[INSERT INTO tab_rep select generate_series(1,10);]);
> +
> +$node_primary->wait_for_replay_catchup($node_standby);
> +
> +# To speed up the wait_for_subscription_sync
> +$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
> +$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');
>
> It is not clear to me why you need to do pg_log_standby_snapshot() twice.


That's because there are two logical slot creations that have to be done on the
standby.

The one for the subscription:

"
CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput (SNAPSHOT 'nothing')
"

And the one for the data sync:

"
CREATE_REPLICATION_SLOT "pg_16389_sync_16384_7225540800768250444" LOGICAL 
pgoutput (SNAPSHOT 'use')
"

Without the second "pg_log_standby_snapshot()", wait_for_subscription_sync()
would wait for some time, polling on "SELECT count(1) = 0 FROM
pg_subscription_rel WHERE srsubstate NOT IN ('r', 's');"

Adding a comment in V3 to explain the need for the second 
pg_log_standby_snapshot().



> 3. Why do you need $psql_subscriber to be used in a different way
> instead of using safe_psql as is used for node_primary?



Because safe_psql() would wait for activity on the primary without being able
to launch pg_log_standby_snapshot() on the primary while waiting.
psql_subscriber() allows us to avoid waiting synchronously.

Also adding a comment in V3 to explain why safe_psql() is not being used here.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From f74fb99ff67e47bb78bdff62bdf1ac824a8fb24b Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Apr 2023 05:13:23 +
Subject: [PATCH v3 2/2] Add two tests in 035_standby_logical_decoding.pl

Adding two tests, to verify that:

- subscription to the standby is possible
- invalidated logical slots do not lead to retaining WAL
---
 src/test/perl/PostgreSQL/Test/Cluster.pm  |  11 +-
 .../t/035_standby_logical_decoding.pl | 167 +-
 2 files changed, 173 insertions(+), 5 deletions(-)
   4.4% src/test/perl/PostgreSQL/Test/
  95.5% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm 
b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 6f7f4e5de4..819667d42a 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2644,7 +2644,16 @@ sub wait_for_catchup
}
if (!defined($target_lsn))
{
-   $target_lsn = $self->lsn('write');
+   my $isrecovery = $self->safe_psql('postgres', "SELECT pg_is_in_recovery()");
+   chomp($isrecovery);
+   if ($isrecovery eq 't')
+   {
+   $target_lsn = $self->lsn('replay');
+   }
+   else
+   {
+   $target_lsn = $self->lsn('write');
+   }
}
print "Waiting for replication conn "
  . $standby_name . "'s "
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl 
b/src/test/recovery/t/035_standby_logical_decoding.pl
index b8f5311fe9..647e203b0f 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -9,13 +9,19 @@ use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
+use Time::HiRes qw(usleep);
 
-my ($stdin, $stdout, $stderr, $cascading_stdout, $cascading_stderr, $ret, $handle, $slot);
+my ($stdin, $stdout, $stderr,
+   $cascading_stdout, $cascading_stderr, $subscriber_stdin,
+   $subscriber_stdout, $subscriber_stderr, $ret,
+   $handle, $slot);
 
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 my $node_standby = PostgreSQL::Test::Cluster->new('standby');
 my $node_cascading_standby = PostgreSQL::Test::Cluster->new('cascading_standby');
+my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber');
 my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+my $psql_timeout = IPC::Run::timer(2 * $default_timeout);
 my $res;
 
 # Name for the physical slot on 

RE: Test failures of 100_bugs.pl

2023-04-24 Thread Yu Shi (Fujitsu)
On Fri, Apr 21, 2023 1:48 PM Yu Shi (Fujitsu)  wrote:
> 
> I wrote a patch to dump rel state in wait_for_subscription_sync() as 
> suggested.
> Please see the attached patch.
> I will try to add some debug logs in code later.
> 

Please see the attached v2 patch.

I added some debug logs when invalidating syncing table states and updating
table_states_not_ready list. I also adjusted the message level in the tests
which failed before.

Regards,
Shi Yu


v2-0001-dump-rel-state-in-wait_for_subscription_sync.patch
Description:  v2-0001-dump-rel-state-in-wait_for_subscription_sync.patch


v2-0002-Add-logs-to-help-investigate-subscription-test-fa.patch
Description:  v2-0002-Add-logs-to-help-investigate-subscription-test-fa.patch


Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert

2023-04-24 Thread Daniel Gustafsson
> On 21 Apr 2023, at 18:56, Jacob Champion  wrote:
> 
> On Fri, Apr 21, 2023 at 12:55 AM Daniel Gustafsson  wrote:
>> This has been done and the open item marked as completed.
> 
> Thanks! Now that the weirdness is handled by the tests, I think we can
> remove the Cirrus workaround. Something like the attached, which
> passes the macOS Meson suite for me.

Agreed, I had this on my TODO list for when the test fix patch had landed.
Verified in CI for me as well so pushed.  Thanks!

--
Daniel Gustafsson





Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Amit Kapila
On Mon, Apr 24, 2023 at 11:54 AM Amit Kapila  wrote:
>
> On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
>  wrote:
> >
>
> Few comments:
> 
>

+# We cannot test whether the WAL file still exists immediately;
+# we need to give the standby some time to actually "remove" it.
+my $i = 0;
+while (1)
+{
+ last if !-f $standby_walfile;
+ if ($i++ == 10 * $default_timeout)
+ {
+ die
+   "could not determine if WAL file has been retained or not, can't continue";
+ }
+ usleep(100_000);
+}

Is this ad hoc wait required because we can't guarantee that the
checkpoint is complete on standby even after using wait_for_catchup?
Is there a guarantee that it can never fail on some slower machines?

BTW, for the second test is it necessary that we first ensure that the
WAL file has not been retained on the primary?

-- 
With Regards,
Amit Kapila.




Re: duplicate function declaration in multirangetypes_selfuncs.c

2023-04-24 Thread Daniel Gustafsson
> On 24 Apr 2023, at 03:35, Alexander Korotkov  wrote:
> On Sun, Apr 23, 2023 at 3:36 PM Daniel Gustafsson  wrote:

>> I had planned to backpatch these two fixes for just that reason, to avoid 
>> the risk for other backpatches not applying.
> 
> OK. I'm good with this plan.

Done.

--
Daniel Gustafsson





could not extend file "base/5/3501" with FileFallocate(): Interrupted system call

2023-04-24 Thread Christoph Berg
Re: Andres Freund
> Add smgrzeroextend(), FileZero(), FileFallocate()

Hi,

I'm often seeing PG16 builds erroring out in the pgbench tests:

00:33:12 make[2]: Entering directory '/<>/build/src/bin/pgbench'
00:33:12 echo "# +++ tap check in src/bin/pgbench +++" && rm -rf 
'/<>/build/src/bin/pgbench'/tmp_check && /bin/mkdir -p 
'/<>/build/src/bin/pgbench'/tmp_check && cd 
/<>/build/../src/bin/pgbench && 
TESTLOGDIR='/<>/build/src/bin/pgbench/tmp_check/log' 
TESTDATADIR='/<>/build/src/bin/pgbench/tmp_check' 
PATH="/<>/build/tmp_install/usr/lib/postgresql/16/bin:/<>/build/src/bin/pgbench:$PATH"
 LD_LIBRARY_PATH="/<>/build/tmp_install/usr/lib/aarch64-linux-gnu" 
 PGPORT='65432' top_builddir='/<>/build/src/bin/pgbench/../../..' 
PG_REGRESS='/<>/build/src/bin/pgbench/../../../src/test/regress/pg_regress'
 /usr/bin/prove -I /<>/build/../src/test/perl/ -I 
/<>/build/../src/bin/pgbench --verbose t/*.pl
00:33:12 # +++ tap check in src/bin/pgbench +++
00:33:14 #   Failed test 'concurrent OID generation status (got 2 vs expected 
0)'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #   Failed test 'concurrent OID generation stdout /(?^:processed: 
125/125)/'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #   'pgbench (16devel (Debian 
16~~devel-1.pgdg100+~20230423.1656.g8bbd0cc))
00:33:14 # transaction type: 
/<>/build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_concurrent_insert
00:33:14 # scaling factor: 1
00:33:14 # query mode: prepared
00:33:14 # number of clients: 5
00:33:14 # number of threads: 1
00:33:14 # maximum number of tries: 1
00:33:14 # number of transactions per client: 25
00:33:14 # number of transactions actually processed: 118/125
00:33:14 # number of failed transactions: 0 (0.000%)
00:33:14 # latency average = 26.470 ms
00:33:14 # initial connection time = 66.583 ms
00:33:14 # tps = 188.889760 (without initial connection time)
00:33:14 # '
00:33:14 # doesn't match '(?^:processed: 125/125)'
00:33:14 #   Failed test 'concurrent OID generation stderr /(?^:^$)/'
00:33:14 #   at t/001_pgbench_with_server.pl line 31.
00:33:14 #   'pgbench: error: client 2 script 0 aborted in 
command 0 query 0: ERROR:  could not extend file "base/5/3501" with 
FileFallocate(): Interrupted system call
00:33:14 # HINT:  Check free disk space.
00:33:14 # pgbench: error: Run was aborted; the above results are incomplete.
00:33:14 # '
00:33:14 # doesn't match '(?^:^$)'
00:33:26 # Looks like you failed 3 tests of 428.
00:33:26 t/001_pgbench_with_server.pl ..
00:33:26 not ok 1 - concurrent OID generation status (got 2 vs expected 0)

I don't think the disk is full since it's always hitting that same
spot, on some of the builds:

https://pgdgbuild.dus.dg-i.net/job/postgresql-16-binaries-snapshot/833/

This is overlayfs with tmpfs (upper)/ext4 (lower). Manually running
that test works though, and the FS seems to support posix_fallocate:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int f;
	int err;

	if ((f = open("moo", O_CREAT | O_RDWR, 0666)) < 0)
		perror("open");

	/* posix_fallocate() returns the error number; it does not set errno */
	err = posix_fallocate(f, 0, 10);
	errno = err;
	perror("posix_fallocate");

	return 0;
}

$ ./a.out
posix_fallocate: Success
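
FWIW the failing call is reporting EINTR, and posix_fallocate() is
documented as allowed to fail that way when a signal arrives mid-call,
in which case the operation can simply be restarted.  A minimal sketch of
such a retry wrapper (purely illustrative, not PostgreSQL code - whether
FileFallocate() should loop like this is exactly the question):

    #include <errno.h>
    #include <fcntl.h>

    static int
    fallocate_retry_eintr(int fd, off_t offset, off_t len)
    {
        int rc;

        do
        {
            /* returns 0 or an error number; never sets errno */
            rc = posix_fallocate(fd, offset, len);
        } while (rc == EINTR);

        return rc;
    }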

The problem has been there for some weeks. I didn't report it earlier
because I was on vacation, spending the free time trying to bootstrap
s390x support for apt.pg.o, and because there was this other direct IO
problem making all the builds fail for some time.

Christoph




Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Amit Kapila
On Wed, Apr 12, 2023 at 9:45 PM Drouvot, Bertrand
 wrote:
>
>
> The attached patch also removes:
>
> "
> -log_min_messages = 'debug2'
> -log_error_verbosity = verbose
> "
>
> as also discussed in [1].
>

I agree that we should reduce the log level here. It is discussed in
an email [1]. I'll push this part tomorrow unless Andres or someone
else thinks that we still need this.

[1] - https://www.postgresql.org/message-id/523315.1681245505%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.




Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-24 Thread Jesper Pedersen

On 4/20/23 13:40, Tom Lane wrote:

> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
>
> Please join me in wishing them much success and few reverts.



Huge congrats !


Best regards,

 Jesper






Re: xmlserialize bug - extra empty row at the end

2023-04-24 Thread Jim Jones

On 24.04.23 03:18, Tom Lane wrote:
> I wouldn't actually *use* pchomp here, because that induces an unnecessary
> copy of the result string.  I had in mind more like copying pchomp's code
> to count up the trailing newline(s) and then pass a corrected length
> to cstring_to_text_with_len.

Changed.

> You could simplify matters by doing that in all cases, too.  It should
> never find anything to remove in the non-indented case, but the check
> should be of negligible cost in context.
should be of negligible cost in context.


I'm not sure I understood it correctly.

The non-indented cases should never find anything and indented cases 
with CONTENT strings do not add trailing newlines, so this is only 
applicable with DOCUMENT .. INDENT, right?


Something like this would suffice?

if (xmloption_arg != XMLOPTION_DOCUMENT)
    result = (text *) xmlBuffer_to_xmltype(buf);
else
{
    int         len = xmlBufferLength(buf);
    const char *xmloutput = (const char *) xmlBufferContent(buf);

    while (len > 0 && xmloutput[len - 1] == '\n')
        len--;

    result = cstring_to_text_with_len(xmloutput, len);
}

If we really agree on manually removing the trailing newlines I will 
open a CF entry for this.


Best, Jim

From aa5eafb319da04d2e67a1540af0d088af6d82edb Mon Sep 17 00:00:00 2001
From: Jim Jones 
Date: Mon, 24 Apr 2023 10:02:32 +0200
Subject: [PATCH v1] Remove trailing newlines from xmlserialize indent output

This removes the trailing newlines added to xmlserialize indent
output from xml strings of type DOCUMENT.

Reported by: Pavel Stehule
Discussion : https://www.postgresql.org/message-id/CAFj8pRCNTi2yHBXcdYf-cYZ63R8Laf9L49Q_uxt%2BA5WXKPPhxg%40mail.gmail.com
---
 src/backend/utils/adt/xml.c | 20 +++-
 src/test/regress/expected/xml.out   | 15 +--
 src/test/regress/expected/xml_2.out | 15 +--
 3 files changed, 29 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c
index 15adbd6a01..404a2f455d 100644
--- a/src/backend/utils/adt/xml.c
+++ b/src/backend/utils/adt/xml.c
@@ -771,7 +771,25 @@ xmltotext_with_options(xmltype *data, XmlOptionType xmloption_arg, bool indent)
 		"could not close xmlSaveCtxtPtr");
 		}
 
-		result = (text *) xmlBuffer_to_xmltype(buf);
+		/*
+		 * This is necessary to remove the trailing newline created
+		 * by xmlSaveDoc - it only affects DOCUMENT xml strings.
+		 * The fragments of CONTENT strings are stored into the
+		 * xmlBufferPtr using xmlSaveTree, which does not add a
+		 * trailing newline.
+		 */
+		if (xmloption_arg != XMLOPTION_DOCUMENT)
+			result = (text *) xmlBuffer_to_xmltype(buf);
+		else
+		{
+			int	len = xmlBufferLength(buf);
+			const char *xmloutput = (const char *) xmlBufferContent(buf);
+
+			while (len > 0 && xmloutput[len - 1] == '\n')
+				len--;
+
+			result = cstring_to_text_with_len(xmloutput, len);
+		}
 	}
 	PG_CATCH();
 	{
diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out
index 398345ca67..b689f86fe6 100644
--- a/src/test/regress/expected/xml.out
+++ b/src/test/regress/expected/xml.out
@@ -494,8 +494,7 @@ SELECT xmlserialize(DOCUMENT '42' AS text
+
  42+
   +
-  +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '42' AS text INDENT);
@@ -555,8 +554,7 @@ SELECT xmlserialize(DOCUMENT '42text node<
  42+
  text node73+
   +
-  +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '42text node73' AS text INDENT);
@@ -610,8 +608,7 @@ SELECT xmlserialize(DOCUMENT '   +
  73 +
  +
- +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '73' AS text INDENT);
@@ -629,8 +626,7 @@ SELECT xmlserialize(DOCUMENT '' AS text INDENT);
  xmlserialize 
 --
  +
- +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '' AS text INDENT);
@@ -647,8 +643,7 @@ SELECT xmlserialize(DOCUMENT '' AS text INDENT);
 --
 +
+
-   +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '' AS text INDENT);
diff --git a/src/test/regress/expected/xml_2.out b/src/test/regress/expected/xml_2.out
index 43c2558352..a2eeff8369 100644
--- a/src/test/regress/expected/xml_2.out
+++ b/src/test/regress/expected/xml_2.out
@@ -474,8 +474,7 @@ SELECT xmlserialize(DOCUMENT '42' AS text
+
  42+
   +
-  +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '42' AS text INDENT);
@@ -535,8 +534,7 @@ SELECT xmlserialize(DOCUMENT '42text node<
  42+
  text node73+
   +
-  +
- 
+ 
 (1 row)
 
 SELECT xmlserialize(CONTENT  '42text node73' AS text INDENT);
@@ -590,8 +588,7 @@ SELECT xmlserialize(DOCUMENT '   +
  73 +

Re: New committers: Nathan Bossart, Amit Langote, Masahiko Sawada

2023-04-24 Thread Andrey M. Borodin



> On 20 Apr 2023, at 22:40, Tom Lane  wrote:
> 
> The Core Team would like to extend our congratulations to
> Nathan Bossart, Amit Langote, and Masahiko Sawada, who have
> accepted invitations to become our newest Postgres committers.
> 
> Please join me in wishing them much success and few reverts.


Cool! Congratulations Nathan, Amit, and Masahiko!


Best regards, Andrey Borodin.




Re: Improving worst-case merge join performance with often-null foreign key

2023-04-24 Thread Richard Guo
On Sun, Apr 23, 2023 at 5:29 PM Richard Guo  wrote:

> On Sat, Apr 22, 2023 at 11:21 PM Tom Lane  wrote:
>
>> Steinar Kaldager  writes:
>> > First-time potential contributor here. We recently had an incident due
>> > to a sudden 1000x slowdown of a Postgres query (from ~10ms to ~10s)
>> > due to a join with a foreign key that was often null. We found that it
>> > was caused by a merge join with an index scan on one join path --
>> > whenever the non-null data happened to be such that the merge join
>> > couldn't be terminated early, the index would proceed to scan all of
>> > the null rows and filter each one out individually. Since this was an
>> > inner join, this was pointless; the nulls would never have matched the
>> > join clause anyway.
>>
>> Hmm.  I don't entirely understand why the existing stop-at-nulls logic
>> in nodeMergejoin.c didn't fix this for you.  Maybe somebody has broken
>> that?  See the commentary for MJEvalOuterValues/MJEvalInnerValues.
>
>
> I think it's just because the MergeJoin didn't see a NULL foo_id value
> from test_bar tuples because all such tuples are removed by the filter
> 'test_bar.active', thus it does not have a chance to stop at nulls.
>
> # select count(*) from test_bar where foo_id is null and active;
>  count
> ---
>  0
> (1 row)
>
> Instead, the index scan on test_bar will have to scan all the tuples
> with NULL foo_id because none of them satisfies the qual clause.
>

BTW, in Steinar's case the query runs much faster with nestloop with a
parameterized inner path, since test_foo is small and test_bar is very
large, and there is an index on test_bar.foo_id.  You can see that by
"set enable_mergejoin to off":

# EXPLAIN (costs off)
SELECT SUM(test_foo.id) FROM test_bar, test_foo WHERE test_bar.foo_id =
test_foo.id AND test_foo.active AND test_bar.active;
  QUERY PLAN
--
 Aggregate
   ->  Nested Loop
 ->  Seq Scan on test_foo
   Filter: active
 ->  Index Scan using test_bar_foo_id_idx on test_bar
   Index Cond: (foo_id = test_foo.id)
   Filter: active
(7 rows)

In my box the total cost and execution time of mergejoin vs nestloop
are:

                     mergejoin     nestloop

Cost estimate        1458.40       12355.15
Actual (best of 3)   3644.685 ms   13.114 ms

So it seems we have holes in the cost estimates for mergejoin, or for nestloop
with a parameterized inner path, or both.

Thanks
Richard


Re: pg_upgrade and logical replication

2023-04-24 Thread Julien Rouhaud
Hi,

On Tue, Apr 18, 2023 at 01:40:51AM +, Hayato Kuroda (Fujitsu) wrote:
>
> I found a cfbot failure on macOS [1]. According to the log,
> "SELECT count(*) FROM t2" was executed before synchronization was done.
>
> ```
> [09:24:21.018](0.132s) not ok 18 - Table t2 should now have 3 rows on the new 
> subscriber
> ```
>
> With the patch present, wait_for_catchup() is executed after REFRESH, but
> it may not be sufficient because it does not check pg_subscription_rel.
> wait_for_subscription_sync() seems better for the purpose.

Fixed, thanks!

v5 attached with all previously mentioned fixes.
>From c6755ea3318220dc41bc315cc7acce4954e9b252 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Wed, 22 Feb 2023 09:19:32 +0800
Subject: [PATCH v5] Optionally preserve the full subscription's state during
 pg_upgrade

Previously, only the subscription metadata information was preserved.  Without
the list of relations and their state it's impossible to re-enable the
subscriptions without missing some records as the list of relations can only be
refreshed after enabling the subscription (and therefore starting the apply
worker).  Even if we added a way to refresh the subscription while enabling a
publication, we still wouldn't know which relations are new on the publication
side, and therefore should be fully synced, and which shouldn't.

Similarly, the subscription's replication origin are needed to ensure
that we don't replicate anything twice.

To fix this problem, this patch teaches pg_dump in binary upgrade mode to emit
additional ALTER SUBSCRIPTION subcommands that will restore the content of
pg_subscription_rel, and also provides an additional LSN parameter for CREATE
SUBSCRIPTION to restore the underlying replication origin remote LSN.  The new
ALTER SUBSCRIPTION subcommand and the new LSN parameter are not exposed to
users and only accepted in binary upgrade mode.

The new ALTER SUBSCRIPTION subcommand has the following syntax:

ALTER SUBSCRIPTION name ADD TABLE (relid = XYZ, state = 'x' [, lsn = 'X/Y'])

The relation is identified by its oid, as it's preserved during pg_upgrade.
The lsn is optional, and defaults to NULL / InvalidXLogRecPtr if not provided.
Explicitly passing InvalidXLogRecPtr (0/0) is however not allowed.

This mode is optional and not enabled by default.  A new
--preserve-subscription-state option is added to pg_upgrade to use it.  For
now, pg_upgrade will check that all the subscription have a valid replication
origin remote_lsn, and that all underlying relations are in 'r' (ready) state,
and will error out if that's not the case, logging the reason for the failure.

Author: Julien Rouhaud
Reviewed-by: FIXME
Discussion: https://postgr.es/m/20230217075433.u5mjly4d5cr4hcfe@jrouhaud
---
 doc/src/sgml/ref/pgupgrade.sgml  |  23 +++
 src/backend/commands/subscriptioncmds.c  |  75 +++-
 src/backend/parser/gram.y|  11 ++
 src/bin/pg_dump/common.c |  22 +++
 src/bin/pg_dump/pg_backup.h  |   2 +
 src/bin/pg_dump/pg_dump.c| 136 +-
 src/bin/pg_dump/pg_dump.h|  15 ++
 src/bin/pg_upgrade/check.c   |  81 +
 src/bin/pg_upgrade/dump.c|   3 +-
 src/bin/pg_upgrade/meson.build   |   1 +
 src/bin/pg_upgrade/option.c  |   6 +
 src/bin/pg_upgrade/pg_upgrade.h  |   1 +
 src/bin/pg_upgrade/t/003_subscription.pl | 220 +++
 src/include/nodes/parsenodes.h   |   3 +-
 src/tools/pgindent/typedefs.list |   1 +
 15 files changed, 595 insertions(+), 5 deletions(-)
 create mode 100644 src/bin/pg_upgrade/t/003_subscription.pl

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7816b4c685..6af790c986 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -240,6 +240,29 @@ PostgreSQL documentation
   
  
 
+ 
+  --preserve-subscription-state
+  
+   
+Fully preserve the logical subscription state if any.  That includes
+the underlying replication origin with their remote LSN and the list of
+relations in each subscription so that replication can be simply
+resumed if the subscriptions are reactivated.
+   
+   
+If this option isn't used, it is up to the user to reactivate the
+subscriptions in a suitable way; see the subscription part in  for more information.
+   
+   
+If this option is used and any of the subscription on the old cluster
+has an unknown remote_lsn (0/0), or has any relation
+in a state different from r (ready), the
+pg_upgrade run will error.
+   
+  
+ 
+
  
   -?
   --help
diff --git a/src/backend/commands/subscriptioncmds.c 
b/src/backend/commands/subscriptioncmds.c
index 56eafbff10..657db3791e 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ 

Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns

2023-04-24 Thread Nikita Malakhov
Hi,

This is a production case for large databases with high update rates; it is
often mistaken for reaching the table size limit, although the size limit
itself is handled correctly.

The note on the TOAST limitation does not mention that TOAST values are not
actually updated in place by an UPDATE operation - the old value is marked as
dead and a new one is inserted, and dead values must be vacuumed before their
value OIDs can be reused. Worst of all, the INSERT/UPDATE does not fail if
there is no OID available - it gets stuck in an infinite loop searching for a
free OID.
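
The loop in question is essentially of this shape (a simplified sketch of
GetNewOidWithIndex() in catalog/catalog.c; toast_oid_exists() is a
hypothetical stand-in for the actual index probe, which uses SnapshotAny
and therefore also sees dead, not-yet-vacuumed chunks):

    Oid
    get_new_toast_oid(Relation toastrel, Oid toastidxid)
    {
        Oid     newOid;

        do
        {
            newOid = GetNewObjectId();  /* advances the global OID counter */
        } while (toast_oid_exists(toastrel, toastidxid, newOid));
        /* unbounded: never fails, just spins once every OID is taken */

        return newOid;
    }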

On Sat, Apr 22, 2023 at 6:42 PM Gurjeet Singh  wrote:

> On Fri, Apr 21, 2023 at 12:14 AM Nikita Malakhov 
> wrote:
> > This limitation applies not only to wide tables - it also applies to
> tables where TOASTed values
> > are updated very often. You would soon be out of available TOAST value
> ID because in case of
> > high frequency updates autovacuum cleanup rate won't keep up with the
> updates. It is discussed
> > in [1].
> >
> > On Fri, Apr 21, 2023 at 9:39 AM Jakub Wartak <
> jakub.war...@enterprisedb.com> wrote:
> >> I would like to ask if it wouldn't be good idea to copy the
> >> https://wiki.postgresql.org/wiki/TOAST#Total_table_size_limit
> >> discussion (out-of-line OID usage per TOAST-ed columns / potential
> >> limitation) to the official "Appendix K. PostgreSQL Limits" with also
> >> little bonus mentioning the "still searching for an unused OID in
> >> relation" notice. Although it is pretty obvious information for some
> >> and from commit 7fbcee1b2d5f1012c67942126881bd492e95077e and the
> >> discussion in [1], I wonder if the information shouldn't be a little
> >> more well known via the limitation (especially to steer people away
> >> from designing very wide non-partitioned tables).
> >>
> >> [1] -
> https://www.postgresql.org/message-id/flat/16722-93043fb459a41073%40postgresql.org
> >
> > [1]
> https://www.postgresql.org/message-id/CAN-LCVPRvRzxeUdYdDCZ7UwZQs1NmZpqBUCd%3D%2BRdMPFTyt-bRQ%40mail.gmail.com
>
> These 2 discussions show that it's a painful experience to run into
> this problem, and that the hackers have ideas on how to fix it, but
> those fixes haven't materialized for years. So I would say that, yes,
> this info belongs in the hard-limits section, because who knows how
> long it'll take this to be fixed.
>
> Please submit a patch.
>
> I anticipate that edits to Appendix K Postgres Limits will prompt
> improving the note in there about the maximum column limit, That note
> is too wordy, and sometimes confusing, especially for the audience
> that it's written for: newcomers to Postgres ecosystem.
>
> Best regards,
> Gurjeet https://Gurje.et
> http://aws.amazon.com
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


RE: Add missing copyright for pg_upgrade/t/* files

2023-04-24 Thread Hayato Kuroda (Fujitsu)
Dear David,

> It is great to make sure each file has the Copyright and I see this
> patch has already been committed.

Thanks!
While checking further, I was surprised to find many files which do not
have a Copyright notice, via the command
"grep -Lr Copyright --exclude-dir .git ...".
I'm not sure whether it is expected, but all sql files in src/test/regress/sql
and many files in contrib do not have one. Do you know anything about it?

> Just curious, is there a rule to add Copyright to Postgres?

Sorry, I'm not sure about it. Before submitting the patch I checked the
"PostgreSQL Coding Conventions" chapter of the manual, but I could not find
any such rule.

> For example,
> if I run a command `grep -rn Copyright --include="*.pl" | awk -F ':'
> {'print $2, $1'} | sort -nr` inside postgres/src/bin, It seems most
> Copyright were added to the second line, but these two were added to the
> very beginning (of course, there are three other files following this
> pattern as well).

There seems to be a tendency for recently added files to have the Copyright
at the very beginning, but from the result below I suspect that there are no
specific rules about it.

```
$ grep -rn Copyright --include="*.pl" | awk -F ':' {'print $2'} | sort -nr | 
uniq -c
  1 753
  1 752
  1 717
...
 22 3
158 2
 24 1
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Autogenerate some wait events code and documentation

2023-04-24 Thread Drouvot, Bertrand

Hi,

On 4/24/23 5:15 AM, Michael Paquier wrote:

> On Sat, Apr 22, 2023 at 03:36:05PM +0200, Drouvot, Bertrand wrote:
>> On 4/20/23 3:09 AM, Michael Paquier wrote:
>>> It is clear that the format of the file is:
>>> - category
>>> - C symbol in enums.
>>> - Format in the system views.
>>> - Description in the docs.
>>> Or perhaps it would be better to divide this file by sections (like
>>> errcodes.txt) for each category so as we eliminate entirely the first
>>> column?
>>
>> Yeah, that could be an option. V3 is still using the category as the first
>> column but I'm ok to change it by a section if you prefer (though I don't
>> really see the need).
>
> It can make the file width shorter, at least..


Right.



> [ .. thinks .. ]
>
> +my $waitclass;
> +my @wait_classes = ("PG_WAIT_ACTIVITY", "PG_WAIT_CLIENT", "PG_WAIT_IPC", "PG_WAIT_TIMEOUT", "PG_WAIT_IO");
>
> Actually, having a "Section" in waiteventnames.txt would remove the
> need to store this knowledge in generate-waiteventnames.pl, which is
> a duplicate of the txt contents.  If somebody adds a new class in the
> future, it would be necessary to update this path as well.  Well, that
> would not be a huge effort in itself, but IMO the script translating
> the .txt to the docs and the code should have no need to know the
> types of classes.  I guess that a format like that would make the most
> sense to me, then:
> Section: ClassName PG_WAIT_CLASS_NAME
>
> # ClassName would be "IO", "IPC", "Timeout", etc.
>
> WAIT_EVENT_NAME_1 "WaitEventName1" "Description of wait event 1"
> WAIT_EVENT_NAME_N "WaitEventNameN" "Description of wait event N"



I gave it another thought, and I agree that it's better to use sections
in the .txt file. This is done in the attached V4.


>>> utils/adt/jsonpath_scan.c \
>>> + utils/activity/waiteventnames.c \
>>> + utils/activity/waiteventnames.h \
>>> + utils/adt/jsonpath_scan.c \
>>>
>>> Looks like a copy-pasto.


>> Why do you think so? both files have to be removed.


> jsonpath_scan.c is listed twice, and that's still the case in v3.


Oh I see, I misunderstood what you thought the typo was.
Fixed in V4 thanks!


> The list of files deleted for maintainer-clean in src/backend/Makefile
> should be listed alphabetically (utils/activity/ before utils/adt/),
> but that's a nit ;)


Oh right, fixed.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

From 2d3fa2567a80c3431bf9682510d483ac0f1d7cec Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Sat, 22 Apr 2023 10:37:56 +
Subject: [PATCH v4] Generating wait_event_types.h, pgstat_wait_event.c and
 waiteventnames.sgml

Purpose is to auto-generate those files based on the newly created 
waiteventnames.txt file.

Having auto generated files would help to avoid:

- wait event without description like observed in 
https://www.postgresql.org/message-id/CA%2BhUKGJixAHc860Ej9Qzd_z96Z6aoajAgJ18bYfV3Lfn6t9%3D%2BQ%40mail.gmail.com
- orphaned wait event like observed in 
https://www.postgresql.org/message-id/CA%2BhUKGK6tqm59KuF1z%2Bh5Y8fsWcu5v8%2B84kduSHwRzwjB2aa_A%40mail.gmail.com
---
 doc/src/sgml/.gitignore   |   1 +
 doc/src/sgml/Makefile |   5 +-
 doc/src/sgml/filelist.sgml|   1 +
 doc/src/sgml/meson.build  |  10 +
 doc/src/sgml/monitoring.sgml  | 792 +-
 src/backend/Makefile  |  13 +-
 src/backend/utils/activity/Makefile   |  10 +
 .../utils/activity/generate-waiteventnames.pl | 178 
 src/backend/utils/activity/meson.build|  24 +
 src/backend/utils/activity/wait_event.c   | 567 +
 src/backend/utils/activity/waiteventnames.txt | 202 +
 src/include/Makefile  |   2 +-
 src/include/utils/.gitignore  |   1 +
 src/include/utils/meson.build |  18 +
 src/include/utils/wait_event.h| 214 +
 src/tools/msvc/Solution.pm|  19 +
 src/tools/msvc/clean.bat  |   3 +
 17 files changed, 488 insertions(+), 1572 deletions(-)
  36.8% doc/src/sgml/
  52.0% src/backend/utils/activity/
   8.8% src/include/utils/

diff --git a/doc/src/sgml/.gitignore b/doc/src/sgml/.gitignore
index d8e3dab338..e8cbd687d5 100644
--- a/doc/src/sgml/.gitignore
+++ b/doc/src/sgml/.gitignore
@@ -17,6 +17,7 @@
 /errcodes-table.sgml
 /keywords-table.sgml
 /version.sgml
+/waiteventnames.sgml
 # Assorted byproducts from building the above
 /postgres-full.xml
 /INSTALL.html
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 71cbef230f..01a6aa8187 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
 override XSLTPROCFLAGS += --stringparam pg.version '$(VERSION)'
 
 GENERATED_SGML = version.sgml \
features-supported.sgml features-unsupported.sgml errcodes-table.sgml \
-   keywords-table.sgml
+   keywords-table.sgml waiteventnames.sgml
 
 ALLSGML := $(wildcard $(srcdir)/*.sgml 

Re: pg_upgrade and logical replication

2023-04-24 Thread Julien Rouhaud
Hi,

On Fri, Apr 14, 2023 at 04:19:35AM +, Hayato Kuroda (Fujitsu) wrote:
>
> I have tested, but srsublsn became NULL if copy_data was specified as off.
> This is because when copy_data is false, all tuples in pg_subscription_rels 
> are filled
> as state = 'r' and srsublsn = NULL, and tablesync workers will never boot.
> See CreateSubscription().
> Doesn't it mean that there is a possibility that LSN option is not specified 
> while
> ALTER SUBSCRIPTION ADD TABLE?

It shouldn't be the case for now, as pg_upgrade will first check if there's an
invalid remote_lsn and refuse to proceed if that's the case.  Also, the
remote_lsn should be set as soon as some data is replicated, so unless you add
a table that's never modified to a publication you should be able to run
pg_upgrade at some point, once there's replicated DML on such a table.

I'm personally fine with the current restrictions, but I don't really use
logical replication in any project so maybe I'm not objective enough.  For now
I'd rather keep things as-is, and later improve on it if some people want to
lift such restrictions (and such restrictions can actually be lifted).




Re: Perform streaming logical transactions by background workers and parallel apply

2023-04-24 Thread Masahiko Sawada
On Mon, Apr 24, 2023 at 2:24 PM Amit Kapila  wrote:
>
> On Mon, Apr 24, 2023 at 7:26 AM Masahiko Sawada  wrote:
> >
> > While looking at the worker.c, I realized that we have the following
> > code in handle_streamed_transaction():
> >
> > default:
> > Assert(false);
> > return false;   /* silence compiler warning */
> >
> > I think it's better to do elog(ERROR) instead of Assert() as it ends
> > up returning false in non-assertion builds, which might cause a
> > problem. And it's more consistent with other codes in worker.c. Please
> > find an attached patch.
> >
>
> I haven't tested it but otherwise, the changes look good to me.

Thanks for checking! Pushed.
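
For the archives, the pushed change replaces the silent fallthrough with an
error report along these lines (a sketch; the exact message wording in
worker.c may differ):

    default:
        elog(ERROR, "unexpected apply action: %d", (int) apply_action);
        return false;   /* silence compiler warning */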

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: postgres_fdw: wrong results with self join + enable_nestloop off

2023-04-24 Thread Nishant Sharma
Hi Etsuro Fujita,


Any updates? -- did you get a chance to look into this?


Regards,
Nishant.

On Mon, Apr 17, 2023 at 11:00 AM Nishant Sharma <
nishant.sha...@enterprisedb.com> wrote:

> Thanks Etsuro for your response!
>
> One small typo correction in my answer to "What is the technical issue?"
> "it is *not* considered a pseudo constant" --> "it is considered a pseudo
> constant"
>
>
> Regards,
> Nishant.
>
> On Fri, Apr 14, 2023 at 6:21 PM Etsuro Fujita 
> wrote:
>
>> Hi Nishant,
>>
>> On Fri, Apr 14, 2023 at 8:39 PM Nishant Sharma
>>  wrote:
>> > I debugged this issue and was able to find a fix for the same. Kindly
>> please refer to the attached fix. With the fix I am able to resolve the
>> issue.
>>
>> Thanks for the report and patch!
>>
>> > What is the technical issue?
>> > The problem here is the use of extract_actual_clauses, because of which
>> the plan creation misses adding the second AND condition, i.e. "now() <
>> '23-Feb-2020'::timestamp", in the plan. Because it is not considered a
>> pseudo constant and extract_actual_clause is passed false as the
>> second parameter, it gets skipped from the list. As a result that
>> condition is never taken into consideration as either one-time filter
>> (before or after) or part of SQL remote execution
>> >
>> > Why do I think the fix is correct?
>> > The fix is simple, where we have created a new function similar to
>> extract_actual_clause which just extracts all the conditions from the list
>> with no checks and returns the list to the caller. As a result all
>> conditions would be taken into consideration in the query plan.
>>
>> I think that the root cause for this issue would be in the
>> create_scan_plan handling of pseudoconstant quals when creating a
>> foreign-join (or custom-join) plan.  Anyway, I will look at your patch
>> closely, first.
>>
>> Best regards,
>> Etsuro Fujita
>>
>


Re: Add two missing tests in 035_standby_logical_decoding.pl

2023-04-24 Thread Amit Kapila
On Mon, Apr 24, 2023 at 11:24 AM Drouvot, Bertrand
 wrote:
>

Few comments:

1.
+$node_subscriber->init(allows_streaming => 'logical');
+$node_subscriber->append_conf('postgresql.conf', 'max_replication_slots = 4');

Why do we need slots on the subscriber?

2.
+# Speed up the subscription creation
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+
+# Explicitly shut down psql instance gracefully - to avoid hangs
+# or worse on windows
+$psql_subscriber{subscriber_stdin} .= "\\q\n";
+$psql_subscriber{run}->finish;
+
+# Insert some rows on the primary
+$node_primary->safe_psql('postgres',
+ qq[INSERT INTO tab_rep select generate_series(1,10);]);
+
+$node_primary->wait_for_replay_catchup($node_standby);
+
+# To speed up the wait_for_subscription_sync
+$node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");
+$node_subscriber->wait_for_subscription_sync($node_standby, 'tap_sub');

It is not clear to me why you need to do pg_log_standby_snapshot() twice.

3. Why do you need $psql_subscriber to be used in a different way
instead of using safe_psql as is used for node_primary?

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade and logical replication

2023-04-24 Thread Julien Rouhaud
Hi,

On Thu, Apr 13, 2023 at 03:26:56PM +1000, Peter Smith wrote:
>
> 1.
> All the comments look alike, so it is hard to know what is going on.
> If each of the main test parts could be highlighted then the test code
> would be easier to read IMO.
>
> Something like below:
> [...]

I added a bit more comments about what is being tested.  I'm not sure that a
big TEST CASE prefix is necessary, as it's not really multiple separate test
cases and other stuff can be tested in between.  Also AFAICT no other TAP test
currently needs this kind of banner, even if they're testing more complex
scenarios.

> 2.
> +# replication origin's remote_lsn isn't set if not data is replicated after 
> the
> +# initial sync
>
> wording:
> /if not data is replicated/if data is not replicated/

I actually meant "if no data", which is a bit different from what you
suggested.  Fixed.

> 3.
> # Make sure the replication origin is set
>
> I was not sure if all of the SELECT COUNT(*) checking is needed
> because it just seems normal pub/sub functionality. There is no
> pg_upgrade happening, so really it seemed the purpose of this part was
> mainly to set the origin so that it will not be a blocker for
> ready-state tests that follow this code. Maybe this can just be
> incorporated into the following test part.

Since this patch is transferring internal details about subscriptions I prefer
to be thorough about what is tested, when data is actually being replicated and
so on, so that if something is broken (a relation added to the wrong
subscription, a wrong oid or something) it immediately shows what's happening.

> 4a.
> TBH, I felt it might be easier to follow if the SQL was checking for
> WHERE (text = "while old_sub is down") etc, rather than just using
> SELECT COUNT(*), and then trusting the comments to describe what the
> different counts mean.

I prefer the plain count as it's a simple way to make sure that the state is
exactly what's wanted.  If for some reason the patch leads to a previous row
being replicated again, such a test wouldn't reveal it.
broken enough so that one old row is replicated twice and the new row isn't
replicated, but it seems so unlikely that I don't think that testing the whole
table content is necessary.

> 4b.
> All these messages like "Table t1 should still have 2 rows on the new
> subscriber" don't seem very helpful. e.g. They are not saying anything
> about WHAT this is testing or WHY it should still have 2 rows.

I don't think that those messages are supposed to say what or why something is
tested, just to give quick context / a reference for the test in case it's
broken.  The comments are there to explain in more detail what is tested
and/or why.

> 5.
> # Refresh the subscription, only the missing row on t2 show be replicated
>
> /show/should/

Fixed.




Re: [PATCH] Infinite loop while acquiring new TOAST Oid

2023-04-24 Thread Nikita Malakhov
Hi!

No, it wasn't. It was a proposal, I thought I'd get some feedback on it
before sending it to commitfest.

On Sat, Apr 22, 2023 at 6:17 PM Gurjeet Singh  wrote:

> On Thu, Dec 22, 2022 at 10:07 AM Nikita Malakhov 
> wrote:
> > Any suggestions on the previous message (64-bit toast value ID with
> individual counters)?
>
> Was this patch ever added to CommitFest? I don't see it in the current
> Open Commitfest.
>
> https://commitfest.postgresql.org/43/
>
> Best regards,
> Gurjeet http://Gurje.et
> http://aws.amazon.com
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/