Inconsistency in reporting checkpointer stats

2022-12-13 Thread Nitin Jadhav
Hi,

While working on checkpoint-related code, I noticed an inconsistency in
how the checkpointer stats are reported. When a checkpoint completes, a
"checkpoint complete" message is logged. Among other information, this
message includes the number of buffers written
(CheckpointStats.ckpt_bufs_written). That counter is incremented in two
places: BufferSync() and SlruInternalWritePage(). On the other hand, the
checkpointer stats exposed through pg_stat_bgwriter also report buffers
written (PendingCheckpointerStats.buf_written_checkpoints), but that
counter is incremented in only one place, BufferSync(). So the two
counters can disagree. Please refer to the sample output below.

postgres=# select * from pg_stat_bgwriter;
 checkpoints_timed     | 0
 checkpoints_req       | 1
 checkpoint_write_time | 75
 checkpoint_sync_time  | 176
 buffers_checkpoint    | 4702
 buffers_clean         | 0
 maxwritten_clean      | 0
 buffers_backend       | 4656
 buffers_backend_fsync | 0
 buffers_alloc         | 5023
 stats_reset           | 2022-12-14 07:01:01.494672+00

2022-12-14 07:03:18.052 UTC [6087] LOG:  checkpoint starting:
immediate force wait
2022-12-14 07:03:18.370 UTC [6087] LOG:  checkpoint complete: wrote
4705 buffers (28.7%); 0 WAL file(s) added, 0 removed, 4 recycled;
write=0.075 s, sync=0.176 s, total=0.318 s; sync files=34,
longest=0.159 s, average=0.006 s; distance=66180 kB, estimate=66180
kB; lsn=0/5565E38, redo lsn=0/5565E00


In the output above, the log message reports 4705 buffers written while
buffers_checkpoint shows only 4702; the three missing pages are the SLRU
pages counted only by SlruInternalWritePage(). To fix this,
PendingCheckpointerStats.buf_written_checkpoints should be incremented
in SlruInternalWritePage() as well, mirroring
CheckpointStats.ckpt_bufs_written. I have attached a patch that does
this. Please share your thoughts.
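
For illustration, the shape of the change is roughly this (a sketch of
the relevant spot in slru.c, not necessarily the patch's exact hunk):

	/* In SlruInternalWritePage(), when called as part of a checkpoint */
	if (fdata)
	{
		CheckpointStats.ckpt_bufs_written++;				/* existing */
		PendingCheckpointerStats.buf_written_checkpoints++; /* proposed */
	}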


Thanks & Regards,
Nitin Jadhav


v1-0001-Fix-inconsistency-in-checkpointer-stats.patch
Description: Binary data


Common function for percent placeholder replacement

2022-12-13 Thread Peter Eisentraut
There are a number of places where a shell command is constructed with 
percent-placeholders (like %x).  First, it's obviously cumbersome to 
have to open-code this several times.  Second, each of those pieces of 
code silently encodes some edge case behavior, such as what to do with 
unrecognized placeholders.  (I remember when I last did one of these, I 
stared very hard at the existing code instances to figure out what they 
would do.)  We now also have a newer instance in basebackup_to_shell.c 
that has different behavior in such cases.  (Maybe it's better, but it 
would be good to be explicit and consistent about this.)


This patch factors out this logic into a separate function.  I have 
documented the "old" error handling (which is to not handle unrecognized 
placeholders at all) and brutally converted basebackup_to_shell.c to use 
that.  We could also adopt the new behavior; now there is only a single place to change for that.
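
To make the idea concrete, here is a small self-contained sketch of such
a helper (plain C, illustrative only -- the name, signature, and error
behavior of the function in the patch are not necessarily these):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Replace %-placeholders in "instr".  "letters" lists the recognized
 * placeholder letters and values[i] is the replacement for letters[i];
 * "%%" yields a literal '%', and an unknown or trailing '%' is an error.
 */
static char *
replace_placeholders_sketch(const char *instr, const char *letters,
							const char **values)
{
	size_t		cap = strlen(instr) + 1;
	size_t		len = 0;
	char	   *buf = malloc(cap);
	char		onechar[2] = {0, 0};

	for (const char *c = instr; *c != '\0'; ++c)
	{
		const char *insert;
		const char *p;

		if (*c != '%')
		{
			/* Anything other than '%' is copied verbatim. */
			onechar[0] = *c;
			insert = onechar;
		}
		else if (*++c == '%')
			insert = "%";		/* '%%' is replaced by a single '%' */
		else if (*c == '\0')
		{
			fprintf(stderr, "string ends unexpectedly after \"%%\"\n");
			exit(1);
		}
		else if ((p = strchr(letters, *c)) != NULL)
			insert = values[p - letters];
		else
		{
			fprintf(stderr, "unrecognized placeholder \"%%%c\"\n", *c);
			exit(1);
		}

		while (len + strlen(insert) + 1 > cap)
			buf = realloc(buf, cap *= 2);
		memcpy(buf + len, insert, strlen(insert));
		len += strlen(insert);
	}
	buf[len] = '\0';
	return buf;
}

int
main(void)
{
	const char *values[] = {"0000000100000000000000A9", "pg_wal/tmp"};
	char	   *cmd = replace_placeholders_sketch("cp /archive/%f \"%p\"",
												  "fp", values);

	puts(cmd);		/* cp /archive/0000000100000000000000A9 "pg_wal/tmp" */
	free(cmd);
	return 0;
}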


Note that this is only used for shell commands with placeholders, not 
for other places with placeholders, such as prompts and log line 
prefixes, which would (IMO) need a different API that wouldn't be quite 
as compact.  This is explained in the code comments.

From e0e44dbffafa50bf15bca6a03e98783a82ce072c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 14 Dec 2022 08:17:25 +0100
Subject: [PATCH v1] Common function for percent placeholder replacement

There are a number of places where a shell command is constructed with
percent-placeholders (like %x).  It's cumbersome to have to open-code
this several times.  This factors out this logic into a separate
function.  This also allows us to ensure consistency for and document
some subtle behaviors, such as what to do with unrecognized
placeholders.
---
 .../basebackup_to_shell/basebackup_to_shell.c |  55 +
 src/backend/access/transam/xlogarchive.c  |  45 +---
 src/backend/libpq/be-secure-common.c  |  38 ++
 src/backend/postmaster/shell_archive.c|  59 ++
 src/common/Makefile   |   1 +
 src/common/archive.c  |  81 ++---
 src/common/meson.build|   1 +
 src/common/percentrepl.c  | 108 ++
 src/include/common/percentrepl.h  |  18 +++
 9 files changed, 161 insertions(+), 245 deletions(-)
 create mode 100644 src/common/percentrepl.c
 create mode 100644 src/include/common/percentrepl.h

diff --git a/contrib/basebackup_to_shell/basebackup_to_shell.c 
b/contrib/basebackup_to_shell/basebackup_to_shell.c
index bdaf67a4c8..9a3d57c64b 100644
--- a/contrib/basebackup_to_shell/basebackup_to_shell.c
+++ b/contrib/basebackup_to_shell/basebackup_to_shell.c
@@ -12,6 +12,7 @@
 
 #include "access/xact.h"
 #include "backup/basebackup_target.h"
+#include "common/percentrepl.h"
 #include "miscadmin.h"
 #include "storage/fd.h"
 #include "utils/acl.h"
@@ -208,59 +209,7 @@ static char *
 shell_construct_command(const char *base_command, const char *filename,
 						const char *target_detail)
 {
-	StringInfoData buf;
-	const char *c;
-
-	initStringInfo(&buf);
-	for (c = base_command; *c != '\0'; ++c)
-	{
-		/* Anything other than '%' is copied verbatim. */
-		if (*c != '%')
-		{
-			appendStringInfoChar(&buf, *c);
-			continue;
-		}
-
-		/* Any time we see '%' we eat the following character as well. */
-		++c;
-
-		/*
-		 * The following character determines what we insert here, or may
-		 * cause us to throw an error.
-		 */
-		if (*c == '%')
-		{
-			/* '%%' is replaced by a single '%' */
-			appendStringInfoChar(&buf, '%');
-		}
-		else if (*c == 'f')
-		{
-			/* '%f' is replaced by the filename */
-			appendStringInfoString(&buf, filename);
-		}
-		else if (*c == 'd')
-		{
-			/* '%d' is replaced by the target detail */
-			appendStringInfoString(&buf, target_detail);
-		}
-		else if (*c == '\0')
-		{
-			/* Incomplete escape sequence, expected a character afterward */
-			ereport(ERROR,
-					errcode(ERRCODE_SYNTAX_ERROR),
-					errmsg("shell command ends unexpectedly after escape character \"%%\""));
-		}
-		else
-		{
-			/* Unknown escape sequence */
-			ereport(ERROR,
-					errcode(ERRCODE_SYNTAX_ERROR),
-					errmsg("shell command contains unexpected escape

Re: pg_upgrade: Make testing different transfer modes easier

2022-12-13 Thread Peter Eisentraut

On 07.12.22 17:33, Peter Eisentraut wrote:
I think if we want to make this configurable on the fly, an environment 
variable would be much easier, like


     my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';


Here is an updated patch set that incorporates this idea.

From c0f72bb9f50a36bc158943f3a51fbbc749d7f93c Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 14 Dec 2022 07:52:58 +0100
Subject: [PATCH v2 1/2] pg_upgrade: Add --copy option

This option selects the default transfer mode.  Having an explicit
option is handy to make scripts and tests more explicit.  It also
makes it easier to talk about a "copy" mode rather than "the default
mode" or something like that, since until now the default mode didn't
have an externally visible name.

Discussion: 
https://www.postgresql.org/message-id/flat/50a97009-8ff9-ca4d-a0f6-6086a6775a5b%40enterprisedb.com
---
 doc/src/sgml/ref/pgupgrade.sgml | 10 ++
 src/bin/pg_upgrade/option.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 8f7a3025c3..7816b4c685 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -230,6 +230,16 @@ Options
   
  
 
+ 
+  --copy
+  
+   
+Copy files to the new cluster.  This is the default.  (See also
+--link and --clone.)
+   
+  
+ 
+
  
   -?
   --help
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 2939f584b4..51fc7aede0 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -56,6 +56,7 @@ parseCommandLine(int argc, char *argv[])
{"socketdir", required_argument, NULL, 's'},
{"verbose", no_argument, NULL, 'v'},
{"clone", no_argument, NULL, 1},
+   {"copy", no_argument, NULL, 2},
 
{NULL, 0, NULL, 0}
};
@@ -194,6 +195,10 @@ parseCommandLine(int argc, char *argv[])
user_opts.transfer_mode = TRANSFER_MODE_CLONE;
break;
 
+   case 2:
+   user_opts.transfer_mode = TRANSFER_MODE_COPY;
+   break;
+
default:
fprintf(stderr, _("Try \"%s --help\" for more 
information.\n"),
os_info.progname);
@@ -283,6 +288,7 @@ usage(void)
printf(_("  -v, --verbose enable verbose internal 
logging\n"));
printf(_("  -V, --version display version information, 
then exit\n"));
printf(_("  --clone   clone instead of copying 
files to new cluster\n"));
+   printf(_("  --copycopy files to new cluster 
(default)\n"));
printf(_("  -?, --helpshow this help, then 
exit\n"));
printf(_("\n"
 "Before running pg_upgrade you must:\n"

base-commit: 60684dd834a222fefedd49b19d1f0a6189c1632e
-- 
2.38.1

From 49861815c1cf76658a77cf8ade59c4bb61c4064b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 14 Dec 2022 08:02:57 +0100
Subject: [PATCH v2 2/2] pg_upgrade: Make testing different transfer modes
 easier

The environment variable PG_TEST_PG_UPGRADE_MODE can be set to
override the default transfer mode for the pg_upgrade tests.
(Automatically running the pg_upgrade tests for all supported modes
would be too slow.)

Discussion: 
https://www.postgresql.org/message-id/flat/50a97009-8ff9-ca4d-a0f6-6086a6775a5b%40enterprisedb.com
---
 src/bin/pg_upgrade/t/002_pg_upgrade.pl | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl 
b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index add6ea9c34..1d5c80907c 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -12,6 +12,9 @@
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Can be changed to test the other modes.
+my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
+
 # Generate a database with a name made of a range of ASCII characters.
 sub generate_db
 {
@@ -75,6 +78,8 @@ sub filter_dump
 my $dump1_file = "$tempdir/dump1.sql";
 my $dump2_file = "$tempdir/dump2.sql";
 
+note "testing using transfer mode $mode";
+
 # Initialize node to upgrade
 my $oldnode =
   PostgreSQL::Test::Cluster->new('old_node',
@@ -128,6 +133,7 @@ sub filter_dump
# --inputdir points to the path of the input files.
my $inputdir = "$srcdir/src/test/regress";
 
+   note 'running regression tests in old instance';
my $rc =
  system($ENV{PG_REGRESS}
  . " $extra_opts "
@@ -256,6 +262,7 @@ sub filter_dump
'-s', $newnode->host,
'-p', $oldnode->port,
'-P', $newnode->port,
+   $mode,

Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread Amit Kapila
On Wed, Dec 14, 2022 at 9:50 AM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada 
>  wrote:
> >
> > Here are comments on v59 0001, 0002 patches:
>
> Thanks for the comments!
>
> > +void
> > +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> > +{
> > +	while (1)
> > +	{
> > +		SpinLockAcquire(&wshared->mutex);
> > +
> > +		/*
> > +		 * Don't try to increment the count if the parallel apply worker is
> > +		 * taking the stream lock. Otherwise, there would be a race condition
> > +		 * that the parallel apply worker checks there is no pending streaming
> > +		 * block and before it actually starts waiting on a lock, the leader
> > +		 * sends another streaming block and take the stream lock again. In
> > +		 * this case, the parallel apply worker will start waiting for the next
> > +		 * streaming block whereas there is actually a pending streaming block
> > +		 * available.
> > +		 */
> > +		if (!wshared->pa_wait_for_stream)
> > +		{
> > +			wshared->pending_stream_count++;
> > +			SpinLockRelease(&wshared->mutex);
> > +			break;
> > +		}
> > +
> > +		SpinLockRelease(&wshared->mutex);
> > +	}
> > +}
> >
> > I think we should add an assertion to check if we don't hold the stream 
> > lock.
> >
> > I think that waiting for pa_wait_for_stream to be false in a busy loop is 
> > not a
> > good idea. It's not interruptible and there is not guarantee that we can 
> > break
> > from this loop in a short time. For instance, if PA executes
> > pa_decr_and_wait_stream_block() a bit earlier than LA executes
> > pa_increment_stream_block(), LA has to wait for PA to acquire and release 
> > the
> > stream lock in a busy loop. It should not be long in normal cases but the
> > duration LA needs to wait for PA depends on PA, which could be long. Also
> > what if PA raises an error in
> > pa_lock_stream() due to some reasons? I think LA won't be able to detect the
> > failure.
> >
> > I think we should at least make it interruptible and maybe need to add some
> > sleep. Or perhaps we can use the condition variable for this case.
>

Or we can leave out this while (true) logic altogether for the first
version and add a comment explaining the race. Anyway, after a restart
it will probably resolve itself. We can always change this part of the
code later if it really turns out to be problematic.
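
For what it's worth, an interruptible variant could look roughly like
the sketch below, using a condition variable. This is only an
illustration, not code from the patch: it assumes a hypothetical
ConditionVariable field (stream_cv) in ParallelApplyWorkerShared, which
the parallel apply worker would broadcast on after clearing
pa_wait_for_stream.

#include "postgres.h"
#include "storage/condition_variable.h"
#include "storage/spin.h"
#include "utils/wait_event.h"

/* Hypothetical sketch only; stream_cv is an assumed field, not existing. */
void
pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
{
	for (;;)
	{
		bool		incremented = false;

		SpinLockAcquire(&wshared->mutex);
		if (!wshared->pa_wait_for_stream)
		{
			wshared->pending_stream_count++;
			incremented = true;
		}
		SpinLockRelease(&wshared->mutex);

		if (incremented)
			break;

		/* Interruptible sleep, instead of spinning until the flag clears. */
		ConditionVariableSleep(&wshared->stream_cv, WAIT_EVENT_MQ_INTERNAL);
	}
	ConditionVariableCancelSleep();
}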

> Thanks for the analysis, I will research this part.
>
> > ---
> > In worker.c, we have the following common pattern:
> >
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > write change to the file;
> > do some work;
> > break;
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> > pa_send_data();
> >
> > if (winfo->serialize_changes)
> > {
> > do some worker required after writing changes to the file.
> > }
> > :
> > break;
> >
> > IIUC there are two different paths for partial serialization: (a) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE, and (b) where
> > apply_action is TRANS_LEADER_PARTIAL_SERIALIZE and
> > winfo->serialize_changes became true. And we need to match what we do
> > in (a) and (b). Rather than having two different paths for the same case, 
> > how
> > about falling through TRANS_LEADER_PARTIAL_SERIALIZE when we could not
> > send the changes? That is, pa_send_data() just returns false when the 
> > timeout
> > exceeds and we need to switch to serialize changes, otherwise returns true. 
> > If it
> > returns false, we prepare for switching to serialize changes such as 
> > initializing
> > fileset, and fall through TRANS_LEADER_PARTIAL_SERIALIZE case. The code
> > would be like:
> >
> > case TRANS_LEADER_SEND_TO_PARALLEL:
> > ret = pa_send_data();
> >
> > if (ret)
> > {
> > do work for sending changes to PA.
> > break;
> > }
> >
> > /* prepare for switching to serialize changes */
> > winfo->serialize_changes = true;
> > initialize fileset;
> > acquire stream lock if necessary;
> >
> > /* FALLTHROUGH */
> > case TRANS_LEADER_PARTIAL_SERIALIZE:
> > do work for serializing changes;
> > break;
>
> I think the suggestion is to extract the code that switches to serialize
> mode out of pa_send_data(), and then we would need to add that logic to all
> the functions that call pa_send_data(). I am not sure it looks better, as it
> might introduce some more code in each handling function.
>

How about extracting the common code from apply_handle_stream_commit
and apply_handle_stream_prepare to a separate function say
pa_xact_finish_common()? I see there is a lot of common code (unlock
the stream, wait for the finish, store flush location, free worker
info) in both the functions for 

Re: Non-decimal integer literals

2022-12-13 Thread Peter Eisentraut

On 08.12.22 12:16, Peter Eisentraut wrote:

On 29.11.22 21:22, David Rowley wrote:

There seems to be a small bug in the pg_strtointXX functions in the
code that checks that there's at least 1 digit.  This causes 0x to be
a valid representation of zero.  That does not seem to be allowed by
the parser, so I think we should likely reject it in COPY too.
-- probably shouldn't work
postgres=# copy a from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.

0x
\.

COPY 1


Fixed in new patch.  I moved the "require at least one digit" checks 
after the loops over the digits, to make it easier to write one check 
for all bases.
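
As a self-contained illustration of why moving the check after the loop
works (hypothetical code, not the patch itself):

#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch: parse "0x..." requiring at least one digit after the prefix,
 * with the check placed after the digit loop so the same test works for
 * every base (decimal, hex, octal, binary).
 */
static bool
parse_hex_sketch(const char *s, int64_t *result)
{
	const char *p = s;
	const char *firstdigit;
	int64_t		val = 0;

	if (p[0] != '0' || (p[1] != 'x' && p[1] != 'X'))
		return false;
	p += 2;

	firstdigit = p;
	while (isxdigit((unsigned char) *p))
	{
		int			ch = tolower((unsigned char) *p);

		val = val * 16 + (ch >= 'a' ? ch - 'a' + 10 : ch - '0');
		p++;
	}

	/* One check, after the loop: rejects bare "0x" and trailing junk. */
	if (p == firstdigit || *p != '\0')
		return false;

	*result = val;
	return true;
}

int
main(void)
{
	int64_t		v;

	printf("0x1F -> %d\n", (int) parse_hex_sketch("0x1F", &v));	/* 1, v = 31 */
	printf("0x   -> %d\n", (int) parse_hex_sketch("0x", &v));	/* 0: rejected */
	return 0;
}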


This patch also incorporates your changes to the digit analysis 
algorithm.  I didn't check it carefully, but all the tests still pass. ;-)


committed




Re: Shortening the Scope of Critical Section in Creation of New MultiXacts

2022-12-13 Thread Kyotaro Horiguchi
Hello.

In short, the code as-is looks correct.

At Wed, 14 Dec 2022 00:14:34 +, "Bagga, Rishu"  wrote 
in 
>* Critical section from here until caller has written the data into the
>* just-reserved SLRU space; we don't want to error out with a partly

"the data" here means the whole this multi transaction, which includes
members. We shouldn't exit the critical section at least until the
very end of RecordNewMultiXact().

> This makes sense, as we need the multixact state and multixact offset 
> data to be consistent, but once we write data into the offsets, we can 
> end the critical section. Currently we wait until the members data is 
> also written before we end the critical section. 

Why do you think the members are not part of a multixact? A multixact
is not complete without them.

In addition to the above, we cannot simply move the END_CRIT_SECTION()
there, since RecordNewMultiXact() has another caller that doesn't start
a critical section.

By the way, I haven't tested this myself, but..

> This passes regression tests

Maybe if you did the same with an assert-enabled build, you would get
an assertion failure.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use get_call_result_type() more widely

2022-12-13 Thread Bharath Rupireddy
On Tue, Dec 13, 2022 at 9:12 PM Tom Lane  wrote:
>
> Bharath Rupireddy  writes:
> > A review comment in another thread [1] by Michael Paquier about the
> > usage of get_call_result_type() instead of explicit building of
> > TupleDesc made me think about using it more widely. Actually, the
> > get_call_result_type() looks at the function definitions to figure the
> > column names and build the required TupleDesc, usage of which avoids
> > duplication of the column names between pg_proc.dat/function
> > definitions and source code. Also, it saves a good number of LOC ~415
> > [2] and the size of all the object files put together gets reduced by
> > ~4MB, which means, the postgres binary becomes leaner by ~4MB [3].
>
> Saving code is nice, but I'd assume the result is slower, because
> get_call_result_type has to do a pretty substantial amount of work
> to get the data to construct the tupdesc from.  Have you tried to
> quantify how much overhead this'd add?  Which of these functions
> can we safely consider to be non-performance-critical?

AFAICS, most of these functions have no direct source code callers,
they're user-facing functions and not in a hot code path. I measured
the test times of these functions and I don't see much difference [1].

[1]
pg_old_snapshot_time_mapping() - an extension function with no
internal source code callers, no test coverage.
pg_visibility_map_summary() - an extension function with no internal
source code callers, test coverage exists, test times on HEAD:25 ms
PATCHED:25 ms
pg_last_committed_xact() and pg_xact_commit_timestamp_origin() - no
internal source code callers, test coverage exists, test times on
HEAD:10 ms PATCHED:10 ms
pg_get_multixact_members() - no internal source code callers, no test coverage.
pg_prepared_xact() - no internal source code callers, test coverage
exists, test times on HEAD:50 ms, subscription 108 wallclock secs,
recovery 111 wallclock secs PATCHED:48 ms, subscription 110 wallclock
secs, recovery 112 wallclock secs
pg_walfile_name_offset() - no internal source code callers, no test coverage.
pg_get_object_address() - no internal source code callers, test
coverage exists, test times on HEAD:66 ms PATCHED:60 ms
pg_identify_object() - no internal source code callers, test coverage
exists, test times on HEAD:17 ms PATCHED:18 ms
pg_identify_object_as_address() - no internal source code callers,
test coverage exists, test times on HEAD:66 ms PATCHED:60 ms
pg_get_publication_tables() - internal source code callers exist, test
coverage exists, test times on HEAD:159 ms, subscription 108 wallclock
secs PATCHED:167 ms, subscription 110 wallclock secs
pg_sequence_parameters() - no internal source code callers, test
coverage exists, test times on HEAD:96 ms PATCHED:98 ms
ts_token_type_byid(), ts_token_type_byname(), ts_parse_byid() and
ts_parse_byname() - internal source code callers exists, test coverage
exists, test times on HEAD:195 ms, pg_dump 10 wallclock secs
PATCHED:186 ms, pg_dump 10 wallclock secs
aclexplode() - internal callers exists information_schema.sql,
indirect test coverage exists.
pg_timezone_abbrevs() - no internal source code callers, test coverage
exists, test times on HEAD:40 ms PATCHED:36 ms
pg_stat_file() - no internal source code callers, test coverage
exists, test times on HEAD:42 ms PATCHED:46 ms
pg_lock_status() - no internal source code callers, test coverage
exists, test times on HEAD:16 ms PATCHED:22 ms
pg_get_keywords() - no internal source code callers, test coverage
exists, test times on HEAD:129 ms PATCHED:130 ms
pg_get_catalog_foreign_keys() - no internal source code callers, test
coverage exists, test times on HEAD:114 ms PATCHED:111 ms
pg_partition_tree() - no internal source code callers, test coverage
exists, test times on HEAD:30 ms PATCHED:32 ms
pg_stat_get_wal(), pg_stat_get_archiver() and
pg_stat_get_replication_slot() - no internal source code callers, test
coverage exists, test times on HEAD:479 ms PATCHED:483 ms
pg_stat_get_subscription_stats() - no internal source code callers,
test coverage exists, test times on HEAD:subscription 108 wallclock
secs PATCHED:subscription 110 wallclock secs
tsvector_unnest() - no internal source code callers, test coverage
exists, test times on HEAD:26 ms PATCHED:26 ms
ts_setup_firstcall() - test coverage exists, test times on HEAD:195 ms
PATCHED:186 ms
show_all_settings(), pg_control_system(), pg_control_checkpoint(),
pg_control_recovery() and pg_control_init() - test coverage exists,
test times on HEAD:42 ms PATCHED:44 ms
test_predtest() - no internal source code callers, test coverage
exists, test times on HEAD:18 ms PATCHED:18 ms
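
For reference, the pattern being changed in each of these functions is
roughly the following (a generic sketch, not any specific function from
the patch):

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_FUNCTION_INFO_V1(pg_example_stats);

Datum
pg_example_stats(PG_FUNCTION_ARGS)
{
	TupleDesc	tupdesc;

	/*
	 * Before: open-coded TupleDesc construction, duplicating the column
	 * names and types already declared for the function in pg_proc.dat:
	 *
	 *   tupdesc = CreateTemplateTupleDesc(2);
	 *   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name", TEXTOID, -1, 0);
	 *   TupleDescInitEntry(tupdesc, (AttrNumber) 2, "size", INT8OID, -1, 0);
	 *   tupdesc = BlessTupleDesc(tupdesc);
	 */

	/* After: derive the TupleDesc from the declared result type. */
	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	/* ... build and return a heap tuple using tupdesc as before ... */
	PG_RETURN_NULL();
}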

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Schema variables - new implementation for Postgres 15 (typo)

2022-12-13 Thread Erik Rijkers

On 14-12-2022 at 05:54, Pavel Stehule wrote:

Hi

fresh rebase


typo alert:

v20221214-0003-LET-command.patch contains

errmsg("target session varible is of type %s"

('varible' should be 'variable')

Erik




Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-13 Thread Imseih (AWS), Sami
>First of all, I don't think we need to declare ParallelVacuumProgress
>in vacuum.c since it's set and used only in vacuumparallel.c. But I
>don't even think it's a good idea to declare it in vacuumparallel.c as
>a static variable. The primary reason is that it adds things we need
>   to care about. For example, what if we raise an error during index
>vacuum? The transaction aborts but ParallelVacuumProgress still refers
>to something old. Suppose further that the next parallel vacuum
>doesn't launch any workers, the leader process would still end up
>accessing the old value pointed by ParallelVacuumProgress, which
>causes a SEGV. So we need to reset it anyway at the beginning of the
>parallel vacuum. It's easy to fix at this time but once the parallel
>   vacuum code gets more complex, it could forget to care about it.

>IMO VacuumSharedCostBalance and VacuumActiveNWorkers have a different
>story. They are set in vacuumparallel.c and are used in vacuum.c for
>vacuum delay. If they weren't global variables, we would need to pass
>them to every function that could eventually call the vacuum delay
>function. So it makes sense to me to have them as global variables. On
>the other hand, for ParallelVacuumProgress, it's a common pattern that
>ambulkdelete(), amvacuumcleanup() or a common index scan routine like
>btvacuumscan() checks the progress. I don't think index AM needs to
>pass the value down to many of its functions. So it makes sense to me
>to pass it via IndexVacuumInfo.

Thanks for the detailed explanation, and especially for clearing up
my understanding of VacuumSharedCostBalance and VacuumActiveNWorkers.

I now think that passing ParallelVacuumState in IndexVacuumInfo is
the better choice.

Attached version addresses the above and the previous comments.


Thanks

Sami Imseih
Amazon Web Services (AWS)



v17-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch
Description: v17-0001-Add-2-new-columns-to-pg_stat_progress_vacuum.-Th.patch


Re: Direct I/O

2022-12-13 Thread Thomas Munro
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund  wrote:
> On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> > On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby  wrote:
> > > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > > >   io_data_direct = whether to use O_DIRECT for main data files
> > > >   io_wal_direct = ... for WAL
> > > >   io_wal_init_direct = ... for WAL-file initialisation
> > >
> > > You added 3 booleans, but I wonder if it's better to add a string GUC
> > > which is parsed for comma separated strings.

Done as io_direct=data,wal,wal_init.  Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.

> Perhaps we could use the guc assignment hook to transform the input value into
> a bitmask?

Makes sense.  The only tricky question was where to store the GUC.  I
went for fd.c for now, but it doesn't seem quite right...
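
Something of roughly this shape, say (a standalone sketch; the flag
names and error handling are made up, not the patch's):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative flag bits (assumed names). */
#define IO_DIRECT_DATA		(1 << 0)
#define IO_DIRECT_WAL		(1 << 1)
#define IO_DIRECT_WAL_INIT	(1 << 2)

/* Parse "data,wal,wal_init" into a bitmask; the empty string means off. */
static bool
parse_io_direct(const char *value, int *mask)
{
	char		buf[64];
	char	   *tok,
			   *save;

	*mask = 0;
	if (strlen(value) >= sizeof(buf))
		return false;
	strcpy(buf, value);

	for (tok = strtok_r(buf, ",", &save); tok != NULL;
		 tok = strtok_r(NULL, ",", &save))
	{
		if (strcmp(tok, "data") == 0)
			*mask |= IO_DIRECT_DATA;
		else if (strcmp(tok, "wal") == 0)
			*mask |= IO_DIRECT_WAL;
		else if (strcmp(tok, "wal_init") == 0)
			*mask |= IO_DIRECT_WAL_INIT;
		else
			return false;		/* unrecognized list element */
	}
	return true;
}

int
main(void)
{
	int			mask;

	if (parse_io_direct("data,wal_init", &mask))
		printf("mask = 0x%x\n", mask);	/* prints mask = 0x5 */
	return 0;
}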

> > > DIO is slower, but not so much that it can't run under CI.  I suggest to
> > > add an 099 commit to enable the feature during development.
> >
> > Good idea, will do.

Done.  The tests take 2-3x as long depending on the OS.

> Might be worth to additionally have a short tap test that does some basic
> stuff with DIO and leave that enabled? I think it'd be good to have
> check-world exercise DIO on dev machines, to reduce the likelihood of finding
> problems only in CI, which is somewhat painful.

Done.

> > > Note that this fails under linux with fsanitize=align:
> > > ../src/backend/storage/file/buffile.c:117:17: runtime error: member 
> > > access within misaligned address 0x561a4a8e40f8 for type 'struct 
> > > BufFile', which requires 4096 byte alignment
> >
> > Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> > even using direct I/O, but by these rules it would need to be
> > palloc_io_align'd.  I will think about what to do about that...
>
> It might be worth having two different versions of the struct, so we don't
> impose unnecessarily high alignment everywhere?

Done.  I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there.  We might as well use it for any other I/O, such as
frontend code, for a chance of a small performance boost as you
showed.  For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (this could be avoided; a subject for separate
research).
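
For reference, the two unions differ only in the alignment imposed on
the buffer; a simplified sketch of the idea (the real definitions are in
the patch, and PG_IO_ALIGN_SIZE here stands for whatever the patch uses,
e.g. 4096):

	typedef union PGAlignedBlock
	{
		char		data[BLCKSZ];	/* ordinary alignment suffices */
		double		force_align_d;
		int64		force_align_i64;
	} PGAlignedBlock;

	typedef union PGIOAlignedBlock
	{
		pg_attribute_aligned(PG_IO_ALIGN_SIZE)	/* page-aligned for O_DIRECT */
		char		data[BLCKSZ];
		double		force_align_d;
		int64		force_align_i64;
	} PGIOAlignedBlock;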

V2 comprises:

0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
From 9af1dcc3ce36ce18e011183d5f2a97cdc07fe396 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 18 Oct 2022 09:47:45 -0700
Subject: [PATCH v2 1/4] Add allocator support for larger allocation alignment
 & use for IO

---
 src/backend/utils/cache/catcache.c   |   5 +-
 src/backend/utils/mmgr/Makefile  |   1 +
 src/backend/utils/mmgr/alignedalloc.c| 110 ++
 src/backend/utils/mmgr/mcxt.c| 141 +--
 src/backend/utils/mmgr/meson.build   |   1 +
 src/include/utils/memutils_internal.h|  13 ++-
 src/include/utils/memutils_memorychunk.h |   2 +-
 src/include/utils/palloc.h   |   3 +
 8 files changed, 263 insertions(+), 13 deletions(-)
 create mode 100644 src/backend/utils/mmgr/alignedalloc.c

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 30ef0ba39c..9e635177c8 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -763,7 +763,6 @@ InitCatCache(int id,
 {
 	CatCache   *cp;
 	MemoryContext oldcxt;
-	size_t		sz;
 	int			i;
 
 	/*
@@ -807,8 +806,8 @@ InitCatCache(int id,
 	 *
 	 * Note: we rely on zeroing to initialize all the dlist headers correctly
 	 */
-	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
-	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
+	cp = (CatCache *) palloc_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE,
+	 MCXT_ALLOC_ZERO);
 	cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
 
 	/*
diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile
index 3b4cfdbd52..dae3432c98 100644
--- a/src/backend/utils/mmgr/Makefile
+++ b/src/backend/utils/mmgr/Makefile
@@ -13,6 +13,7 @@ top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = \
+	alignedalloc.o \
 	aset.o \
 	dsa.o \
 	freepage.o \
diff --git a/src/backend/utils/mmgr/alignedalloc.c b/src/backend/utils/mmgr/alignedalloc.c
new file mode 100644
index 00..97cb1d2b0d
--- /dev/null

RE: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread houzj.f...@fujitsu.com
On Tuesday, December 13, 2022 11:25 PM Masahiko Sawada  
wrote:
> 
> On Sun, Dec 11, 2022 at 8:45 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Friday, December 9, 2022 3:14 PM Amit Kapila
>  wrote:
> > >
> > > On Thu, Dec 8, 2022 at 12:37 PM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > >
> > > Review comments
> >
> > Thanks for the comments!
> >
> > > ==
> > > 1. Currently, we don't release the stream lock in LA (leader apply
> > > worker) for "rollback to savepoint" and the reason is mentioned in
> > > comments of
> > > apply_handle_stream_abort() in the patch. But, today, while testing,
> > > I found that can lead to deadlock which otherwise, won't happen on
> > > the publisher. The key point is rollback to savepoint releases the
> > > locks acquired by the particular subtransaction, so parallel apply
> > > worker should also do the same. Consider the following example where
> > > the transaction in session-1 is being performed by the parallel
> > > apply worker and the transaction in session-2 is being performed by the
> leader apply worker. I have simulated it by using GUC force_stream_mode.
> > > Publisher
> > > ==
> > > Session-1
> > > postgres=# begin;
> > > BEGIN
> > > postgres=*# savepoint s1;
> > > SAVEPOINT
> > > postgres=*# truncate t1;
> > > TRUNCATE TABLE
> > >
> > > Session-2
> > > postgres=# begin;
> > > BEGIN
> > > postgres=*# insert into t1 values(4);
> > >
> > > Session-1
> > > postgres=*# rollback to savepoint s1; ROLLBACK
> > >
> > > Session-2
> > > Commit;
> > >
> > > With or without commit of Session-2, this scenario will lead to
> > > deadlock on the subscriber because PA (parallel apply worker) is
> > > waiting for LA to send the next command, and LA is blocked by
> > > Exclusive of PA. There is no deadlock on the publisher because
> > > rollback to savepoint will release the lock acquired by truncate.
> > >
> > > To solve this, How about if we do three things before sending abort
> > > of sub-transaction (a) unlock the stream lock, (b) increment
> > > pending_stream_count,
> > > (c) take the stream lock again?
> > >
> > > Now, if the PA is not already waiting on the stop, it will not wait
> > > at stream_stop but will wait after applying abort of sub-transaction
> > > and if it is already waiting at stream_stop, the wait will be
> > > released. If this works then probably we should try to do (b) before (a) 
> > > to
> match the steps with stream_start.
> >
> > The solution works for me, I have changed the code as suggested.
> >
> >
> > > 2. There seems to be another general problem in the way the patch
> > > waits for stream_stop in PA (parallel apply worker). Currently, PA
> > > checks, if there are no more pending streams then it tries to wait
> > > for the next stream by waiting on a stream lock. However, it is
> > > possible after PA checks there is no pending stream and before it
> > > actually starts waiting on a lock, the LA sends another stream for
> > > which even stream_stop is sent, in this case, PA will start waiting
> > > for the next stream whereas there is actually a pending stream
> > > available. In this case, it won't lead to any problem apart from
> > > delay in applying the changes in such cases but for the case mentioned in
> the previous point (Pont 1), it can lead to deadlock even after we implement 
> the
> solution proposed to solve it.
> >
> > Thanks for reporting, I have introduced another flag in shared memory
> > and use it to prevent the leader from incrementing the
> > pending_stream_count if the parallel apply worker is trying to lock the 
> > stream
> lock.
> >
> >
> > > 3. The other point to consider is that for
> > > stream_commit/prepare/abort, in LA, we release the stream lock after
> > > sending the message whereas for stream_start we release it before
> > > sending the message. I think for the earlier cases
> > > (stream_commit/prepare/abort), the patch has done like this because
> > > pa_send_data() may need to require the lock again when it times out
> > > and start serializing, so there will be no sense in first releasing
> > > it, then re-acquiring it, and then again releasing it. Can't we also
> > > release the lock for stream_start after
> > > pa_send_data() only if it is not switched to serialize mode?
> >
> > Changed.
> >
> > Attach the new version patch set which addressed above comments.
> 
> Here are comments on v59 0001, 0002 patches:

Thanks for the comments!

> +void
> +pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
> +{
> +	while (1)
> +	{
> +		SpinLockAcquire(&wshared->mutex);
> +
> +		/*
> +		 * Don't try to increment the count if the parallel apply worker is
> +		 * taking the stream lock. Otherwise, there would be a race condition
> +		 * that the parallel apply worker checks there is no pending streaming
> +		 * block and before it actually starts waiting on a lock, the leader
> +   

Re: Avoid extra "skipping" messages from VACUUM/ANALYZE

2022-12-13 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 06:29:56PM -0800, Jeff Davis wrote:
> Right now, if an unprivileged user issues VACUUM/ANALYZE (without
> specifying a table), it will emit messages for each relation that it
> skips, including indexes, views, and other objects that can't be a
> direct target of VACUUM/ANALYZE anyway. Attached patch causes it to
> check the type of object first, and then check privileges second.

This also seems to be the case when a table name is specified:

postgres=# CREATE TABLE test (a INT);
CREATE TABLE
postgres=# CREATE INDEX ON test (a);
CREATE INDEX
postgres=# CREATE ROLE myuser;
CREATE ROLE
postgres=# SET ROLE myuser;
SET
postgres=> VACUUM test_a_idx;
WARNING:  permission denied to vacuum "test_a_idx", skipping it
VACUUM

Granted, this likely won't create as much noise as a database-wide VACUUM,
but perhaps we could add a relkind check in expand_vacuum_rel() and swap
the checks in vacuum_rel()/analyze_rel(), too.  I don't know if it's worth
the trouble, though.

> Found while reviewing the MAINTAIN privilege patch. Implemented with
> his suggested fix. I intend to commit soon.

LGTM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-13 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 07:05:10PM -0800, Jeff Davis wrote:
> Committed.
> 
> The only significant change is that it also allows LOCK TABLE if you
> have the MAINTAIN privilege.

Thanks!

> I noticed a couple issues unrelated to your patch, and started separate
> patch threads[1][2].

Will take a look.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-13 Thread Jeff Davis
On Mon, 2022-12-12 at 13:01 -0800, Nathan Bossart wrote:
> On Mon, Dec 12, 2022 at 12:04:27PM -0800, Nathan Bossart wrote:
> > Patch attached.  I ended up reverting some parts of the
> > VACUUM/ANALYZE
> > patch that were no longer needed (i.e., if the user doesn't have
> > permission
> > to VACUUM, we don't need to separately check whether the user has
> > permission to ANALYZE).  Otherwise, I don't think there's anything
> > tremendously different between v1 and v2 besides the fact that all
> > the
> > privileges are grouped together.
> 
> Here is a v3 of the patch that fixes a typo in the docs.

Committed.

The only significant change is that it also allows LOCK TABLE if you
have the MAINTAIN privilege.

I noticed a couple issues unrelated to your patch, and started separate
patch threads[1][2].

[1] 
https://www.postgresql.org/message-id/c0a85c2e83158560314b576b6241c8ed0aea1745.ca...@j-davis.com
[2]
https://www.postgresql.org/message-id/9550c76535404a83156252b25a11babb4792ea1e.ca...@j-davis.com

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Rework confusing permissions for LOCK TABLE

2022-12-13 Thread Jeff Davis
The existing permissions for LOCK TABLE are surprising/confusing. For
instance, if you have UPDATE privileges on a table, you can lock in any
mode *except* ACCESS SHARE.

  drop table x cascade;
  drop user u1;
  create user u1;
  create table x(i int);
  grant update on x to u1;

  set session authorization u1;
  begin;
  lock table x in access exclusive mode; -- succeeds
  commit;
  begin;
  lock table x in share mode; -- succeeds
  commit;
  begin;
  lock table x in access share mode; -- fails
  commit;

I can't think of any reason for this behavior, and I didn't find an
obvious answer in the last commits to touch that (2ad36c4e44,
fa2642438f).

Patch attached to simplify it. It uses the philosophy that, if you have
permissions to lock at a given mode, you should be able to lock at
strictly less-conflicting modes as well.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From 53e27cfbe74f6b943fcf7969fb25ddf765219ff5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 13 Dec 2022 17:41:55 -0800
Subject: [PATCH v1] Rework permissions for LOCK TABLE.

The prior behavior was confusing and hard to document. For instance,
if you had UPDATE privileges, you could lock a table in any lock mode
except ACCESS SHARE mode.

Now, if granted a privilege to lock at a given mode, one also has
privileges to lock at a less-conflicting mode. MAINTAIN, UPDATE,
DELETE, and TRUNCATE privileges allow any lock mode. INSERT privileges
allow ROW EXCLUSIVE (or below). SELECT privileges allow ACCESS SHARE.
---
 doc/src/sgml/ref/lock.sgml   | 25 -
 src/backend/commands/lockcmds.c  | 18 +++
 src/test/regress/expected/privileges.out | 66 ++--
 src/test/regress/sql/privileges.sql  | 61 +++---
 4 files changed, 79 insertions(+), 91 deletions(-)

diff --git a/doc/src/sgml/ref/lock.sgml b/doc/src/sgml/ref/lock.sgml
index d9c5bf9a1d..8524182211 100644
--- a/doc/src/sgml/ref/lock.sgml
+++ b/doc/src/sgml/ref/lock.sgml
@@ -165,18 +165,19 @@ LOCK [ TABLE ] [ ONLY ] name [ * ]
   Notes
 

-To lock a table, one must ordinarily have the MAINTAIN
-privilege on the table or be the table's owner, a superuser, or a role
-with privileges of the
-pg_maintain
-role. LOCK TABLE ... IN ACCESS SHARE MODE is allowed
-with SELECT privileges on the target
-table.  LOCK TABLE ... IN ROW EXCLUSIVE MODE is allowed
-with INSERT, UPDATE, DELETE,
-or TRUNCATE privileges on the target table. All other
-forms of LOCK are allowed with
-table-level UPDATE, DELETE,
-or TRUNCATE privileges.
+To lock a table, the user must have the right privilege for the specified
+lockmode, or be the table's
+owner, a superuser, or a role with privileges of the pg_maintain
+role. If the user has MAINTAIN,
+UPDATE, DELETE, or
+TRUNCATE privileges on the table, any lockmode is permitted. If the user has
+INSERT privileges on the table, ROW EXCLUSIVE
+MODE (or a less-conflicting mode as described in ) is permitted. If a user has
+SELECT privileges on the table, ACCESS SHARE
+MODE is permitted.

 

diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index e294efc67c..fceefafbce 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -292,16 +292,16 @@ LockTableAclCheck(Oid reloid, LOCKMODE lockmode, Oid userid)
 	AclResult	aclresult;
 	AclMode		aclmask;
 
-	/* Verify adequate privilege */
-	if (lockmode == AccessShareLock)
-		aclmask = ACL_SELECT;
-	else if (lockmode == RowExclusiveLock)
-		aclmask = ACL_INSERT | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE;
-	else
-		aclmask = ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE;
+	/* any of these privileges permit any lock mode */
+	aclmask = ACL_MAINTAIN | ACL_UPDATE | ACL_DELETE | ACL_TRUNCATE;
+
+	/* SELECT privileges also permit ACCESS SHARE and below */
+	if (lockmode <= AccessShareLock)
+		aclmask |= ACL_SELECT;
 
-	/* MAINTAIN privilege allows all lock modes */
-	aclmask |= ACL_MAINTAIN;
+	/* INSERT privileges also permit ROW EXCLUSIVE and below */
+	if (lockmode <= RowExclusiveLock)
+		aclmask |= ACL_INSERT;
 
 	aclresult = pg_class_aclcheck(reloid, userid, aclmask);
 
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 169b364b22..58d9112ee8 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -2663,13 +2663,13 @@ CREATE TABLE lock_table (a int);
 GRANT SELECT ON lock_table TO regress_locktable_user;
 SET SESSION AUTHORIZATION regress_locktable_user;
 BEGIN;
+LOCK TABLE lock_table IN ACCESS SHARE MODE; -- should pass
+COMMIT;
+BEGIN;
 LOCK TABLE lock_table IN ROW EXCLUSIVE MODE; -- should fail
 ERROR:  permission denied for table lock_table
 ROLLBACK;
 BEGIN;
-LOCK TABLE lock_table IN ACCESS SHARE MODE; -- should pass
-COMMIT;
-BEGIN;
 LOCK TABLE lock_table IN ACCESS EXCLUSIVE MODE; -- should fail
 ERROR:  permission 

Re: allowing for control over SET ROLE

2022-12-13 Thread Michael Paquier
On Mon, Nov 21, 2022 at 10:45:53AM -0500, Robert Haas wrote:
> Seems like a good idea but I'm not sure about this hunk:
> 
>   TailMatches("GRANT|REVOKE", "ALTER", "SYSTEM") ||
> - TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "ALTER", "SYSTEM"))
> + TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "ALTER", "SYSTEM") ||
> + TailMatches("REVOKE", "GRANT", "OPTION", "FOR", "SET"))
> 
> That might be a correct change for other reasons, but it doesn't seem
> related to this patch. The rest looks good.

(Forgot to press "Send" a few days ago..)

Hmm, right, I see your point.  I had just moved that to reorder the
terms alphabetically, but moving the check on REVOKE GRANT OPTION FOR
SET is not mandatory.  I have moved it back to its previous position,
leading to less noise in the diffs, and applied the rest as of 9d0cf57.
Thanks!
--
Michael




Refactor SCRAM code to dynamically handle hash type and key length

2022-12-13 Thread Michael Paquier
Hi all,
(Adding Daniel and Jonathan per recent threads)

While investigating on what it would take to extend SCRAM to use new
hash methods (say like the RFC draft for SCRAM-SHA-512), I have been
quickly reminded of the limitations created by SCRAM_KEY_LEN, which is
the key length that we use in the HMAC and hash computations when
creating a SCRAM verifier or when doing a SASL exchange.

Back in v10 when SCRAM was implemented, we kept the implementation
simple and took the decision to rely heavily on buffers with a static
size of SCRAM_KEY_LEN during the exchanges.  This was a good choice
back then, because we did not really have a way to handle errors and
there was no need to worry about things like OOMs or even SHA
computation errors.  It was also incorrect in its own way, because we
failed to go through OpenSSL for the hash/HMAC computations, which
would be an issue for FIPS certification.  This led to the
introduction of the new cryptohash and HMAC APIs, and the SCRAM code
has been changed so that it is able to know and pass through its layers
any errors (OOM, hash/MAC computation, OpenSSL issues), as of 87ae969
and e6bdfd9.

Taking advantage of all the error stack and logic introduced
previously, it becomes rather straightforward to remove the hardcoded
assumptions behind SHA-256 in the SCRAM code path.  Attached is a
patch that does so:
- SCRAM_KEY_LEN is removed from all the internal SCRAM routines; it is
replaced by logic where the hash type and the key length are stored in
fe_scram_state for the frontend and scram_state for the backend.
- The frontend assigns the hash type and the key length depending on
its choice of SASL mechanism in scram_init()@fe-auth-scram.c.
- The backend assigns the hash type and the key length based on the
parsed SCRAM entry from pg_authid, which works nicely when we need to
handle a raw password for a SASL exchange, for example.  That's
basically what we do now, but scram_state is changed to work
through it.
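
In other words, the exchange state ends up carrying something like this
(a sketch of the idea with assumed field names, not the patch verbatim):

	/* sketch; the frontend's fe_scram_state gains equivalent fields */
	typedef struct
	{
		pg_cryptohash_type hash_type;	/* PG_SHA256 today, more later */
		int			key_length;			/* digest length of hash_type */

		/* ... existing exchange state, buffers sized by key_length ... */
	} scram_state;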

We have currently on HEAD 68 references to SCRAM_KEY_LEN.  This brings
down these references to 6, that cannot really be avoided because we
still need to handle SCRAM-SHA-256 one way or another:
- When parsing a SCRAM password from pg_authid (to get a password type
or fill in scram_state in the backend).
- For the mock authentication, where SHA-256 is forced.  We are going
to fail anyway, so any hash would be fine as long as we just let the
user know about the failure transparently.
- When initializing the exchange in libpq based on the SASL mechanism
choice.
- scram-common.h itself.
- When building a verifier in the be-fe interfaces.  These could be
changed as well but I did not see a point in doing so yet.
- SCRAM_KEY_LEN is renamed to SCRAM_SHA_256_KEY_LEN to reflect its
dependency to SHA-256.

With this patch in place, the internals of SCRAM for SASL are
basically able to handle any hash methods.  There is much more to
consider like how we'd need to treat uaSCRAM (separate code for more
hash methods or use the same) or HBA entries, but this removes what I
consider to be 70%~80% of the pain in terms of extensibility with the
current code, and this is enough to be a submission on its own to move
towards more methods.  I am planning to tackle more things in terms of
pluggability after what's done here, btw :)

This patch passes check-world and the CI is green.  I have tested as
well the patch with SCRAM verifiers coming from a server initially on
HEAD, so it looks pretty solid seen from here, being careful of memory
leaks in the frontend, mainly.

Thoughts or comments?
--
Michael
From 378c86619933d9c712730e3d6a105a79854660cf Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 14 Dec 2022 11:35:33 +0900
Subject: [PATCH] Remove dependency to hash type and key length in internal
 SCRAM code

SCRAM_KEY_LEN had a hard dependency on SHA-256, making difficult the
addition of more hash methods in SCRAM with many statically-sized
buffers, as one problem.
---
 src/include/common/scram-common.h|  25 ++--
 src/include/libpq/scram.h|   8 +-
 src/backend/libpq/auth-scram.c   | 165 ++-
 src/backend/libpq/crypt.c|  10 +-
 src/common/scram-common.c| 189 ++-
 src/interfaces/libpq/fe-auth-scram.c | 175 +
 6 files changed, 380 insertions(+), 192 deletions(-)

diff --git a/src/include/common/scram-common.h b/src/include/common/scram-common.h
index 4acf2a78ad..5b647e4b81 100644
--- a/src/include/common/scram-common.h
+++ b/src/include/common/scram-common.h
@@ -21,7 +21,7 @@
 #define SCRAM_SHA_256_PLUS_NAME "SCRAM-SHA-256-PLUS"	/* with channel binding */
 
 /* Length of SCRAM keys (client and server) */
-#define SCRAM_KEY_LEN			PG_SHA256_DIGEST_LENGTH
+#define SCRAM_SHA_256_KEY_LEN	PG_SHA256_DIGEST_LENGTH
 
 /*
  * Size of random nonce generated in the authentication exchange.  This
@@ -43,17 +43,22 @@
  */
 #define SCRAM_DEFAULT_ITERATIONS	4096

Avoid extra "skipping" messages from VACUUM/ANALYZE

2022-12-13 Thread Jeff Davis
Right now, if an unprivileged user issues VACUUM/ANALYZE (without
specifying a table), it will emit messages for each relation that it
skips, including indexes, views, and other objects that can't be a
direct target of VACUUM/ANALYZE anyway. Attached patch causes it to
check the type of object first, and then check privileges second.

Found while reviewing the MAINTAIN privilege patch. Implemented with
the fix suggested by that patch's author. I intend to commit soon.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From e1f0c310175285945ca0742b92fc70e2405cb831 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 13 Dec 2022 17:37:22 -0800
Subject: [PATCH v1] Clean up useless "skipping" messages for VACUUM/ANALYZE.

When VACUUM/ANALYZE are run on an entire database, it warns of
skipping relations for which the user doesn't have sufficient
privileges. That only makes sense for tables, so skip such messages
for indexes, etc.
---
 src/backend/commands/vacuum.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..bb4847dfa0 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -874,10 +874,6 @@ get_all_vacuum_rels(int options)
 		MemoryContext oldcontext;
 		Oid			relid = classForm->oid;
 
-		/* check permissions of relation */
-		if (!vacuum_is_permitted_for_relation(relid, classForm, options))
-			continue;
-
 		/*
 		 * We include partitioned tables here; depending on which operation is
 		 * to be performed, caller will decide whether to process or ignore
@@ -888,6 +884,10 @@ get_all_vacuum_rels(int options)
 			classForm->relkind != RELKIND_PARTITIONED_TABLE)
 			continue;
 
+		/* check permissions of relation */
+		if (!vacuum_is_permitted_for_relation(relid, classForm, options))
+			continue;
+
 		/*
 		 * Build VacuumRelation(s) specifying the table OIDs to be processed.
 		 * We omit a RangeVar since it wouldn't be appropriate to complain
-- 
2.34.1



Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-13 Thread Masahiko Sawada
On Tue, Dec 13, 2022 at 1:40 PM Imseih (AWS), Sami  wrote:
>
> Thanks for the feedback. I agree with the feedback, except
> for
>
> >need to have ParallelVacuumProgress. I see
> >parallel_vacuum_update_progress() uses this value but I think it's
> >better to pass ParallelVacuumState to via IndexVacuumInfo.
>
> I was trying to avoid passing a pointer to
> ParallelVacuumState in IndexVacuuminfo.
>
> ParallelVacuumProgress is implemented in the same
> way as VacuumSharedCostBalance and
> VacuumActiveNWorkers. See vacuum.h
>
> These values are reset at the start of a parallel vacuum cycle
> and reset at the end of an index vacuum cycle.
>
> This seems like a better approach and less invasive.
> What would be a reason not to go with this approach?

First of all, I don't think we need to declare ParallelVacuumProgress
in vacuum.c since it's set and used only in vacuumparallel.c. But I
don't even think it's a good idea to declare it in vacuumparallel.c as
a static variable. The primary reason is that it adds things we need
to care about. For example, what if we raise an error during index
vacuum? The transaction aborts but ParallelVacuumProgress still refers
to something old. Suppose further that the next parallel vacuum
doesn't launch any workers; the leader process would then end up
accessing the old value pointed to by ParallelVacuumProgress, which
causes a SEGV. So we need to reset it anyway at the beginning of the
parallel vacuum. It's easy to fix at this time, but once the parallel
vacuum code gets more complex, we could forget to take care of it.

IMO VacuumSharedCostBalance and VacuumActiveNWorkers have a different
story. They are set in vacuumparallel.c and are used in vacuum.c for
vacuum delay. If they weren't global variables, we would need to pass
them to every function that could eventually call the vacuum delay
function. So it makes sense to me to have them as global variables. On
the other hand, for ParallelVacuumProgress, it's a common pattern that
ambulkdelete(), amvacuumcleanup() or a common index scan routine like
btvacuumscan() checks the progress. I don't think index AM needs to
pass the value down to many of its functions. So it makes sense to me
to pass it via IndexVacuumInfo.

Having said that, I'd like to hear opinions also from other hackers, I
might be wrong and it's more invasive as you pointed out.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-13 Thread Kyotaro Horiguchi
At Tue, 13 Dec 2022 17:05:35 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila  
> > wrote in
> Yeah, I think ideally it will timeout but if we have a solution like
> during delay, we keep sending ping messages time-to-time, it should
> work fine. However, that needs to be verified. Do you see any reasons
> why that won't work?

Ah. By "I'm not sure" I meant "I have no clear idea whether".

I looked into this a bit further. ProcessPendingWrites() ultimately
waits for the streaming socket to become writable, so no critical
problem is found here. That being said, it might be better if
ProcessPendingWrites() refrained from sending consecutive keepalives
while waiting: a 30s ping timeout with a 1h delay may result in 120
successive pings. It might not be a big deal, but..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Raising the SCRAM iteration count

2022-12-13 Thread Michael Paquier
On Tue, Dec 13, 2022 at 12:17:58PM +0100, Daniel Gustafsson wrote:
> It does raise an interesting point though, if we in the future add support for
> SCRAM-SHA-512 (which seems reasonable to do) it's not good enough to have a
> single GUC for SCRAM iterations; we'd need to be able to set the iteration
> count per algorithm. I'll account for that when updating the patch downthread.

So, you mean that the GUC should be named like password_iterations,
taking a grammar with a list like 'scram-sha-256=4096,algo2=5000'?
--
Michael




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-13 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 07:20:14PM -0500, Tom Lane wrote:
> I certainly don't think that "wake the apply launcher every 1ms"
> is a sane configuration.  Unless I'm missing something basic about
> its responsibilities, it should seldom need to wake at all in
> normal operation.

This parameter appears to control how often the apply launcher starts new
workers.  If it starts new workers in a loop iteration, it updates its
last_start_time variable, and it won't start any more workers until another
wal_retrieve_retry_interval has elapsed.  If no new workers need to be
started, it only wakes up every 3 minutes.
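
In other words, the launcher's main loop looks roughly like this (a
paraphrased sketch of ApplyLauncherMain, not the literal code):

	for (;;)
	{
		long		wait_time = DEFAULT_NAPTIME_PER_CYCLE;	/* ~3 minutes */
		TimestampTz now = GetCurrentTimestamp();

		if (TimestampDifferenceExceeds(last_start_time, now,
									   wal_retrieve_retry_interval))
		{
			/* start apply workers for any subscriptions that need one */
			last_start_time = now;
		}
		else
			wait_time = wal_retrieve_retry_interval;	/* recheck sooner */

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
		ResetLatch(MyLatch);
	}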

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Make ON_ERROR_STOP stop on shell script failure

2022-12-13 Thread David Zhang

On 2022-10-12 2:13 a.m., bt22nakamorit wrote:

There was a mistake in the error message for \! so I updated the patch.

I tried to apply this patch to the master branch, but got an error like 
the one below.

$ git apply --check patch-view.diff
error: patch failed: src/bin/psql/command.c:2693
error: src/bin/psql/command.c: patch does not apply

I think there are some tests related to "ON_ERROR_STOP" in 
src/bin/psql/t/001_basic.pl, and we should consider adding corresponding 
test cases for "on/off/shell/all" to this patch.



Best regards,

David






Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-13 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, Dec 13, 2022 at 06:32:08PM -0500, Tom Lane wrote:
>> I've not chased it further than that, but I venture that the apply
>> launcher also needs a kick in the pants, and/or there needs to be
>> an interlock to ensure that it doesn't wake until after the old
>> apply worker quits.

> This is probably because the tests set wal_retrieve_retry_interval to
> 500ms.  Lowering that to 1ms in Cluster.pm seems to wipe out this
> particular wait, and the total src/test/subscription test time drops from
> 119 seconds to 95 seconds on my machine.

That's not really the direction we should be going in, though.  Ideally
there should be *no* situation where we are waiting for a timeout to
elapse for a process to wake up and notice it ought to do something.
If we have timeouts at all, they should be backstops for the possibility
of a lost interrupt, and it should be possible to set them quite high
without any visible impact on normal operation.  (This gets back to
the business about minimizing idle power consumption, which Simon was
bugging us about recently but that's been on the radar screen for years.)

I certainly don't think that "wake the apply launcher every 1ms"
is a sane configuration.  Unless I'm missing something basic about
its responsibilities, it should seldom need to wake at all in
normal operation.

regards, tom lane




Shortening the Scope of Critical Section in Creation of New MultiXacts

2022-12-13 Thread Bagga, Rishu
Hi all, 

I have been going through the multixact code over the last few 
weeks, and noticed a potential discrepancy between the need for critical 
sections in the creation of new multixacts and the current code.

As per the comment in GetNewMultiXact:


/*
 * Critical section from here until caller has written the data into the
 * just-reserved SLRU space; we don't want to error out with a partly
 * written MultiXact structure.  (In particular, failing to write our
 * start offset after advancing nextMXact would effectively corrupt the
 * previous MultiXact.)
 */
START_CRIT_SECTION()


This makes sense, as we need the multixact state and multixact offset 
data to be consistent, but once we write data into the offsets, we can 
end the critical section. Currently we wait until the members data is 
also written before we end the critical section. 


I’ve attached a patch that moves the end of the critical section into 
RecordNewMultiXact, right after we finish writing data into the offsets 
cache.

This passes regression tests, as well as some custom testing around 
injecting random failures while writing multixact members, and 
restarting.

I would appreciate any feedback on this.


Sincerely, 

Rishu Bagga, Amazon Web Services (AWS)



move_multixact_critical_section.patch
Description: move_multixact_critical_section.patch


Re: Proposal for internal Numeric to Uint64 conversion function.

2022-12-13 Thread Ian Lawrence Barwick
On Mon, Oct 3, 2022 at 1:55 Andres Freund  wrote:
>
> Hi,
>
> The patch for this CF entry doesn't apply currently, and it looks like that's
> been the case for quite a while...

As that's remained the case and the last update from the author was in
May, we'll close this as "returned with feedback". A new entry can
always be created in a future commitfest if work on the patch is
resumed.

Regards

Ian Barwick




Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-13 Thread Nathan Bossart
On Tue, Dec 13, 2022 at 06:32:08PM -0500, Tom Lane wrote:
> Before, there was up to 1 second (with multiple "SELECT count(1) = 0"
> probes from the test script) between the ALTER SUBSCRIPTION command
> and the "apply worker will restart" log entry.  That wait is pretty
> well zapped, but instead now we're waiting hundreds of ms for the
> "apply worker has started" message.
> 
> I've not chased it further than that, but I venture that the apply
> launcher also needs a kick in the pants, and/or there needs to be
> an interlock to ensure that it doesn't wake until after the old
> apply worker quits.

This is probably because the tests set wal_retrieve_retry_interval to
500ms.  Lowering that to 1ms in Cluster.pm seems to wipe out this
particular wait, and the total src/test/subscription test time drops from
119 seconds to 95 seconds on my machine.  This probably lowers the amount
of test coverage we get on the wal_retrieve_retry_interval code paths, but
if that's a concern, perhaps we should write a test specifically for
wal_retrieve_retry_interval.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: [PATCH] random_normal function

2022-12-13 Thread Paul Ramsey


> On Dec 9, 2022, at 3:20 PM, Paul Ramsey  wrote:
> 
> 
> 
>> On Dec 9, 2022, at 9:17 AM, Paul Ramsey  wrote:
>> 
>> 
>>> On Dec 8, 2022, at 8:29 PM, Michael Paquier  wrote:
>>> 
>>> On Thu, Dec 08, 2022 at 04:44:56PM -0800, Paul Ramsey wrote:
 Final time, with fixes from cirrusci.
>>> 
>>> Well, why not.  Seems like you would use that a lot with PostGIS.
>>> 
>>> #include <math.h>  /* for ldexp() */
>>> +#include <float.h> /* for DBL_EPSILON */
>>> And be careful with the order here.
>> 
>> Should be ... alphabetical?
>> 
>>> +static void
>>> +drandom_check_default_seed()
>>> We always use (void) rather than empty parenthesis sets.
>> 
>> OK
>> 
>>> I would not leave that unchecked, so I think that you should add
>>> something in random.sql. Or would you prefer that some of
>>> the regression tests be switched so that they use the new normal
>>> function?
>> 
>> Reading through those tests... seems like they will (rarely) fail. Is 
>> that... OK? 
>> The tests seem to be mostly worried that random() starts returning 
>> constants, which seems like a good thing to test for (is the random number 
>> generator returning randomness).
>> An obvious test for this function is that the mean and stddev converge on 
>> the supplied parameters, given enough inputs, which is actually kind of the 
>> opposite test. I use the same random number generator as the uniform 
>> distribution, so that aspect is already covered by the existing tests.
>> 
>>> (Ahem.  Bonus points for a random_string() returning a bytea, based on
>>> pg_strong_random().)
>> 
>> Would love to. Separate patch or bundled into this one?
> 
> Here's the original with suggestions applied and a random_string that applies 
> on top of it.
> 
> Thanks!
> 
> P

Clearing up one CI failure.
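Regarding the convergence test idea above, a rough sketch of such a check
might look like this (thresholds are illustrative only, and this assumes
the patch's random_normal(mean, stddev) signature):

    SELECT avg(x) BETWEEN -0.05 AND 0.05 AS mean_ok,
           stddev(x) BETWEEN 0.95 AND 1.05 AS stddev_ok
    FROM (SELECT random_normal(0, 1) AS x
          FROM generate_series(1, 100000)) ss;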




random_normal_06a.patch
Description: Binary data


random_normal_06b.patch
Description: Binary data


Re: wake up logical workers after ALTER SUBSCRIPTION

2022-12-13 Thread Tom Lane
Nathan Bossart  writes:
> After sleeping on this, I think we can do better.  IIUC we can simply check
> for AllTablesyncsReady() at the end of process_syncing_tables_for_apply()
> and wake up the logical replication workers (which should just consist of
> setting the current process's latch) if we are ready for two_phase mode.

I independently rediscovered the need for something like this after
wondering why the subscription/t/031_column_list.pl test seemed to
take so much longer than its siblings.  I found that a considerable
amount of the elapsed time was wasted because we were waiting up to
a full second (NAPTIME_PER_CYCLE) for the logrep worker to notice
that something had changed in the local subscription state.  At least
on my machine, it seems that the worst-case timing is reliably hit
multiple times during this test.  Now admittedly, this is probably not
a significant problem in real-world usage; but it's sure annoying that
it eats time during check-world.

However, this patch seems to still be leaving quite a bit on the
table.  Here's the timings I see for the subscription suite in HEAD
(test is just "time make check PROVE_FLAGS=--timer" with an
assert-enabled build):

+++ tap check in src/test/subscription +++
[18:07:38] t/001_rep_changes.pl ... ok 6659 ms ( 0.00 usr  0.00 sys +  0.89 cusr  0.52 csys =  1.41 CPU)
[18:07:45] t/002_types.pl . ok 1572 ms ( 0.00 usr  0.00 sys +  0.70 cusr  0.27 csys =  0.97 CPU)
[18:07:47] t/003_constraints.pl ... ok 1436 ms ( 0.01 usr  0.00 sys +  0.74 cusr  0.25 csys =  1.00 CPU)
[18:07:48] t/004_sync.pl .. ok 3007 ms ( 0.00 usr  0.00 sys +  0.75 cusr  0.31 csys =  1.06 CPU)
[18:07:51] t/005_encoding.pl .. ok 1468 ms ( 0.00 usr  0.00 sys +  0.74 cusr  0.21 csys =  0.95 CPU)
[18:07:53] t/006_rewrite.pl ... ok 1494 ms ( 0.00 usr  0.00 sys +  0.72 cusr  0.24 csys =  0.96 CPU)
[18:07:54] t/007_ddl.pl ... ok 2005 ms ( 0.00 usr  0.00 sys +  0.73 cusr  0.24 csys =  0.97 CPU)
[18:07:56] t/008_diff_schema.pl ... ok 1746 ms ( 0.01 usr  0.00 sys +  0.70 cusr  0.28 csys =  0.99 CPU)
[18:07:58] t/009_matviews.pl .. ok 1878 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.24 csys =  0.95 CPU)
[18:08:00] t/010_truncate.pl .. ok 2999 ms ( 0.00 usr  0.00 sys +  0.77 cusr  0.38 csys =  1.15 CPU)
[18:08:03] t/011_generated.pl . ok 1467 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.24 csys =  0.95 CPU)
[18:08:04] t/012_collation.pl . skipped: ICU not supported by this build
[18:08:04] t/013_partition.pl . ok 4787 ms ( 0.01 usr  0.00 sys +  1.29 cusr  0.71 csys =  2.01 CPU)
[18:08:09] t/014_binary.pl  ok 2564 ms ( 0.00 usr  0.00 sys +  0.72 cusr  0.28 csys =  1.00 CPU)
[18:08:12] t/015_stream.pl  ok 2531 ms ( 0.01 usr  0.00 sys +  0.73 cusr  0.27 csys =  1.01 CPU)
[18:08:14] t/016_stream_subxact.pl  ok 1590 ms ( 0.00 usr  0.00 sys +  0.70 cusr  0.24 csys =  0.94 CPU)
[18:08:16] t/017_stream_ddl.pl  ok 1610 ms ( 0.00 usr  0.00 sys +  0.72 cusr  0.25 csys =  0.97 CPU)
[18:08:17] t/018_stream_subxact_abort.pl .. ok 1827 ms ( 0.00 usr  0.00 sys +  0.73 cusr  0.24 csys =  0.97 CPU)
[18:08:19] t/019_stream_subxact_ddl_abort.pl .. ok 1474 ms ( 0.00 usr  0.00 sys +  0.71 cusr  0.24 csys =  0.95 CPU)
[18:08:21] t/020_messages.pl .. ok 2423 ms ( 0.01 usr  0.00 sys +  0.74 cusr  0.25 csys =  1.00 CPU)
[18:08:23] t/021_twophase.pl .. ok 4799 ms ( 0.00 usr  0.00 sys +  0.82 cusr  0.39 csys =  1.21 CPU)
[18:08:28] t/022_twophase_cascade.pl .. ok 4346 ms ( 0.00 usr  0.00 sys +  1.12 cusr  0.54 csys =  1.66 CPU)
[18:08:32] t/023_twophase_stream.pl ... ok 3656 ms ( 0.01 usr  0.00 sys +  0.78 cusr  0.32 csys =  1.11 CPU)
[18:08:36] t/024_add_drop_pub.pl .. ok 3585 ms ( 0.00 usr  0.00 sys +  0.73 cusr  0.29 csys =  1.02 CPU)
[18:08:39] t/025_rep_changes_for_schema.pl  ok 3631 ms ( 0.00 usr  0.00 sys +  0.77 cusr  0.34 csys =  1.11 CPU)
[18:08:43] t/026_stats.pl . ok 4096 ms ( 0.00 usr  0.00 sys +  0.77 cusr  0.32 csys =  1.09 CPU)
[18:08:47] t/027_nosuperuser.pl ... ok 4824 ms ( 0.01 usr  0.00 sys +  0.77 cusr  0.39 csys =  1.17 CPU)
[18:08:52] t/028_row_filter.pl  ok 5321 ms ( 0.00 usr  0.00 sys +  0.90 cusr  0.50 csys =  1.40 CPU)
[18:08:57] t/029_on_error.pl .. ok 3748 ms ( 0.00 usr  0.00 sys +  0.75 cusr  0.32 csys =  1.07 CPU)
[18:09:01] t/030_origin.pl  ok 4496 ms ( 0.00 usr  0.00 sys +  1.09 cusr  0.45 csys =  1.54 CPU)
[18:09:06] t/031_column_list.pl ... ok 13802 ms ( 0.01 usr  0.00 sys +  1.00 cusr  0.69 csys =  1.70 CPU)

Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread Peter Geoghegan
On Tue, Dec 13, 2022 at 9:16 AM Peter Geoghegan  wrote:
> That's not the only thing we care about, though. And to the extent we
> care about it, we mostly care about the consequences of either
> freezing or not freezing eagerly. Concentration of unfrozen pages in
> one particular table is a lot more of a concern than the same number
> of heap pages being spread out across multiple tables. Those tables
> can all be independently vacuumed, and come with their own
> relfrozenxid, that can be advanced independently, and are very likely
> to be frozen as part of a vacuum that needed to happen anyway.

At the suggestion of Jeff, I wrote a Wiki page that shows motivating
examples for the patch series:

https://wiki.postgresql.org/wiki/Freezing/skipping_strategies_patch:_motivating_examples

These are all cases where VACUUM currently doesn't do the right thing
around freezing, in a way that is greatly ameliorated by the patch.
Perhaps this will help other hackers to understand the motivation
behind some of these mechanisms. There are plenty of details that only
make sense in the context of a certain kind of table, with certain
performance characteristics that the design is sensitive to, and seeks
to take advantage of in one way or another.

-- 
Peter Geoghegan




Re: Remove SHA256_HMAC_B from scram-common.h

2022-12-13 Thread Michael Paquier
On Tue, Dec 13, 2022 at 09:27:50AM -0800, Jacob Champion wrote:
> On Mon, Dec 12, 2022 at 8:57 PM Michael Paquier  wrote:
>> While doing some hackery on SCRAM, I have noticed $subject giving the
>> attached.  I guess that this is not going to cause any objections, but
>> feel free to comment just in case.
> 
> Yeah, no objection :D That cryptohash refactoring was quite nice.

Thanks.  I have much more refactoring work coming up in this area, and
the cryptohash move is helping quite a lot in terms of error handling.
--
Michael


signature.asc
Description: PGP signature


Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-12-13 Thread Tom Lane
I wrote:
> Peter Eisentraut  writes:
>> The case I was working on is the same as Israel's.  He has confirmed 
>> that this fixes the issue we have been working on.

> OK, I'll make this happen soon.

Pushed.  I left out the idea of making this conditional on whether
any preceding command had performed data modification, as that seemed
to greatly complicate the explanation (since "have we performed any
data modification" is a rather squishy question from a user's viewpoint).

regards, tom lane




Re: Progress report of CREATE INDEX for nested partitioned tables

2022-12-13 Thread Ilya Gladyshev
On Mon, 2022-12-12 at 22:43 -0600, Justin Pryzby wrote:
> On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote:
> > 
> > > Could you check what I've written as a counter-proposal ?
> > 
> > I think that this might be a good solution to start with, it gives
> > us the opportunity to improve the granularity later without any
> > surprising changes for the end user. We could use this patch for
> > previous versions and make more granular output in the latest. What
> > do you think?
> 
> Somehow, it hadn't occurred to me that my patch "lost granularity" by
> incrementing the progress bar by more than one...  Shoot.
> 
> > I actually think that the progress view would be better off without
> > the total number of partitions, 
> 
> Just curious - why ?

We don't really know how many indexes we are going to create, unless we
have some kind of preliminary "planning" stage where we accumulate all
the relations that will need to have indexes created (rather than
attached). And if someone wants the total, it can be calculated
manually without this view; that's less user-friendly, but if we can't
do it well, I would leave it up to the user.
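(For example, something along these lines already gives the total number
of leaf partitions under a given parent, without the progress view --
"parent_table" standing in for whatever table the CREATE INDEX targets:

    SELECT count(*)
    FROM pg_partition_tree('parent_table')
    WHERE isleaf;
)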

> 
> > With this in mind, I think your proposal to exclude catalog-only
> > indexes sounds reasonable to me, but I feel like the docs are off
> > in this case, because the attached indexes are not created, but we
> > pretend like they are in this metric, so we should fix one or the
> > other.
> 
> I agree that the docs should indicate whether we're counting "all
> partitions", "direct partitions", and whether or not that includes
> partitioned partitions, or just leaf partitions.

Agreed. I think the docs should also be explicit about the attached
indexes, if we decide to count them as "created".

> I have another proposal: since the original patch 3.5 years ago
> didn't
> consider or account for sub-partitions, let's not start counting them
> now.  It was never defined whether they were included or not (and I
> guess that they're not common) so we can take this opportunity to
> clarify the definition.

I had this thought initially, but then I realized that it's not
what I would want if I were to track the progress of multi-level
partitioned tables (but yeah, I guess it's pretty uncommon). In this
respect, I like your initial counter-proposal more, because it leaves
us room to improve this in the future. Otherwise, if we commit to
reporting only top-level partitions now, I'm not sure we will have the
opportunity to change this.


> Alternately, if it's okay to add nparts_done to the IndexStmt, then
> that's easy.

Yeah, or we could add another argument to DefineIndex. I don't know if
it's ok, or which option is better here in terms of compatibility and
interface-wise, so I have tried both of them, see the attached patches.



From 4f55526f8a534d6b2f27b6b17cadaee0370e4cbc Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 13 Dec 2022 19:42:53 +0400
Subject: [PATCH] parts_done via DefineIndex args

---
 src/backend/commands/indexcmds.c | 52 +++-
 src/backend/commands/tablecmds.c |  7 +++--
 src/backend/tcop/utility.c   |  3 +-
 src/include/commands/defrem.h|  3 +-
 4 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b5b860c3ab..9fa9109b9e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -129,6 +129,29 @@ typedef struct ReindexErrorInfo
	char		relkind;
 } ReindexErrorInfo;
 
+
+/*
+ * Count the number of direct and indirect leaf partitions, excluding foreign
+ * tables.
+ */
+static int
+num_leaf_partitions(Oid relid)
+{
+   int nleaves = 0;
+   List   *childs = find_all_inheritors(relid, NoLock, NULL);
+   ListCell *lc;
+
+   foreach(lc, childs)
+   {
+   Oid partrelid = lfirst_oid(lc);
+   if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid)))
+   nleaves++;
+   }
+
+   list_free(childs);
+   return nleaves;
+}
+
 /*
  * CheckIndexCompatible
  * Determine whether an existing index definition is compatible with a
@@ -530,7 +553,8 @@ DefineIndex(Oid relationId,
bool check_rights,
bool check_not_in_use,
bool skip_build,
-   bool quiet)
+   bool quiet,
+   int *parts_done)
 {
	bool		concurrent;
char   *indexRelationName;
@@ -568,6 +592,7 @@ DefineIndex(Oid relationId,
Oid root_save_userid;
int root_save_sec_context;
int root_save_nestlevel;
+   int root_parts_done;
 
root_save_nestlevel = NewGUCNestLevel();
 
@@ -1218,8 +1243,17 @@ DefineIndex(Oid relationId,
RelationparentIndex;

Re: Ordering behavior for aggregates

2022-12-13 Thread Vik Fearing

On 12/13/22 18:22, Tom Lane wrote:

"David G. Johnston"  writes:

I'm more keen on the idea of having the system understand when an ORDER BY
is missing - that seems like what users are more likely to actually do.


That side of it could perhaps be useful, but not if it's an unintelligent
analysis.  If someone has a perfectly safe query written according to
the old-school method:

SELECT string_agg(...) FROM (SELECT ... ORDER BY ...) ss;

they are not going to be too pleased with a nanny-ish warning (much
less an error) saying that the aggregate's input ordering is
underspecified.


That is a good point.


I also wonder whether we'd accept any ORDER BY whatsoever, or try
to require one that produces a sufficiently-unique input ordering.


I would accept anything.  agg(x order by y) is a common thing.

--
Vik Fearing





Re: Remove SHA256_HMAC_B from scram-common.h

2022-12-13 Thread Jacob Champion
On Mon, Dec 12, 2022 at 8:57 PM Michael Paquier  wrote:
> While doing some hackery on SCRAM, I have noticed $subject giving the
> attached.  I guess that this is not going to cause any objections, but
> feel free to comment just in case.

Yeah, no objection :D That cryptohash refactoring was quite nice.

--Jacob




Re: Ordering behavior for aggregates

2022-12-13 Thread Tom Lane
"David G. Johnston"  writes:
> I'm more keen on the idea of having the system understand when an ORDER BY
> is missing - that seems like what users are more likely to actually do.

That side of it could perhaps be useful, but not if it's an unintelligent
analysis.  If someone has a perfectly safe query written according to
the old-school method:

SELECT string_agg(...) FROM (SELECT ... ORDER BY ...) ss;

they are not going to be too pleased with a nanny-ish warning (much
less an error) saying that the aggregate's input ordering is
underspecified.

I also wonder whether we'd accept any ORDER BY whatsoever, or try
to require one that produces a sufficiently-unique input ordering.

regards, tom lane




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread Peter Geoghegan
On Tue, Dec 13, 2022 at 12:29 AM John Naylor
 wrote:
> If the number of unfrozen heap pages is the thing we care about, perhaps 
> that, and not the total size of the table, should be the parameter that 
> drives freezing strategy?

That's not the only thing we care about, though. And to the extent we
care about it, we mostly care about the consequences of either
freezing or not freezing eagerly. Concentration of unfrozen pages in
one particular table is a lot more of a concern than the same number
of heap pages being spread out across multiple tables. Those tables
can all be independently vacuumed, and come with their own
relfrozenxid, that can be advanced independently, and are very likely
to be frozen as part of a vacuum that needed to happen anyway.

Pages become frozen pages because VACUUM freezes those pages. Same
with all-visible pages, which could in principle have been made
all-frozen instead, had VACUUM opted to do it that way back when it
processed the page. So VACUUM is not a passive, neutral observer here.
What happens over time and across multiple VACUUM operations is very
relevant. VACUUM needs to pick up where it left off last time, at
least with larger tables, where the time between VACUUMs is naturally
very high, and where each individual VACUUM has to process a huge
number of individual pages. It's not really practical to take a "wait
and see" approach with big tables.

At the very least, a given VACUUM operation has to choose its freezing
strategy based on how it expects the table will look when it's done
vacuuming the table, and how that will impact the next VACUUM against
the same table. Without that, then vacuuming an append-only table will
fall into a pattern of setting pages all-visible in one vacuum, and
then freezing those same pages all-frozen in the very next vacuum
because there are too many. Which makes little sense; we're far better
off freezing the pages at the earliest opportunity instead.

We're going to have to write a WAL record for the visibility map
anyway, so doing everything at the same time has a lot to recommend
it. Even if it turns out to be quite wrong, we may still come out
ahead in terms of absolute volume of WAL written, and especially in
terms of performance stability. To a limited extent we need to reason
about what will happen in the near future. But we also need to reason
about which kinds of mispredictions we cannot afford to make, and
which kinds are okay. Some mistakes hurt a lot more than others.

-- 
Peter Geoghegan




Re: Ordering behavior for aggregates

2022-12-13 Thread David G. Johnston
On Tue, Dec 13, 2022 at 9:45 AM Ronan Dunklau 
wrote:

> On Tuesday, December 13, 2022 at 16:13:34 CET, Tom Lane wrote:
> > Accordingly, I find nothing at all attractive in this proposal.
> > I think the main thing it'd accomplish is to drive users back to
> > the bad old days of ordering-by-subquery, if they have a requirement
> > we failed to account for.
>
> I think the ability to mark certain aggregates as being able to completely
> ignore the ordering because they produce exactly the same results is still
> a
> useful optimization.
>
>
I seriously doubt that users are adding unnecessary ORDER BY clauses to
their aggregates. The more compelling use case would be existing ORMs that
produce such problematic SQL - are there any though?

I'm more keen on the idea of having the system understand when an ORDER BY
is missing - that seems like what users are more likely to actually do.
But it doesn't seem all that useful given the lack of aggregates that would
actually use it meaningfully.

David J.


Re: Ordering behavior for aggregates

2022-12-13 Thread Tom Lane
Ronan Dunklau  writes:
> On Tuesday, December 13, 2022 at 16:13:34 CET, Tom Lane wrote:
>> Accordingly, I find nothing at all attractive in this proposal.
>> I think the main thing it'd accomplish is to drive users back to
>> the bad old days of ordering-by-subquery, if they have a requirement
>> we failed to account for.

> I think the ability to mark certain aggregates as being able to completely 
> ignore the ordering because they produce exactly the same results is still a 
> useful optimization.

That is *exactly* the position I do not accept.

I think it's fairly unlikely that a user would trouble to write ORDER BY
within an aggregate call if they didn't need it.  So my opinion of this
proposal is that it's a lot of work to create an optimization effect that
will be useless to nearly all users, and might actively break the queries
of some.

regards, tom lane




Re: Minimal logical decoding on standbys

2022-12-13 Thread Drouvot, Bertrand




On 12/13/22 5:49 PM, Robert Haas wrote:

On Tue, Dec 13, 2022 at 11:46 AM Drouvot, Bertrand
 wrote:

Agree, and I don't think that there is any other option than adding some
payload in some WAL records (at the very beginning the proposal was to
periodically log a new record that announces the current catalog xmin
horizon).


Hmm, why did we abandon that approach? It sounds like it has some promise.


I should have put the reference to the discussion up-thread, it's in [1].

[1]: 
https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3ges...@alap3.anarazel.de


Gotcha, thanks.



You're welcome!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Minimal logical decoding on standbys

2022-12-13 Thread Robert Haas
On Tue, Dec 13, 2022 at 11:46 AM Drouvot, Bertrand
 wrote:
> >> Agree, and I don't think that there is any other option than adding some
> >> payload in some WAL records (at the very beginning the proposal was to
> >> periodically log a new record that announces the current catalog xmin
> >> horizon).
> >
> > Hmm, why did we abandon that approach? It sounds like it has some promise.
>
> I should have put the reference to the discussion up-thread, it's in [1].
>
> [1]: 
> https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3ges...@alap3.anarazel.de

Gotcha, thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2022-12-13 Thread Drouvot, Bertrand




On 12/13/22 5:42 PM, Robert Haas wrote:

On Tue, Dec 13, 2022 at 11:37 AM Drouvot, Bertrand
 wrote:

It seems kind of unfortunate to have to add payload to a whole bevy of
record types for this feature. I think it's worth it, both because the
feature is extremely important,


Agree, and I don't think that there is any other option than adding some
payload in some WAL records (at the very beginning the proposal was to
periodically log a new record that announces the current catalog xmin
horizon).


Hmm, why did we abandon that approach? It sounds like it has some promise.



I should have put the reference to the discussion up-thread, it's in [1].

[1]: 
https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3ges...@alap3.anarazel.de

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Ordering behavior for aggregates

2022-12-13 Thread Ronan Dunklau
Le mardi 13 décembre 2022, 16:13:34 CET Tom Lane a écrit :
> Accordingly, I find nothing at all attractive in this proposal.
> I think the main thing it'd accomplish is to drive users back to
> the bad old days of ordering-by-subquery, if they have a requirement
> we failed to account for.

I think the ability to mark certain aggregates as being able to completely 
ignore the ordering because they produce exactly the same results is still a 
useful optimization.

--
Ronan Dunklau






Re: Minimal logical decoding on standbys

2022-12-13 Thread Robert Haas
On Tue, Dec 13, 2022 at 11:37 AM Drouvot, Bertrand
 wrote:
> > It seems kind of unfortunate to have to add payload to a whole bevy of
> > record types for this feature. I think it's worth it, both because the
> > feature is extremely important,
>
> Agree, and I don't think that there is any other option than adding some
> payload in some WAL records (at the very beginning the proposal was to
> periodically log a new record that announces the current catalog xmin
> horizon).

Hmm, why did we abandon that approach? It sounds like it has some promise.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2022-12-13 Thread Drouvot, Bertrand

Hi,

On 12/13/22 2:50 PM, Robert Haas wrote:

On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand
 wrote:

I think the real problem here is that
RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
when the relation is a user-catalog table. It does so because it
relies on RelationIsUsedAsCatalogTable, and that macro relies on
checking whether the reloptions include user_catalog_table.


[ confusion ]


Sorry, I meant: RelationIsAccessibleInLogicalDecoding is returning
*the wrong answer* when the relation is an *INDEX*.



Yeah, agree. Will fix it in the next patch proposal (adding the index case in 
it as you proposed up-thread).


It would be very helpful if there were some place to refer to that
explained the design decisions here, like why the feature we're trying
to get requires this infrastructure around indexes to be added. It
could be in the commit messages, an email message, a README, or
whatever, but right now I don't see it anywhere in here.


Like adding something along those lines in the commit message?

"
On a primary database, any catalog rows that may be needed by a logical 
decoding replication slot are not removed.
This is done thanks to the catalog_xmin associated with the logical replication 
slot.

With logical decoding on standby, in the following cases:

- hot_standby_feedback is off
- hot_standby_feedback is on but there is no physical slot between the
primary and the standby; hot_standby_feedback then works, but only while
the connection is alive (for example a node restart would break it)

the primary may delete system catalog rows that could be needed by
logical decoding on the standby. It is therefore mandatory to identify
those rows and invalidate the slots that may need them, if any.
"


This is very helpful, yes. I think perhaps we need to work some of
this into the code comments someplace, but getting it into the commit
message would be a good first step.


Thanks, will do.



What I infer from the above is that the overall design looks like this:

- We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing replication conflicts much as hot standby does.
- Our chosen strategy for dealing with this type of replication slot
is to invalidate logical slots for which needed data has been removed.
- To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access.
- We can't rely on the standby's relcache entries for this purpose in
any way, because the WAL record that causes the problem might be
replayed before the standby even reaches consistency. (Is this true? I
think so.)
- Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Does that sound right?



Yeah, that sounds all right to me.
One option could be to add my proposed wording in the commit message and put 
your wording above in a README.


It seems kind of unfortunate to have to add payload to a whole bevy of
record types for this feature. I think it's worth it, both because the
feature is extremely important,


Agree, and I don't think that there is any other option than adding some
payload in some WAL records (at the very beginning the proposal was to
periodically log a new record that announces the current catalog xmin
horizon).


and also because there aren't any
record types that fall into this category that are going to be emitted
so frequently as to make it a performance problem. 


+1

If there are no objections from your side, I'll submit a patch proposal
by tomorrow, which:

- gets rid of IndexIsAccessibleInLogicalDecoding
- lets RelationIsAccessibleInLogicalDecoding deal with the index case
- takes care of the padding where the new bool is added
- converts this new bool to a flag for the xl_heap_visible case (adding
a new bit to the already existing flag)
- adds my proposed wording above to the commit message
- adds your proposed wording above in a README

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add LSN along with offset to error messages reported for WAL file read/write/validate header failures

2022-12-13 Thread Bharath Rupireddy
On Tue, Dec 6, 2022 at 4:46 PM Bharath Rupireddy
 wrote:
>
> That said, I think we can have a single function
> pg_dissect_walfile_name(wal_file_name, offset optional) returning
> segno (XLogSegNo - physical log file sequence number), tli, lsn (if
> offset is given). This way there is no need for another SQL-callable
> function using pg_settings. Thoughts?
>
> > (If we assume that the file names are typed in letter-by-letter, I
> >  rather prefer to allow lower-case letters:p)
>
> It's easily doable if we convert the entered WAL file name to
> uppercase using pg_toupper() and then pass it to IsXLogFileName().

Okay, here's the v5 patch that I could come up with. It basically adds
functions for dissecting a WAL file name and computing the LSN from a
WAL file name and byte offset. Thoughts?
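For example, usage would look something like this (function names per the
v5 patch; the returned values depend on wal_segment_size):

    SELECT * FROM pg_dissect_walfile_name('000000010000000000000001');
    SELECT pg_walfile_offset_lsn('000000010000000000000001', 1024);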

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From e5f102911dd7670c96d9826b2e5dd377235f1717 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 13 Dec 2022 12:05:11 +
Subject: [PATCH v5] Add utility functions to dissect WAL file name and compute
 offset

---
 doc/src/sgml/func.sgml   | 33 +++
 src/backend/access/transam/xlogfuncs.c   | 62 
 src/backend/catalog/system_functions.sql | 10 
 src/include/access/xlog_internal.h   |  8 +++
 src/include/catalog/pg_proc.dat  | 12 
 src/test/regress/expected/misc_functions.out | 53 +
 src/test/regress/sql/misc_functions.sql  | 17 ++
 7 files changed, 195 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ad31fdb737..894f36bcb6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26099,6 +26099,39 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560

   
 
+  
+   
+
+ pg_dissect_walfile_name
+
+pg_dissect_walfile_name ( file_name text )
+record
+( segno numeric,
+timeline_id integer )
+   
+   
+Computes physical write-ahead log (WAL) file sequence number and
+timeline ID from a given WAL file name.
+   
+  
+
+  
+   
+
+ pg_walfile_offset_lsn
+
+pg_walfile_offset_lsn ( file_name text, file_offset integer )
+pg_lsn
+   
+   
+Computes write-ahead log (WAL) location from a given WAL file name
+and byte offset within that file. When
+file_offset is negative or greater than or
+equal to wal_segment_size, this function returns
+NULL.
+   
+  
+
   

 
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 487d5d9cac..32cf431aca 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -432,6 +432,68 @@ pg_walfile_name(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
 }
 
+/*
+ * Compute segno (physical WAL file sequence number) and timeline ID given a
+ * WAL file name.
+ */
+Datum
+pg_dissect_walfile_name(PG_FUNCTION_ARGS)
+{
+#define PG_DISSECT_WALFILE_NAME_COLS 2
+	char	   *fname = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	char	   *fname_upper;
+	char	   *p;
+	TimeLineID	tli;
+	XLogSegNo	segno;
+	uint32		log;
+	uint32		seg;
+	Datum		values[PG_DISSECT_WALFILE_NAME_COLS] = {0};
+	bool		isnull[PG_DISSECT_WALFILE_NAME_COLS] = {0};
+	TupleDesc	tupdesc;
+	HeapTuple	tuple;
+	char		buf[256];
+	Datum		result;
+
+	fname_upper = pstrdup(fname);
+
+	/* Capitalize WAL file name. */
+	for (p = fname_upper; *p; p++)
+		*p = pg_toupper((unsigned char) *p);
+
+	if (!IsXLogFileName(fname_upper))
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", fname)));
+
+	XLogIdFromFileName(fname_upper, &tli, &log, &seg, &segno, wal_segment_size);
+
+	if (seg >= XLogSegmentsPerXLogId(wal_segment_size) ||
+		segno == 0 ||
+		tli == 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid WAL file name \"%s\"", fname)));
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	/* Convert to numeric. */
+	snprintf(buf, sizeof buf, UINT64_FORMAT, segno);
+	values[0] = DirectFunctionCall3(numeric_in,
+	CStringGetDatum(buf),
+	ObjectIdGetDatum(0),
+	Int32GetDatum(-1));
+
+	values[1] = UInt32GetDatum(tli);
+
+	tuple = heap_form_tuple(tupdesc, values, isnull);
+	result = HeapTupleGetDatum(tuple);
+
+	PG_RETURN_DATUM(result);
+
+#undef PG_DISSECT_WALFILE_NAME_COLS
+}
+
 /*
  * pg_wal_replay_pause - Request to pause recovery
  *
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 52517a6531..5dde5007ce 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -397,6 +397,16 @@ CREATE OR REPLACE 

Re: Use get_call_result_type() more widely

2022-12-13 Thread Tom Lane
Bharath Rupireddy  writes:
> A review comment in another thread [1] by Michael Paquier about the
> usage of get_call_result_type() instead of explicit building of
> TupleDesc made me think about using it more widely. Actually, the
> get_call_result_type() looks at the function definitions to figure the
> column names and build the required TupleDesc, usage of which avoids
> duplication of the column names between pg_proc.dat/function
> definitions and source code. Also, it saves a good number of LOC ~415
> [2] and the size of all the object files put together gets reduced by
> ~4MB, which means, the postgres binary becomes leaner by ~4MB [3].

Saving code is nice, but I'd assume the result is slower, because
get_call_result_type has to do a pretty substantial amount of work
to get the data to construct the tupdesc from.  Have you tried to
quantify how much overhead this'd add?  Which of these functions
can we safely consider to be non-performance-critical?

regards, tom lane




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread Masahiko Sawada
On Sun, Dec 11, 2022 at 8:45 PM houzj.f...@fujitsu.com
 wrote:
>
> On Friday, December 9, 2022 3:14 PM Amit Kapila  
> wrote:
> >
> > On Thu, Dec 8, 2022 at 12:37 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> >
> > Review comments
>
> Thanks for the comments!
>
> > ==
> > 1. Currently, we don't release the stream lock in LA (leader apply
> > worker) for "rollback to savepoint" and the reason is mentioned in comments 
> > of
> > apply_handle_stream_abort() in the patch. But, today, while testing, I 
> > found that
> > can lead to deadlock which otherwise, won't happen on the publisher. The key
> > point is rollback to savepoint releases the locks acquired by the particular
> > subtransaction, so parallel apply worker should also do the same. Consider 
> > the
> > following example where the transaction in session-1 is being performed by 
> > the
> > parallel apply worker and the transaction in session-2 is being performed 
> > by the
> > leader apply worker. I have simulated it by using GUC force_stream_mode.
> > Publisher
> > ==
> > Session-1
> > postgres=# begin;
> > BEGIN
> > postgres=*# savepoint s1;
> > SAVEPOINT
> > postgres=*# truncate t1;
> > TRUNCATE TABLE
> >
> > Session-2
> > postgres=# begin;
> > BEGIN
> > postgres=*# insert into t1 values(4);
> >
> > Session-1
> > postgres=*# rollback to savepoint s1;
> > ROLLBACK
> >
> > Session-2
> > Commit;
> >
> > With or without commit of Session-2, this scenario will lead to deadlock on 
> > the
> > subscriber because PA (parallel apply worker) is waiting for LA to send the 
> > next
> > command, and LA is blocked by Exclusive of PA. There is no deadlock on the
> > publisher because rollback to savepoint will release the lock acquired by
> > truncate.
> >
> > To solve this, How about if we do three things before sending abort of
> > sub-transaction (a) unlock the stream lock, (b) increment 
> > pending_stream_count,
> > (c) take the stream lock again?
> >
> > Now, if the PA is not already waiting on the stop, it will not wait at 
> > stream_stop
> > but will wait after applying abort of sub-transaction and if it is already 
> > waiting at
> > stream_stop, the wait will be released. If this works then probably we 
> > should try
> > to do (b) before (a) to match the steps with stream_start.
>
> The solution works for me, I have changed the code as suggested.
>
>
> > 2. There seems to be another general problem in the way the patch waits for
> > stream_stop in PA (parallel apply worker). Currently, PA checks, if there 
> > are no
> > more pending streams then it tries to wait for the next stream by waiting 
> > on a
> > stream lock. However, it is possible after PA checks there is no pending 
> > stream
> > and before it actually starts waiting on a lock, the LA sends another 
> > stream for
> > which even stream_stop is sent, in this case, PA will start waiting for the 
> > next
> > stream whereas there is actually a pending stream available. In this case, 
> > it won't
> > lead to any problem apart from delay in applying the changes in such cases 
> > but
> > for the case mentioned in the previous point (Pont 1), it can lead to 
> > deadlock
> > even after we implement the solution proposed to solve it.
>
> Thanks for reporting. I have introduced another flag in shared memory and
> used it to prevent the leader from incrementing the pending_stream_count
> if the parallel apply worker is trying to lock the stream lock.
>
>
> > 3. The other point to consider is that for stream_commit/prepare/abort, in 
> > LA, we
> > release the stream lock after sending the message whereas for stream_start 
> > we
> > release it before sending the message. I think for the earlier cases
> > (stream_commit/prepare/abort), the patch has done like this because
> > pa_send_data() may need to require the lock again when it times out and 
> > start
> > serializing, so there will be no sense in first releasing it, then 
> > re-acquiring it, and
> > then again releasing it. Can't we also release the lock for stream_start 
> > after
> > pa_send_data() only if it is not switched to serialize mode?
>
> Changed.
>
> Attach the new version patch set which addressed above comments.

Here are comments on v59 0001, 0002 patches:

+void
+pa_increment_stream_block(ParallelApplyWorkerShared *wshared)
+{
+	while (1)
+	{
+		SpinLockAcquire(&wshared->mutex);
+
+		/*
+		 * Don't try to increment the count if the parallel apply worker is
+		 * taking the stream lock. Otherwise, there would be a race condition
+		 * that the parallel apply worker checks there is no pending streaming
+		 * block and before it actually starts waiting on a lock, the leader
+		 * sends another streaming block and take the stream lock again. In
+		 * this case, the parallel apply worker will start waiting for the next
+		 * streaming block whereas there is actually a

Re: Ordering behavior for aggregates

2022-12-13 Thread Tom Lane
Ronan Dunklau  writes:
> On Tuesday, December 13, 2022 at 14:05:10 CET, Vik Fearing wrote:
>> On 12/13/22 13:55, Magnus Hagander wrote:
>>> On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing  
>>> wrote:
 However, it is completely useless for things like AVG() or SUM().  If
 you include it, the aggregate will do the sort even though it is neither
 required nor desired.

> I'm not sure about this. For AVG and SUM, if you want reproducible results 
> with floating point numbers, you may want it.

Yeah, I was about to mention the floating-point issue.  IIRC, we went
over exactly this ground when we introduced aggregate ORDER BY, and
decided that it was not our business to legislate whether particular
aggregates need ordering or not.  We don't try to second-guess users'
inclusion of ORDER BY in subqueries either, and that's just about
the same thing.  (Indeed, if you're feeding the subquery output to
an aggregate, it's exactly the same thing.)
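For instance, float8 addition is order-dependent (a quick illustration;
the 1 vs. 0 results follow from 1.0 being below the rounding granularity
of 1e20):

    SELECT (1e20::float8 + (-1e20)::float8) + 1.0::float8 AS left_first,  -- 1
           1e20::float8 + ((-1e20)::float8 + 1.0::float8) AS right_first; -- 0

So the order in which SUM() or AVG() consumes float8 inputs can
legitimately change the result.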

Accordingly, I find nothing at all attractive in this proposal.
I think the main thing it'd accomplish is to drive users back to
the bad old days of ordering-by-subquery, if they have a requirement
we failed to account for.

regards, tom lane




Re: refactor ExecGrant_*() functions

2022-12-13 Thread Antonin Houska
Peter Eisentraut  wrote:

> On 12.12.22 10:44, Antonin Houska wrote:
> > Peter Eisentraut  wrote:
> > 
> >> On 06.12.22 09:41, Antonin Houska wrote:
> >>> Attached are my proposals for improvements. One is to avoid a memory leak, 
> >>> the
> >>> other tries to improve readability a little bit.
> >>
> >> I added the readability improvement to my v2 patch.  The pfree() calls 
> >> aren't
> >> necessary AFAICT.
> 
> It's something to consider, but since this is a refactoring patch and the old
> code didn't do it either, I think it's out of scope.

Well, the reason I brought this topic up is that the old code didn't even
palloc() those arrays. (Because they were located on the stack.)

> > I see that memory contexts exist and that the amount of memory freed is not
> > huge, but my style is to free the memory explicitly if it's allocated in a
> > loop.
> > v2 looks good to me.
> 
> Committed, thanks.

ok, I'll post the rebased "USAGE privilege on PUBLICATION" patch [1] soon.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

[1] https://commitfest.postgresql.org/41/3641/




Re: Temporary tables versus wraparound... again

2022-12-13 Thread Greg Stark
On Wed, 7 Dec 2022 at 22:02, Greg Stark  wrote:
> > Seems like this should just be in
> > heapam_relation_nontransactional_truncate()?

So here I've done it that way. It is a bit of an unfortunate layering,
since it means the heapam_handler is doing the catalog change, but it
does seem inconvenient to pass relfrozenxid etc. back up and have the
caller make the changes when there are no other changes to make.

Also, I'm not sure what changed, but maybe there were some other commits
to vacuum.c in the meantime. I remember being frustrated previously
trying to reuse that code, but now it works fine. So I was able to
reduce the copy-pasted code significantly.

(The tests are probably not worth committing; they're just here for my
own testing, to be sure it's doing anything.)

-- 
greg
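For context, the scenario the patch targets is the common pattern below
(a sketch); with the patch, each commit-time truncation also resets
relfrozenxid instead of letting the table age toward wraparound:

    CREATE TEMP TABLE scratch (id int) ON COMMIT DELETE ROWS;
    -- reused by many short transactions in a long-lived session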
From f6f800819c497817c3f2957683fc521abe82c2b7 Mon Sep 17 00:00:00 2001
From: Greg Stark 
Date: Thu, 31 Mar 2022 15:49:19 -0400
Subject: [PATCH v8 2/3] Update relfrozenxmin when truncating temp tables

Make ON COMMIT DELETE ROWS reset relfrozenxmin and other table stats
to the same values used in initial table creation. Otherwise even
typical short-lived transactions in long-lived sessions using
temporary tables can easily cause them to reach transaction wraparound
and autovacuum cannot come to the rescue for temporary tables.

Also optimize the relminmxid used for temporary tables to be our
own oldest MultiXactId instead of the globally oldest one. This avoids
the expensive calculation of the latter on every transaction commit.

This code path is also used by truncation of tables created within the
same transaction.
---
 src/backend/access/heap/heapam_handler.c | 51 +++-
 src/backend/access/transam/multixact.c   | 15 +++
 src/backend/catalog/heap.c   | 14 ++-
 src/include/access/multixact.h   |  1 +
 4 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ab1bcf3522..b01a25884f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -33,6 +33,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "commands/progress.h"
+#include "commands/vacuum.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -587,9 +588,18 @@ heapam_relation_set_new_filelocator(Relation rel,
 	 * could reuse values from their local cache, so we are careful to
 	 * consider all currently running multis.
 	 *
-	 * XXX this could be refined further, but is it worth the hassle?
+	 * In the case of temporary tables we can refine this slightly and use
+	 * our own oldest visible MultiXactId. This is also cheaper to calculate,
+	 * which is nice since temporary tables might be getting created often.
 	 */
-	*minmulti = GetOldestMultiXactId();
+	if (persistence == RELPERSISTENCE_TEMP)
+	{
+		*minmulti = GetOurOldestMultiXactId();
+	}
+	else
+	{
+		*minmulti = GetOldestMultiXactId();
+	}
 
 	srel = RelationCreateStorage(*newrlocator, persistence, true);
 
@@ -618,6 +628,43 @@ heapam_relation_set_new_filelocator(Relation rel,
 static void
 heapam_relation_nontransactional_truncate(Relation rel)
 {
+	/* This function from vacuum.c does a non-transactional update of the
+	 * catalog entry for this relation. That's ok because these values are
+	 * always safe regardless of whether we commit and in any case this is
+	 * either a temporary table or a filenode created in this transaction so
+	 * this tuple will be irrelevant if we do not commit. It's also important
+	 * to avoid lots of catalog bloat due to temporary tables being truncated
+	 * on every transaction commit.
+	 *
+	 * We set in_outer_transaction=false because that controls whether
+	 * vac_update_relstats updates other columns like relhasindex,
+	 * relhasrules, relhastriggers which is not changing here. This is a bit
+	 * of a hack, perhaps this parameter should change name.
+	 *
+	 * These parameters should stay in sync with
+	 * heapam_relation_set_new_filelocator() above and AddNewRelationTuple()
+	 * in heap.c. In theory this should probably return the relfrozenxid and
+	 * relminmxid and heap_truncate_one_rel() in heap.c should handle
+	 * num_tuples and num_pages but that would be slightly inconvenient and
+	 * require an api change.
+	 */
+
+	/* Ensure RecentXmin is valid -- it almost certainly is but regression
+	 * tests turned up an unlikely corner case where it might not be */
+	if (!FirstSnapshotSet)
+		(void)GetLatestSnapshot();
+	Assert(FirstSnapshotSet);
+	vac_update_relstats(rel,
+		0, /* num_pages */
+		-1, /* num_tuples */
+		0, /* num_all_visible_pages */
+		true, /* hasindex -- ignored due to in_outer_xact false */
+		RecentXmin, /* frozenxid */
+		RelationIsPermanent(rel) ? GetOldestMultiXactId() : GetOurOldestMultiXactId(),
+		NULL, NULL, /* frozenxid_updated, minmxid_updated */
+		false /* in_outer_xac (see 

Re: Ordering behavior for aggregates

2022-12-13 Thread Ronan Dunklau
On Tuesday, December 13, 2022 at 14:05:10 CET, Vik Fearing wrote:
> On 12/13/22 13:55, Magnus Hagander wrote:
> > On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing  
wrote:
> >> However, it is completely useless for things like AVG() or SUM().  If
> >> you include it, the aggregate will do the sort even though it is neither
> >> required nor desired.

I'm not sure about this. For AVG and SUM, if you want reproducible results 
with floating-point numbers, you may want the ordering. And if you disallow 
it for most avg and sum implementations, except the floating-point ones, 
it's not a very consistent user experience.


> >> 
> >> I am proposing something like pg_aggregate.aggordering which would be an
> >> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
> >> all aggregates would have 'a' but I am thinking that a lot of them could
> >> be switched to 'f'.  In that case, if a user supplies an ordering, an
> >> error is raised.
> > 
> > Should there perhaps also be an option for "ignored" where we'd allow the
> > user to specify it, but not actually do the sort because we know it's
> > pointless? Or maybe that should be the behaviour of "forbidden", which
> > should then perhaps have a different name?
> 
> I did think about that but I can't think of any reason we would want to
> silently ignore something the user has written.  If the ordering doesn't
> make sense, we should forbid it.

It is allowed as of now, so forbidding it would be a compatibility issue
for queries existing in the wild. Ignoring it is just an optimization,
just as we optimize away some joins entirely.

--
Ronan Dunklau






Re: Minimal logical decoding on standbys

2022-12-13 Thread Robert Haas
On Tue, Dec 13, 2022 at 5:49 AM Drouvot, Bertrand
 wrote:
> > I think the real problem here is that
> > RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
> > when the relation is a user-catalog table. It does so because it
> > relies on RelationIsUsedAsCatalogTable, and that macro relies on
> > checking whether the reloptions include user_catalog_table.
>
> [ confusion ]

Sorry, I meant: RelationIsAccessibleInLogicalDecoding is returning
*the wrong answer* when the relation is an *INDEX*.

> > But I have a feeling that the reloptions
> > code is not very well-structured to allow reloptions to be stored any
> > place but in pg_class.reloptions, so this may be difficult to
> > implement.
>
> Why not remove this "property" from reloptions? (That would probably need
> many more changes than the current approach, and would probably have to
> take care of the upgrade scenario too.)
> I did not look into the details, but logged/unlogged is also propagated to
> the indexes, so maybe we could use the same approach here. But is it worth
> the added complexity (as compared to the current approach)?

I feel like changing the user-facing syntax is probably not a great
idea, as it inflicts upgrade pain that I don't see how we can really
fix.

> > It would be very helpful if there were some place to refer to that
> > explained the design decisions here, like why the feature we're trying
> > to get requires this infrastructure around indexes to be added. It
> > could be in the commit messages, an email message, a README, or
> > whatever, but right now I don't see it anywhere in here.
>
> Like adding something along those lines in the commit message?
>
> "
> On a primary database, any catalog rows that may be needed by a logical 
> decoding replication slot are not removed.
> This is done thanks to the catalog_xmin associated with the logical 
> replication slot.
>
> With logical decoding on standby, in the following cases:
>
> - hot_standby_feedback is off
> - hot_standby_feedback is on but there is no physical slot between the
> primary and the standby; hot_standby_feedback then works, but only while
> the connection is alive (for example a node restart would break it)
>
> the primary may delete system catalog rows that could be needed by
> logical decoding on the standby. It is therefore mandatory to identify
> those rows and invalidate the slots that may need them, if any.
> "

This is very helpful, yes. I think perhaps we need to work some of
this into the code comments someplace, but getting it into the commit
message would be a good first step.

What I infer from the above is that the overall design looks like this:

- We want to enable logical decoding on standbys, but replay of WAL
from the primary might remove data that is needed by logical decoding,
causing replication conflicts much as hot standby does.
- Our chosen strategy for dealing with this type of replication slot
is to invalidate logical slots for which needed data has been removed.
- To do this we need the latestRemovedXid for each change, just as we
do for physical replication conflicts, but we also need to know
whether any particular change was to data that logical replication
might access.
- We can't rely on the standby's relcache entries for this purpose in
any way, because the WAL record that causes the problem might be
replayed before the standby even reaches consistency. (Is this true? I
think so.)
- Therefore every WAL record that potentially removes data from the
index or heap must carry a flag indicating whether or not it is one
that might be accessed during logical decoding.

Does that sound right?

It seems kind of unfortunate to have to add payload to a whole bevy of
record types for this feature. I think it's worth it, both because the
feature is extremely important, and also because there aren't any
record types that fall into this category that are going to be emitted
so frequently as to make it a performance problem. But it's certainly
more complicated than one might wish.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Ordering behavior for aggregates

2022-12-13 Thread Vik Fearing

On 12/13/22 14:25, Isaac Morland wrote:

On Tue, 13 Dec 2022 at 07:50, Vik Fearing  wrote:

I am proposing something like pg_aggregate.aggordering which would be an
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
all aggregates would have 'a' but I am thinking that a lot of them could
be switched to 'f'.  In that case, if a user supplies an ordering, an
error is raised.



Although I find "r" attractive, I have two concerns about it:

1) Do we really want to require ordering? I know it's weird and partially
undefined to call something like string_agg without an ordering, but what
if in the specific application it doesn’t matter in what order the items
appear?

2) There is a backward compatibility issue here; it’s not clear to me we
could apply "r" to any existing aggregate.



I do not intend to add 'r' to any existing aggregate.  I included it in 
the hypothetical enum for future aggregates and extensions.  It isn't 
perfect either because first_value((x, y) ORDER BY x) can still give a 
semi-random result.




Actually upon consideration, I think I have similar concerns about "f". We
don’t usually forbid "dumb" things; e.g., I can write a function which
ignores its inputs. And in some situations, "dumb" things make sense. For
example, if I’m specifying a function to use as a filter, it could be
reasonable in a particular instance to provide a function which ignores one
or more of its inputs.



Sure, but this isn't a function; this is syntax.  And in your example 
you are ignoring the input whereas currently aggregates *do not* ignore 
the ordering when they could/should.

--
Vik Fearing





Re: Ordering behavior for aggregates

2022-12-13 Thread Isaac Morland
On Tue, 13 Dec 2022 at 07:50, Vik Fearing  wrote:

> I am proposing something like pg_aggregate.aggordering which would be an
> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
> all aggregates would have 'a' but I am thinking that a lot of them could
> be switched to 'f'.  In that case, if a user supplies an ordering, an
> error is raised.
>

Although I find "r" attractive, I have two concerns about it:

1) Do we really want to require ordering? I know it's weird and partially
undefined to call something like string_agg without an ordering, but what
if in the specific application it doesn’t matter in what order the items
appear?

2) There is a backward compatibility issue here; it’s not clear to me we
could apply "r" to any existing aggregate.

Actually upon consideration, I think I have similar concerns about "f". We
don’t usually forbid "dumb" things; e.g., I can write a function which
ignores its inputs. And in some situations, "dumb" things make sense. For
example, if I’m specifying a function to use as a filter, it could be
reasonable in a particular instance to provide a function which ignores one
or more of its inputs.


RE: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread houzj.f...@fujitsu.com
On Tue, Dec 13, 2022 7:06 AM Peter Smith  wrote:
> Some minor review comments for v58-0001

Thanks for your comments.

> ==
> 
> .../replication/logical/applyparallelworker.c
> 
> 1. pa_can_start
> 
> + /*
> + * Don't start a new parallel worker if user has set skiplsn as it's
> + * possible that user want to skip the streaming transaction. For streaming
> + * transaction, we need to serialize the transaction to a file so that we
> + * can get the last LSN of the transaction to judge whether to skip before
> + * starting to apply the change.
> + */
> + if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
> + return false;
> 
> 
> "that user want" -> "that they want"
> 
> "For streaming transaction," -> "For streaming transactions,"

Changed.

> ~~~
> 
> 2. pa_free_worker_info
> 
> + /* Remove from the worker pool. */
> + ParallelApplyWorkerPool = list_delete_ptr(ParallelApplyWorkerPool,
> +winfo);
> 
> Unnecessary wrapping

Changed.

> ~~~
> 
> 3. pa_set_stream_apply_worker
> 
> +/*
> + * Set the worker that required to apply the current streaming transaction.
> + */
> +void
> +pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
> +{
> + stream_apply_worker = winfo;
> +}
> 
> Comment wording seems wrong.

Tried to improve this comment.

> ==
> 
> src/include/replication/worker_internal.h
> 
> 4. ParallelApplyWorkerShared
> 
> + * XactLastCommitEnd from the parallel apply worker. This is required to
> + * update the lsn_mappings by leader worker.
> + */
> + XLogRecPtr last_commit_end;
> +} ParallelApplyWorkerShared;
> 
> 
> "This is required to update the lsn_mappings by leader worker." --> 
> did you mean "This is required by the leader worker so it can update 
> the lsn_mappings." ??

Changed.

Also, thanks for the kind reminder in [1]; I have rebased the patch set.
Attached is the new patch set.

[1] - 
https://www.postgresql.org/message-id/CAHut%2BPt4qv7xfJUmwdn6Vy47L5mqzKtkPr31%3DDmEayJWXetvYg%40mail.gmail.com

Best regards,
Hou zj


Re: Ordering behavior for aggregates

2022-12-13 Thread Vik Fearing

On 12/13/22 13:55, Magnus Hagander wrote:

On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing  wrote:


The standard only defines an ORDER BY clause inside of an aggregate for
ARRAY_AGG().  As an extension to the standard, we allow it for all
aggregates, which is very convenient for non-standard things like
string_agg().

However, it is completely useless for things like AVG() or SUM().  If
you include it, the aggregate will do the sort even though it is neither
required nor desired.

I am proposing something like pg_aggregate.aggordering which would be an
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
all aggregates would have 'a' but I am thinking that a lot of them could
be switched to 'f'.  In that case, if a user supplies an ordering, an
error is raised.



Should there perhaps also be an option for "ignored" where we'd allow the
user to specify it, but not actually do the sort because we know it's
pointless? Or maybe that should be the behaviour of "forbidden", which
should then perhaps have a different name?



I did think about that but I can't think of any reason we would want to 
silently ignore something the user has written.  If the ordering doesn't 
make sense, we should forbid it.




My main motivation behind this is to be able to optimize aggregates that
could stop early such as ANY_VALUE(), but also to self-optimize queries
written in error (or ignorance).

There is recurring demand for a first_agg() of some sort, and that one
(whether implemented in core or left to extensions) would use 'r' so
that an error is raised if the user does not supply an ordering.

I have not started working on this because I envision quite a lot of
bikeshedding, but this is the approach I am aiming for.

Thoughts?



For consistency, should we have a similar flag for DISTINCT? That could be
interesting to forbid for something like first_agg(), wouldn't it? I'm not
sure what the use case would be to require it, but maybe there is one?



I thought about that too, but decided it could be a separate patch 
because far fewer aggregates would need it.

--
Vik Fearing





Re: Ordering behavior for aggregates

2022-12-13 Thread Magnus Hagander
On Tue, Dec 13, 2022 at 1:51 PM Vik Fearing  wrote:

> The standard only defines an ORDER BY clause inside of an aggregate for
> ARRAY_AGG().  As an extension to the standard, we allow it for all
> aggregates, which is very convenient for non-standard things like
> string_agg().
>
> However, it is completely useless for things like AVG() or SUM().  If
> you include it, the aggregate will do the sort even though it is neither
> required nor desired.
>
> I am proposing something like pg_aggregate.aggordering which would be an
> enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently
> all aggregates would have 'a' but I am thinking that a lot of them could
> be switched to 'f'.  In that case, if a user supplies an ordering, an
> error is raised.
>

Should there perhaps also be an option for "ignored" where we'd allow the
user to specify it, but not actually do the sort because we know it's
pointless? Or maybe that should be the behaviour of "forbidden", which
should then perhaps have a different name?


> My main motivation behind this is to be able to optimize aggregates that
> could stop early such as ANY_VALUE(), but also to self-optimize queries
> written in error (or ignorance).
>
> There is recurring demand for a first_agg() of some sort, and that one
> (whether implemented in core or left to extensions) would use 'r' so
> that an error is raised if the user does not supply an ordering.
>
> I have not started working on this because I envision quite a lot of
> bikeshedding, but this is the approach I am aiming for.
>
> Thoughts?
>

For consistency, should we have a similar flag for DISTINCT? That could be
interesting to forbid for something like first_agg(), wouldn't it? I'm not
sure what the use case would be to require it, but maybe there is one?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Ordering behavior for aggregates

2022-12-13 Thread Vik Fearing
The standard only defines an ORDER BY clause inside of an aggregate for 
ARRAY_AGG().  As an extension to the standard, we allow it for all 
aggregates, which is very convenient for non-standard things like 
string_agg().


However, it is completely useless for things like AVG() or SUM().  If 
you include it, the aggregate will do the sort even though it is neither 
required nor desired.

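For example, with a hypothetical table tab(t text, n numeric):

-- The ordering is part of the result here: it determines the
-- concatenation order.
SELECT string_agg(t, ',' ORDER BY t) FROM tab;

-- Here the sort is pure overhead: SUM() returns the same value
-- no matter what order its inputs arrive in.
SELECT sum(n ORDER BY t) FROM tab;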

I am proposing something like pg_aggregate.aggordering which would be an 
enum of behaviors such as f=Forbidden, a=Allowed, r=Required.  Currently 
all aggregates would have 'a' but I am thinking that a lot of them could 
be switched to 'f'.  In that case, if a user supplies an ordering, an 
error is raised.


My main motivation behind this is to be able to optimize aggregates that 
could stop early such as ANY_VALUE(), but also to self-optimize queries 
written in error (or ignorance).


There is recurring demand for a first_agg() of some sort, and that one 
(whether implemented in core or left to extensions) would use 'r' so 
that an error is raised if the user does not supply an ordering.


I have not started working on this because I envision quite a lot of 
bikeshedding, but this is the approach I am aiming for.


Thoughts?
--
Vik Fearing




Re: Error-safe user functions

2022-12-13 Thread Amul Sul
On Mon, Dec 12, 2022 at 12:00 AM Tom Lane  wrote:
>
> Andrew Dunstan  writes:
> > Maybe as we work through the remaining input functions (there are about
> > 60 core candidates left on my list) we should mark them with a comment
> > if no adjustment is needed.
>
> I did a quick pass through them last night.  Assuming that we don't
> need to touch the unimplemented input functions (eg for pseudotypes),
> I count these core functions as still needing work:
>
> aclitemin
> bit_in
> box_in
> bpcharin
> byteain
> cash_in
> cidin
> cidr_in
> circle_in
> inet_in
> int2vectorin
> jsonpath_in
> line_in
> lseg_in
> macaddr8_in
> macaddr_in

Attaching patches changing these functions except bpcharin,
byteain, jsonpath_in, and cidin. I am continuing work on the next
items below:

> multirange_in
> namein
> oidin
> oidvectorin
> path_in
> pg_lsn_in
> pg_snapshot_in
> point_in
> poly_in
> range_in
> regclassin
> regcollationin
> regconfigin
> regdictionaryin
> regnamespacein
> regoperatorin
> regoperin
> regprocedurein
> regprocin
> regrolein
> regtypein
> tidin
> tsqueryin
> tsvectorin
> uuid_in
> varbit_in
> varcharin
> xid8in
> xidin
> xml_in
>
> and these contrib functions:
>
> hstore:
> hstore_in
> intarray:
> bqarr_in
> isn:
> ean13_in
> isbn_in
> ismn_in
> issn_in
> upc_in
> ltree:
> ltree_in
> lquery_in
> ltxtq_in
> seg:
> seg_in
>
> Maybe we should have a conversation about which of these are
> highest priority to get to a credible feature.  We clearly need
> to fix the remaining SQL-spec types (varchar and bpchar, mainly).
> At the other extreme, likely nobody would weep if we never fixed
> int2vectorin, for instance.
>
> I'm a little concerned about the cost-benefit of fixing the reg* types.
> The ones that accept type names actually use the core grammar to parse
> those.  Now, we probably could fix the grammar to be non-throwing, but
> it'd be very invasive and I'm not sure about the performance impact.
> It might be best to content ourselves with soft reporting of lookup
> failures, as opposed to syntax problems.
>

Regards,
Amul
From 4c4c18bd8104114351ca58a73a9d92fbb0c85dd2 Mon Sep 17 00:00:00 2001
From: Amul Sul 
Date: Tue, 13 Dec 2022 17:29:04 +0530
Subject: [PATCH v1 13/14] Change macaddr8_in to allow non-throw error
 reporting

---
 src/backend/utils/adt/mac8.c   | 36 +-
 src/test/regress/expected/macaddr8.out | 25 ++
 src/test/regress/sql/macaddr8.sql  |  6 +
 3 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/src/backend/utils/adt/mac8.c b/src/backend/utils/adt/mac8.c
index 24d219f6386..adb253a6ac0 100644
--- a/src/backend/utils/adt/mac8.c
+++ b/src/backend/utils/adt/mac8.c
@@ -35,7 +35,8 @@
 #define lobits(addr) \
   ((unsigned long)(((addr)->e<<24) | ((addr)->f<<16) | ((addr)->g<<8) | ((addr)->h)))
 
-static unsigned char hex2_to_uchar(const unsigned char *ptr, const unsigned char *str);
+static unsigned char hex2_to_uchar(const unsigned char *ptr, const unsigned char *str,
+   Node *escontext);
 
 static const signed char hexlookup[128] = {
 	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -58,7 +59,8 @@ static const signed char hexlookup[128] = {
  * the entire string, which is used only for error reporting.
  */
 static inline unsigned char
-hex2_to_uchar(const unsigned char *ptr, const unsigned char *str)
+hex2_to_uchar(const unsigned char *ptr, const unsigned char *str,
+			  Node *escontext)
 {
 	unsigned char ret = 0;
 	signed char lookup;
@@ -88,13 +90,10 @@ hex2_to_uchar(const unsigned char *ptr, const unsigned char *str)
 	return ret;
 
 invalid_input:
-	ereport(ERROR,
+	ereturn(escontext, 0,
 			(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
 			 errmsg("invalid input syntax for type %s: \"%s\"", "macaddr8",
 	str)));
-
-	/* We do not actually reach here */
-	return 0;
 }
 
 /*
@@ -104,6 +103,7 @@ Datum
 macaddr8_in(PG_FUNCTION_ARGS)
 {
 	const unsigned char *str = (unsigned char *) PG_GETARG_CSTRING(0);
+	Node	   *escontext = fcinfo->context;
 	const unsigned char *ptr = str;
 	macaddr8   *result;
 	unsigned char a = 0,
@@ -136,32 +136,32 @@ macaddr8_in(PG_FUNCTION_ARGS)
 		switch (count)
 		{
 			case 1:
-				a = hex2_to_uchar(ptr, str);
+				a = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 2:
-				b = hex2_to_uchar(ptr, str);
+				b = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 3:
-				c = hex2_to_uchar(ptr, str);
+				c = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 4:
-				d = hex2_to_uchar(ptr, str);
+				d = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 5:
-				e = hex2_to_uchar(ptr, str);
+				e = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 6:
-				f = hex2_to_uchar(ptr, str);
+				f = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 7:
-				g = hex2_to_uchar(ptr, str);
+				g = hex2_to_uchar(ptr, str, escontext);
 				break;
 			case 8:
-				h = hex2_to_uchar(ptr, str);
+				h = hex2_to_uchar(ptr, str, 

Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-13 Thread Amit Kapila
On Tue, Dec 13, 2022 at 7:35 AM Kyotaro Horiguchi
 wrote:
>
> At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila  
> wrote in
> > On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > > once and apply later. Our basic design is as follows:
> > >
> > > * All transactions are serialized to files if min_apply_delay is set to 
> > > non-zero.
> > > * After receiving the commit message and spending time, workers read and
> > >   apply spooled messages
> > >
> >
> > I think this may be more work than required because in some cases
> > doing I/O just to delay xacts will later lead to more work. Can't we
> > send some ping to walsender to communicate that walreceiver is alive?
> > We already seem to be sending a ping in LogicalRepApplyLoop if we
> > haven't heard anything from the server for more than
> > wal_receiver_timeout / 2. Now, it is possible that the walsender is
> > terminated due to some other reason and we need to see if we can
> > detect that or if it will only be detected once the walreceiver's
> > delay time is over.
>
> FWIW, I thought the same thing with Amit.
>
> What we should do here is have logrep workers notify the walsender that
> they're alive and the communication in between is fine, and maybe report
> the worker's status. Spontaneous send_feedback() calls while delaying will
> be sufficient for this purpose. We might need to suppress extra forced
> feedbacks instead. In contrast, the worker doesn't need to bother to
> know whether the peer is alive until it receives the next data. But
> we might need to adjust the wait_time in LogicalRepApplyLoop().
>
> But, I'm not sure what will happen when walsender is blocked by
> buffer-full for a long time.
>

Yeah, I think ideally it will time out, but if we have a solution where,
during the delay, we keep sending ping messages from time to time, it
should work fine. However, that needs to be verified. Do you see any
reasons why that won't work?
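Schematically, something like the below inside the apply worker while the
delay is pending (names borrowed from worker.c; an untested sketch, not a
patch):

	TimestampTz delay_until = TimestampTzPlusMilliseconds(commit_time,
														  min_apply_delay);

	for (;;)
	{
		long	diffms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
														 delay_until);

		if (diffms <= 0)
			break;

		/* never sleep past half the receiver timeout without a ping */
		diffms = Min(diffms, wal_receiver_timeout / 2);

		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 diffms, WAIT_EVENT_RECOVERY_APPLY_DELAY);
		ResetLatch(MyLatch);

		/* tell the walsender we are alive */
		send_feedback(last_received, true, false);
	}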

-- 
With Regards,
Amit Kapila.




Re: Raising the SCRAM iteration count

2022-12-13 Thread Daniel Gustafsson
> On 12 Dec 2022, at 15:47, Jonathan S. Katz  wrote:

> To throw on a bit of paint, if we do change it, we should likely follow what 
> would come out in an RFC.
> 
> While the SCRAM-SHA-512 RFC is still in draft[1], the latest draft 
> contains a "SHOULD" recommendation of 1, which was bumped up from 4096 in 
> an earlier version of the draft:

This is however the draft for a different algorithm: SCRAM-SHA-512.  We are
supporting SCRAM-SHA-256 which is defined in RFC7677.  The slightly lower
recommendation there makes sense as SHA-512 is more computationally expensive
than SHA-256.

It does raise an interesting point though: if we in the future add support for
SCRAM-SHA-512 (which seems reasonable to do) it's not good enough to have a
single GUC for SCRAM iterations; we'd need to be able to set the iteration
count per algorithm. I'll account for that when updating the patch downthread.
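For illustration, that could end up looking something like this in
postgresql.conf (names and values hypothetical):

	scram_sha256_iterations = 4096
	scram_sha512_iterations = 10000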

--
Daniel Gustafsson   https://vmware.com/





Re: Minimal logical decoding on standbys

2022-12-13 Thread Drouvot, Bertrand

Hi,

On 12/12/22 6:41 PM, Robert Haas wrote:

On Sat, Dec 10, 2022 at 3:09 AM Drouvot, Bertrand
 wrote:

Attaching V30, mandatory rebase due to 66dcb09246.


It's a shame that this hasn't gotten more attention, because the topic
is important, but I'm as guilty of being too busy to spend a lot of
time on it as everyone else.


Thanks for looking at it! Yeah, I think this is an important feature too.



Anyway, while I'm not an expert on this topic, 


Then, we are two ;-)
I "just" resurrected this very old thread and am doing the best I can to keep 
it moving forward.


I did spend a little
time looking at it today, especially 0001. Here are a few comments:

I think that it's not good for IndexIsAccessibleInLogicalDecoding and
RelationIsAccessibleInLogicalDecoding to both exist. Indexes and
tables are types of relations, so this invites confusion: when the
object in question is an index, it would seem that either one can be
applied, based on the names.


Agree.


I think the real problem here is that
RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
when the relation is a user-catalog table. It does so because it
relies on RelationIsUsedAsCatalogTable, and that macro relies on
checking whether the reloptions include user_catalog_table.



I think the macro is returning the right answer when the relation is a 
user-catalog table.
I think the purpose is to identify relations that are permitted for read-only 
access during logical decoding.
Those are the ones that have been created by initdb in the pg_catalog schema, 
or have been marked as user-provided catalog tables (that's what is documented 
in [1]).

Or did you mean when the relation is "NOT" a user-catalog table?


But here we can see where past thinking on this topic has been,
perhaps, a bit fuzzy. If that option were called user_catalog_relation
and had to be set on both tables and indexes as appropriate, then
RelationIsAccessibleInLogicalDecoding would already be doing the right
thing, and consequently there would be no need to add
IndexIsAccessibleInLogicalDecoding. 


Yeah, agree.


I think we should explore the idea
of making the existing macro return the correct answer rather than
adding a new one. It's probably too late to redefine the semantics of
user_catalog_table, although if anyone wants to argue that we could
require logical decoding plugins to set this for both indexes and
tables, and/or rename to say relation instead of table, and/or add a
parallel reloption called user_catalog_index, then let's talk about
that.

Otherwise, I think we can consider adjusting the definition of
RelationIsUsedAsCatalogTable. The simplest way to do that would be to
make it check indisusercatalog for indexes and do what it does already
for tables. Then IndexIsUserCatalog and
IndexIsAccessibleInLogicalDecoding go away and
RelationIsAccessibleInLogicalDecoding returns the right answer in all
cases.



That does sound like a valid option to me too; I'll look at it.
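
Maybe something like this rough (untested) sketch, with indisusercatalog
being the new pg_index column the patch proposes:

#define RelationIsUsedAsCatalogTable(relation)	\
	((relation)->rd_rel->relkind == RELKIND_INDEX ? \
	 (relation)->rd_index->indisusercatalog : \
	 ((relation)->rd_options && \
	  ((relation)->rd_rel->relkind == RELKIND_RELATION || \
	   (relation)->rd_rel->relkind == RELKIND_MATVIEW) ? \
	  ((StdRdOptions *) (relation)->rd_options)->user_catalog_table : false))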


But I also wonder if a new pg_index column is really the right
approach here. One fairly obvious alternative is to try to use the
user_catalog_table reloption in both places. We could try to propagate
that reloption from the table to its indexes; whenever it's set or
unset on the table, push that down to each index. We'd have to take
care not to let the property be changed independently on indexes,
though. This feels a little grotty to me, but it does have the
advantage of symmetry. 


I thought about this approach too when working on it. But I thought it would be "weird" 
to start propagating option(s) from table(s) to index(es). I mean, if that's an "option", 
why should it be propagated?
Furthermore, it seems to me that this option behaves more like a property 
(one that affects logical decoding), similar to logged/unlogged (which is 
reflected in pg_class.relpersistence, not in reloptions).


Another way to get symmetry is to go the other
way and add a new column pg_class.relusercatalog which gets used
instead of putting user_catalog_table in the reloptions, and
propagated down to indexes. 


Yeah, agree (see my previous point above).


But I have a feeling that the reloptions
code is not very well-structured to allow reloptions to be stored any
place but in pg_class.reloptions, so this may be difficult to
implement. 


Why not remove this "property" from reloptions? (That would probably need many 
more changes than the current approach, and would probably have to take care of 
the upgrade scenario too.)
I did not look at the details, but logged/unlogged is also propagated to the 
indexes, so maybe we could use the same approach here. But is it worth the 
probably added complexity (as compared to the current approach)?


Yet a third way is to have the index fetch the flag from
the associated table, perhaps when the relcache entry is built. But I
see no existing precedent for that in RelationInitIndexAccessInfo,
which I think is where it would be if we had it -- and that makes me
suspect that there might be good 

Re: Introduce a new view for checkpointer related stats

2022-12-13 Thread Nitin Jadhav
The patch looks good to me.

Thanks & Regards,
Nitin Jadhav

On Fri, Dec 2, 2022 at 11:20 AM Bharath Rupireddy
 wrote:
>
> On Wed, Nov 30, 2022 at 5:15 PM Bharath Rupireddy
>  wrote:
> >
> > I don't have a strong opinion about changing column names. However, if
> > we were to change it, I prefer to use names that
> > PgStat_CheckpointerStats has. BTW, that's what
> > PgStat_BgWriterStats/pg_stat_bgwriter and
> > PgStat_ArchiverStats/pg_stat_archiver uses.
>
> After thinking about this a while, I convinced myself to change the
> column names to be a bit more meaningful. I still think having
> checkpoints in the column names is needed because it also has other
> backend related columns. I'm attaching the v4 patch for further
> review.
> CREATE VIEW pg_stat_checkpointer AS
> SELECT
> pg_stat_get_timed_checkpoints() AS timed_checkpoints,
> pg_stat_get_requested_checkpoints() AS requested_checkpoints,
> pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
> pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
> pg_stat_get_buf_written_checkpoints() AS buffers_written_checkpoints,
> pg_stat_get_buf_written_backend() AS buffers_written_backend,
> pg_stat_get_buf_fsync_backend() AS buffers_fsync_backend,
> pg_stat_get_checkpointer_stat_reset_time() AS stats_reset;
>
> --
> Bharath Rupireddy
> PostgreSQL Contributors Team
> RDS Open Source Databases
> Amazon Web Services: https://aws.amazon.com




Re: [PATCH] Infinite loop while acquiring new TOAST Oid

2022-12-13 Thread Nikita Malakhov
Hi hackers!

I've prepared a patch with a 64-bit TOAST Value ID. It is a kind of
prototype
and needs some further work, but it is already working and ready to play
with.

I've introduced a 64-bit value ID field in varatt_external, but to keep it
compatible with older bases the 64-bit value is stored as a structure of two
32-bit fields, and the value ID was moved to be the last field in the
varatt_external structure. Also, I've introduced a different ID assignment
function: instead of using the global OID assignment function, I use
individual counters for each TOAST table, automatically cached; after a
server restart, when a new value is inserted into a TOAST table, the maximum
used value is looked up and used to assign the next one.
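
Schematically, the new pointer layout looks like this (field names are
illustrative; see the branch below for the real definition):

typedef struct varatt_long_external
{
	int32		va_rawsize;		/* original data size (includes header) */
	uint32		va_extinfo;		/* external size and compression method */
	Oid			va_toastrelid;	/* RelID of TOAST table containing the value */
	uint32		va_valueid_hi;	/* high half of the 64-bit value ID */
	uint32		va_valueid_lo;	/* low half of the 64-bit value ID */
}			varatt_long_external;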

>Andres Freund:
>I think we'll need to do something about the width of varatt_external to make
>the conversion to 64bit toast oids viable - and if we do, I don't think
>there's a decent argument for staying with 4 byte toast OIDs. I think the
>varatt_external equivalent would end up being smaller in just about all cases.
>And as you said earlier, the increased overhead inside the toast table / index
>is not relevant compared to the size of toasted datums.

There is some small overhead due to indexing 64-bit values. Also, for smaller
tables we could use a 32-bit ID instead of a 64-bit one, but then we would
have a problem when we reach the limit of 2^32.

Pg_upgrade seems to be not a very difficult case if we keep old TOAST
tables using 32-bit values.

Please have a look. I'd be grateful for some further directions.

GIT branch with this feature, rebased onto current master:
https://github.com/postgrespro/postgres/tree/toast_64bit

On Wed, Dec 7, 2022 at 12:00 AM Nikita Malakhov  wrote:

> Hi hackers!
>
> Here's some update on the subject - I've made a branch based on master from
> 28/11 and introduced 64-bit TOAST value ID there. It is not complete now
> but
> is already working and has some features:
> - extended structure for TOAST pointer (varatt_long_external) with 64-bit
> TOAST value ID;
> - individual ID counters for each TOAST table, being cached for faster
> access
> and lookup of maximum TOAST ID in use after server restart.
> https://github.com/postgrespro/postgres/tree/toast_64bit
>
> --
> Regards,
> Nikita Malakhov
> Postgres Professional
> https://postgrespro.ru/
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


0001_64_bit_toast_valueid_v1.patch
Description: Binary data


Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-13 Thread Amit Kapila
On Tue, Dec 13, 2022 at 4:36 AM Peter Smith  wrote:
>
> ~~~
>
> 3. pa_set_stream_apply_worker
>
> +/*
> + * Set the worker that required to apply the current streaming transaction.
> + */
> +void
> +pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
> +{
> + stream_apply_worker = winfo;
> +}
>
> Comment wording seems wrong.
>

I think something like "Cache the parallel apply worker information."
may be more suitable here.

Few more similar cosmetic comments:
1.
+ /*
+ * Unlock the shared object lock so that the parallel apply worker
+ * can continue to receive changes.
+ */
+ if (!first_segment)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);

This comment is missing in the new (0002) patch.

2.
+ if (!winfo->serialize_changes)
+ {
+ if (!first_segment)
+ pa_unlock_stream(winfo->shared->xid, AccessExclusiveLock);

I think we should write some comments on why we are not unlocking when
serializing changes.

3. Please add a comment like below in the patch to make it clear why
in stream_abort case we perform locking before sending the message.
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1858,6 +1858,13 @@ apply_handle_stream_abort(StringInfo s)
 * worker will wait on the lock for the next set of changes after
 * processing the STREAM_ABORT message if it is not already waiting
 * for STREAM_STOP message.
+*
+* It is important to perform this locking before sending the
+* STREAM_ABORT message so that the leader can hold the lock first
+* and the parallel apply worker will wait for the leader to release
+* the lock. This is the same as what we do in
+* apply_handle_stream_stop. See Locking Considerations atop
+* applyparallelworker.c.
 */
if (!toplevel_xact)

-- 
With Regards,
Amit Kapila.




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-13 Thread John Naylor
On Tue, Dec 13, 2022 at 8:00 AM Peter Geoghegan  wrote:
>
> On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis  wrote:
> > But the heuristic also seems off to me. What if you have lots of
partitions
> > in an append-only range-partitioned table? That would tend to use the
> > lazy freezing strategy (because each partition is small), but that's
> > not what you want. I understand heuristics aren't perfect, but it feels
> > like we could do something better.
>
> It is at least vastly superior to vacuum_freeze_min_age in cases like
> this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
> trigger freezing in any autovacuum given a table like pgbench_history
> (barring during aggressive mode), due to how it interacts with the
> visibility map. So we're practically guaranteed to do literally all
> freezing for an append-only table in an aggressive mode VACUUM.
>
> Worst of all, that happens on a timeline that has nothing to do with
> the physical characteristics of the table itself (like the number of
> unfrozen heap pages or something).

If the number of unfrozen heap pages is the thing we care about, perhaps
that, and not the total size of the table, should be the parameter that
drives freezing strategy?

> That said, I agree that the system-level picture of debt (the system
> level view of the number of unfrozen heap pages) is relevant, and that
> it isn't directly considered by the patch. I think that that can be
> treated as work for a future release. In fact, I think that there is a
> great deal that we could teach autovacuum.c about the system level
> view of things -- this is only one.

It seems an easier path to considering system-level debt (as measured by
unfrozen heap pages) would be to start with considering table-level debt
measured the same way.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Use get_call_result_type() more widely

2022-12-13 Thread Michael Paquier
On Tue, Dec 13, 2022 at 01:06:48PM +0530, Bharath Rupireddy wrote:
> A review comment in another thread [1] by Michael Paquier about the
> usage of get_call_result_type() instead of explicit building of
> TupleDesc made me think about using it more widely. Actually, the
> get_call_result_type() looks at the function definitions to figure the
> column names and build the required TupleDesc, usage of which avoids
> duplication of the column names between pg_proc.dat/function
> definitions and source code. Also, it saves a good number of LOC ~415
> [2] and the size of all the object files put together gets reduced by
> ~4MB, which means, the postgres binary becomes leaner by ~4MB [3]. I'm
> attaching a patch for these changes.

I have wanted to look at that when poking at the interface for
materialized SRFs but lacked steam back then.  Even after this
change, we still have coverage for CreateTemplateTupleDesc() and
TupleDescInitEntry() through the GUCs/SHOW or even WAL sender, so the
coverage does not worry me much.  Backpatch conflicts may be a point
of contention, but that's pretty much in the same spirit as
SetSingleFuncCall()/InitMaterializedSRF().

All in all, +1 (I still need to check in detail what you have here;
it looks rather fine at a quick glance).
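
For reference, the pattern the patch standardizes on looks roughly like this
(a made-up function, shown only to illustrate the shape):

#include "postgres.h"
#include "access/htup_details.h"
#include "fmgr.h"
#include "funcapi.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(demo_result_type);

/*
 * The result TupleDesc is derived from the function's declared OUT
 * parameters via get_call_result_type(), instead of being open-coded
 * with CreateTemplateTupleDesc()/TupleDescInitEntry().
 */
Datum
demo_result_type(PG_FUNCTION_ARGS)
{
	TupleDesc	tupdesc;
	Datum		values[2];
	bool		nulls[2] = {false, false};

	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	values[0] = Int64GetDatum(42);
	values[1] = CStringGetTextDatum("example");

	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values,
													  nulls)));
}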
--
Michael


signature.asc
Description: PGP signature