Re: slab allocator performance issues
On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for
> full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >=
> (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >>
> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

Doesn't this create a sort of round-robin use of the free list? What we want is a sort of "histogram" bucket set of free lists so we can group together blocks that have a close-enough free number of chunks. Unless I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version. I tried it out and on my AMD machine, it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so other machines with different latencies on those instructions might show something different. I attached what I used to test if anyone else wants a go.

AMD Zen2
$ ./freecount 20
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 20
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed and optimized out my loop, so some more trickery might be needed for a useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLAB_FREELIST_COUNT 9

int main(int argc, char **argv)
{
	clock_t		start, end;
	double		v1_time, v2_time;
	int			freecount;
	int			freelist_shift = 0;
	int			chunksPerBlock;
	int			index;
	int			zerocount = 0;

	if (argc < 2)
	{
		printf("Syntax: %s <chunksPerBlock>\n", argv[0]);
		return -1;
	}

	chunksPerBlock = atoi(argv[1]);
	printf("chunksPerBlock = %d\n", chunksPerBlock);

	while ((chunksPerBlock >> freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
		freelist_shift++;

	printf("freelist_shift = %d\n", freelist_shift);

	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = (freecount + (1 << freelist_shift) - 1) >> freelist_shift;

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v1_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'a' in %f seconds\n", v1_time);
	printf("zerocount = %d\n", zerocount);

	zerocount = 0;
	start = clock();
	for (freecount = 0; freecount <= chunksPerBlock; freecount++)
	{
		index = -(-freecount >> freelist_shift);

		/* try to prevent optimizing the above out */
		if (index == 0)
			zerocount++;
	}
	end = clock();

	v2_time = (double) (end - start) / CLOCKS_PER_SEC;
	printf("Test 'd' in %f seconds (%f%% faster)\n", v2_time,
		   v1_time / v2_time * 100.0 - 100.0);
	printf("zerocount = %d\n", zerocount);

	return 0;
}
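For readers following along: both expressions compute the same bucket index, namely a ceiling division of freecount by 2^freelist_shift. A tiny standalone check of that equivalence (separate from the attached test program; note the negated-shift form relies on arithmetic right shift of negative values, which C leaves implementation-defined but which behaves as expected on the platforms discussed here):

```
#include <assert.h>

int
main(void)
{
	/* -(-x >> s) and (x + (1 << s) - 1) >> s both yield ceil(x / 2^s) */
	for (int s = 0; s < 8; s++)
		for (int x = 0; x < 4096; x++)
			assert(-(-x >> s) == (x + (1 << s) - 1) >> s);
	return 0;
}
```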
Re: pg_dump: Remove "blob" terminology
On 02.12.22 09:34, Daniel Gustafsson wrote:
> > On 2 Dec 2022, at 08:09, Peter Eisentraut wrote:
> > fixed
>
> +1 on this version of the patch, LGTM.

committed

> > I also put back the old long options forms in the documentation and
> > help but marked them deprecated.
> >
> >   +  --blobs (deprecated)
>
> While not in scope for this patch, I wonder if we should add a similar
> (deprecated) marker on other commandline options which are documented
> to be deprecated.  -i on bin/postgres comes to mind as one candidate.

makes sense
Re: Generate pg_stat_get_* functions with Macros
Hi,

On 12/5/22 8:44 AM, Michael Paquier wrote:
> On Mon, Dec 05, 2022 at 08:27:15AM +0100, Drouvot, Bertrand wrote:
> > On 12/4/22 6:32 PM, Nathan Bossart wrote:
> > > Alright.  I marked this as ready-for-committer.
> >
> > Thanks!
>
> Well, that's kind of nice:
>  5 files changed, 139 insertions(+), 396 deletions(-)
> And I like removing code, so..

Thanks for looking at it!

> In the same area, I am counting a total of 21 (?) pgstat routines for
> databases that rely on pgstat_fetch_stat_dbentry() while returning an
> int64.  This would lead to more cleanup.

Yeah, good point, thanks!

I'll look at the "databases" ones, but I think in a separate patch. The reason is that the current one is preparatory work for [1]. That means, once the current patch is committed, working on [1] and "cleaning" the databases one could be done in parallel. Sounds good to you?

[1]: https://commitfest.postgresql.org/41/3984/

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: Operation log for major operations
Hi!

> I think storing this in pg_control is a bad idea. That file is
> extremely critical and if you break it, you're pretty much SOL on
> recovering your data. I suggest that this should use a separate file.

Thanks. The operation log location was changed to 'global/pg_control_log' (if the name is not pretty, it can be changed).

I attached the patch (v2-0001-Operation-log.patch) and a description of the operation log (Operation-log.txt).

With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Introduction.
-------------
The operation log is designed to store information about critical system events (like pg_upgrade, pg_resetwal, pg_rewind, etc.). This information is not interesting to the ordinary user, but it is very important for the vendor's technical support.

An example: a client complains about DB damage to technical support (the damage was caused by pg_resetwal, which was "silently" executed by one of the administrators).

Concepts.
---------
* The operation log is placed in the separate file 'global/pg_control_log' (log size is 8kB);
* the user can not modify the operation log; the log can be changed by function call only (from postgres or from postgres utilities);
* the operation log is a ring buffer (with CRC-32 protection); deleting entries from the operation log is possible only when the buffer is overflowed;
* an SQL function is used to read the data of the operation log.

Example of operation log data.
------------------------------
> select * from pg_operation_log();

    event    | edition | version |    lsn    |          last          | count
-------------+---------+---------+-----------+------------------------+-------
 startup     | vanilla | 10.22.0 | 0/828     | 2022-11-18 23:06:27+03 |     1
 pg_upgrade  | vanilla | 15.0.0  | 0/828     | 2022-11-18 23:06:27+03 |     1
 startup     | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:11:53+03 |     3
 pg_resetwal | vanilla | 15.0.0  | 0/80001F8 | 2022-11-18 23:09:53+03 |     2
(4 rows)

Some details about inserting data to the operation log.
-------------------------------------------------------
There are two modes of inserting information about events in the operation log.

* MERGE mode (events "startup", "pg_resetwal", "pg_rewind"). We search the ring buffer of the operation log for an event of the same type ("startup" for example) with the same version number. If the event was found, we increment the event counter by 1 and update the date/time of the event (the "last" field) with the current value. If the event was not found, we add this event to the ring buffer (see INSERT mode).

* INSERT mode (events "bootstrap", "pg_upgrade", "promoted"). We add an event to the ring buffer (without searching).

Operation log structures.
-------------------------
The operation log is placed in the pg_control_log file (file size PG_OPERATION_LOG_FULL_SIZE=8192). The operation log is a ring buffer of 341 elements (24 bytes each) with a header (8 bytes).

a) Operation log element:

typedef struct OperationLogData
{
	uint8		ol_type;		/* operation type */
	uint8		ol_edition;		/* postgres edition */
	uint16		ol_count;		/* number of operations */
	uint32		ol_version;		/* postgres version */
	pg_time_t	ol_timestamp;	/* = int64, operation date/time */
	XLogRecPtr	ol_lsn;			/* = uint64, last check point record ptr */
} OperationLogData;

Description of fields:

* ol_type - event type. The types are contained in an enum:

typedef enum
{
	OLT_BOOTSTRAP = 1,	/* bootstrap */
	OLT_STARTUP,		/* server startup */
	OLT_RESETWAL,		/* pg_resetwal */
	OLT_REWIND,			/* pg_rewind */
	OLT_UPGRADE,		/* pg_upgrade */
	OLT_PROMOTED,		/* promoted */
	OLT_NumberOfTypes	/* should be last */
} ol_type_enum;

* ol_edition - PostgreSQL edition (this field is important for custom PostgreSQL editions).
The editions are contained in an enum:

typedef enum
{
	ED_PG_ORIGINAL = 0
	/* Here can be custom editions */
} PgNumEdition;

* ol_count - the number of fired events in case events of this type can be merged (otherwise 1). Maximum is 65535;

* ol_version - version formed using PG_VERSION_NUM (PG_VERSION_NUM*100; for example: 13000800 for v13.8). The two last digits can be used for the patch version;

* ol_timestamp - date/time of the last fired event in case events of this type can be merged (otherwise - the date/time of the event);

* ol_lsn - "Latest checkpoint location" value (ControlFile->checkPoint) contained in the pg_control file at the time of the event.

b) Operation log header:

typedef struct OperationLogHeader
{
	pg_crc32c	ol_crc;		/* CRC of operation l
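Stepping back to the MERGE/INSERT modes described earlier, a simplified sketch of the merge path may make the behaviour concrete. It uses the OperationLogData layout from this message; ol_insert_event() is a hypothetical helper standing in for the INSERT-mode path, and all buffer locking and CRC maintenance is omitted:

```
/* hypothetical helper: INSERT-mode path (adds a fresh ring entry) */
static void ol_insert_event(OperationLogData *ring, int nused, uint8 type,
							uint32 version, pg_time_t now, XLogRecPtr lsn);

/*
 * MERGE-mode insertion: find an existing entry with the same type and
 * version and bump its counter, otherwise fall back to INSERT mode.
 */
static void
ol_merge_event(OperationLogData *ring, int nused, uint8 type,
			   uint32 version, pg_time_t now, XLogRecPtr lsn)
{
	for (int i = 0; i < nused; i++)
	{
		if (ring[i].ol_type == type && ring[i].ol_version == version)
		{
			ring[i].ol_count++;			/* merge: bump the event counter */
			ring[i].ol_timestamp = now; /* remember the latest occurrence */
			ring[i].ol_lsn = lsn;
			return;
		}
	}

	/* No matching entry: fall back to INSERT mode. */
	ol_insert_event(ring, nused, type, version, now, lsn);
}
```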
Re: Generate pg_stat_get_* functions with Macros
On Mon, Dec 05, 2022 at 09:11:43AM +0100, Drouvot, Bertrand wrote:
> Means, once the current patch is committed, working on [1] and
> "cleaning" the databases one could be done in parallel. Sounds good
> to you?

Doing that in a separate patch is fine by me.
--
Michael
Order getopt arguments
I had noticed that most getopt() or getopt_long() calls had their letter lists in pretty crazy orders. There might have been occasional attempts at grouping, but those then haven't been maintained as new options were added. To restore some sanity to this, I went through and ordered them alphabetically.

From 022b6f7665e7f38fd773f7d44c892344b714f795 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Mon, 5 Dec 2022 09:15:53 +0100
Subject: [PATCH] Order getopt arguments

Order the letters in the arguments of getopt() and getopt_long(), as well as in the subsequent switch statements. In most cases, I used alphabetical with lower case first. In a few cases, existing different orders (e.g., upper case first) was kept to reduce the diff size.
---
 src/backend/bootstrap/bootstrap.c             |  52 ++---
 src/backend/postmaster/postmaster.c           |  56 ++---
 src/backend/tcop/postgres.c                   |  54 ++---
 src/bin/pg_amcheck/pg_amcheck.c               |  10 +-
 src/bin/pg_archivecleanup/pg_archivecleanup.c |   2 +-
 src/bin/pg_basebackup/pg_basebackup.c         | 108 -
 src/bin/pg_basebackup/pg_receivewal.c         |  41 ++--
 src/bin/pg_basebackup/pg_recvlogical.c        |   8 +-
 src/bin/pg_checksums/pg_checksums.c           |  14 +-
 src/bin/pg_upgrade/option.c                   |   2 +-
 src/bin/pgbench/pgbench.c                     | 206 +-
 src/bin/scripts/clusterdb.c                   |  38 ++--
 src/bin/scripts/createdb.c                    |  44 ++--
 src/bin/scripts/dropdb.c                      |  20 +-
 src/bin/scripts/dropuser.c                    |  14 +-
 src/bin/scripts/reindexdb.c                   |  56 ++---
 src/bin/scripts/vacuumdb.c                    |  98 -
 src/interfaces/ecpg/preproc/ecpg.c            |  91
 .../modules/libpq_pipeline/libpq_pipeline.c   |   8 +-
 19 files changed, 458 insertions(+), 464 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 661ebacb0c..d623ad9d50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -228,6 +228,32 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 			case 'B':
 				SetConfigOption("shared_buffers", optarg, PGC_POSTMASTER, PGC_S_ARGV);
 				break;
+			case 'c':
+			case '-':
+				{
+					char	   *name,
+							   *value;
+
+					ParseLongOption(optarg, &name, &value);
+					if (!value)
+					{
+						if (flag == '-')
+							ereport(ERROR,
+									(errcode(ERRCODE_SYNTAX_ERROR),
+									 errmsg("--%s requires a value",
+											optarg)));
+						else
+							ereport(ERROR,
+									(errcode(ERRCODE_SYNTAX_ERROR),
+									 errmsg("-c %s requires a value",
+											optarg)));
+					}
+
+					SetConfigOption(name, value, PGC_POSTMASTER, PGC_S_ARGV);
+					pfree(name);
+					pfree(value);
+					break;
+				}
 			case 'D':
 				userDoption = pstrdup(optarg);
 				break;
@@ -265,32 +291,6 @@ BootstrapModeMain(int argc, char *argv[], bool check_only)
 							PGC_S_DYNAMIC_DEFAULT);
 				}
 				break;
-			case 'c':
-			case '-':
-				{
-					char	   *name,
-							   *value;
-
-					ParseLongOption(optarg, &name, &value);
-					if (!value)
-					{
-						if (flag == '-')
-							ereport(ERROR,
-
Re: Order getopt arguments
Hello Peter,

> I had noticed that most getopt() or getopt_long() calls had their letter
> lists in pretty crazy orders. There might have been occasional attempts
> at grouping, but those then haven't been maintained as new options were
> added. To restore some sanity to this, I went through and ordered them
> alphabetically.

I agree that a more or less random historical order does not make much sense.

For pgbench, ISTM that sorting per functionality then alphabetical would be better than pure alphabetical because it has 2 modes. Such sections might be (1) general, (2) connection, (3) common/shared, (4) initialization and (5) benchmarking, with some comments on each.

What do you think? If okay, I'll send you a patch for that.

--
Fabien.
Re: Optimize common expressions in projection evaluation
> Which is properly written as the following, using lateral, which also avoids
> the problem you describe:
>
> INSERT INTO tbl
> SELECT func_call.*
> FROM ft
> JOIN LATERAL convert_func(ft.rawdata) AS func_call ON true;

I didn't fully realize this syntax until you pointed it out. I just tried it out in our case and it works well. I think my problem is mostly resolved without the need for this patch. Thanks!

It's still good to do something about the normal (func(v)).* syntax if it's still considered legal. I will give a try to expanding it more cleverly and see if we can avoid the duplicate evaluation issue.

Peifeng Qiu
Re: [PATCH] Add native windows on arm64 support
On 05/12/2022 05:12, Michael Paquier wrote:
> - Last comes OpenSSL, that supports amd64_arm64 as build target (see
> NOTES-WINDOWS.md), and the library names are libssl.lib and
> libcrypto.lib. Looking at https://slproweb.com/products/Win32OpenSSL.html,
> there are experimental builds for arm64 with OpenSSL 3.0. Niyas or
> somebody else, could you look at the contents of lib/VC/ and see what we
> could rely on for the debug builds after installing this MSI? We should
> rely on something like lib/VC/sslcrypto64MD.lib or
> lib/VC/sslcrypto32MD.lib, but for arm64.

I tried that installer, and I can see the following libraries installed in the lib/VC location:

libcryptoarm64MD.lib
libcryptoarm64MDd.lib
libcryptoarm64MT.lib
libcryptoarm64MTd.lib
libsslarm64MD.lib
libsslarm64MDd.lib
libsslarm64MT.lib
libsslarm64MTd.lib

> -	USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => undef,
> +	USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK => $self->{platform} eq "ARM64" ? : 1 : undef,
>
> Did you actually test this patch? This won't work at all with perl, per
> se the double colon after the question mark.

Yes, I did a full build. I am not sure why I didn't see any error with that. My perl skills are very limited and I started with something a bit more naive like "$self->{platform} == "ARM64" ? : 1 : undef". But that didn't work and I changed to using eq, and the compilation was fine. Thanks for fixing the patch.

> For now, please find attached an updated patch with all the fixes I
> could come up with.

Thanks. I did a quick sanity build with your updated patch and it looks fine.

--
Niyas
Re: Perform streaming logical transactions by background workers and parallel apply
On Sun, Dec 4, 2022 at 4:48 PM houzj.f...@fujitsu.com wrote:
>
> On Friday, December 2, 2022 4:59 PM Peter Smith wrote:
> >
> > ~~~
> >
> > 12. pa_clean_subtrans
> >
> > +/* Reset the list that maintains subtransactions. */
> > +void
> > +pa_clean_subtrans(void)
> > +{
> > +	subxactlist = NIL;
> > +}
> >
> > Maybe a more informative function name would be pa_reset_subxactlist()?
>
> I thought the current name is more consistent with pa_start_subtrans.
>

Then how about changing the name to pa_reset_subtrans()?

> >
> > 21. apply_worker_clean_exit
> >
> > I wasn't sure if calling this a 'clean' exit meant anything much.
> >
> > How about:
> > - apply_worker_proc_exit, or
> > - apply_worker_exit
>
> I thought the clean means the exit number is 0 (proc_exit(0)) and is
> not due to any ERROR. I am not sure if proc_exit or exit is better.
>
> I have addressed other comments in the new version patch.
>

+1 for apply_worker_exit. One minor suggestion for a recent change in v56-0001*:

 /*
- * A hash table used to cache streaming transactions being applied and the
- * parallel application workers required to apply transactions.
+ * A hash table used to cache the state of streaming transactions being
+ * applied by the parallel apply workers.
  */
 static HTAB *ParallelApplyTxnHash = NULL;

--
With Regards,
Amit Kapila.
Marking options deprecated in help output
In the pg_dump blob terminology thread it was briefly discussed [0] to mark parameters as deprecated in the --help output. The attached is a quick diff to show what that would look like. Personally I think it makes sense; not everyone will read the docs.

--
Daniel Gustafsson	https://vmware.com/

[0] 95467596-834d-baa4-c67c-5db1096ed...@enterprisedb.com

deprecated_options.diff
Description: Binary data
Re: Using WaitEventSet in the postmaster
On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro wrote:
> Here's an iteration like that.  Still WIP grade.  It passes, but there
> must be something I don't understand about this computer program yet,
> because if I move the "if (pending_..." section up into the block
> where WL_LATCH_SET has arrived (instead of testing those variables
> every time through the loop), a couple of tests leave zombie
> (unreaped) processes behind, indicating that something funky happened
> to the state machine that I haven't yet grokked.  Will look more next
> week.

Duh. The reason for that was the pre-existing special case for PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children to exit, which I needed to change to a latch wait. Fixed in the next iteration, attached.

The reason for the existing sleep-based approach was that we didn't want to accept any more connections (or spin furiously because the listen queue was non-empty). So in this version I invented a way to suppress socket events temporarily with WL_SOCKET_IGNORE, and then reactivate them after crash reinit.

Still WIP, but I hope travelling in the right direction.

From 65b5fa1f7024cb78cee9ba57d36a78dc17ffe492 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 9 Nov 2022 22:59:58 +1300
Subject: [PATCH v3] Give the postmaster a WaitEventSet and a latch.

Traditionally, the postmaster's architecture was quite unusual. It did a lot of work inside signal handlers, which were only unblocked while waiting in select() to make that safe.

Switch to a more typical architecture, where signal handlers just set flags and use a latch to close races. Now the postmaster looks like all other PostgreSQL processes, multiplexing its event processing in epoll_wait()/kevent()/poll()/WaitForMultipleObjects() depending on the OS.

Changes:

* WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix, this is just another name for WL_SOCKET_READABLE, but Windows has a different underlying event; this mirrors WL_SOCKET_CONNECTED on the other end of a connection)

* WL_SOCKET_IGNORE is a new way to stop waking up for new incoming connections while shutting down.

* Small adjustments to WaitEventSet to allow running in the postmaster.

* Allow the postmaster to set up its own local latch. For now we don't want other backends setting the postmaster's latch directly (perhaps later we'll figure out how to use a shared latch "robustly", so that memory corruption can't interfere with the postmaster's cleanup-and-restart responsibilities, but for now there is a two-step signal protocol SIGUSR1 -> SIGURG).

* The existing signal handlers are cut in two: a handle_XXX part that sets a pending_XXX variable and sets the local latch, and a process_XXX part.

* ServerLoop(), the process_XXX() functions and PostmasterStateMachine() now all take a pointer to a Postmaster object that lives on the stack as a parameter that initially holds the WaitEventSet they need to do their job. Many other global variables could be moved into it, but that's not done here.

* Signal handlers are now installed with the regular pqsignal() function rather than the special pqsignal_pm() function; the concerns about the portability of SA_RESTART vs select() are no longer relevant: SUSv2 left it implementation-defined whether select() restarts, but didn't add that qualification for poll(), and it doesn't matter anyway because we call SetLatch() creating a new reason to wake up.
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
---
 src/backend/libpq/pqsignal.c        |  40 ---
 src/backend/postmaster/postmaster.c | 413 +++-
 src/backend/storage/ipc/latch.c     |  22 ++
 src/backend/tcop/postgres.c         |   1 -
 src/backend/utils/init/miscinit.c   |  13 +-
 src/include/libpq/pqsignal.h        |   3 -
 src/include/miscadmin.h             |   1 +
 src/include/storage/latch.h         |   9 +-
 8 files changed, 266 insertions(+), 236 deletions(-)

diff --git a/src/backend/libpq/pqsignal.c b/src/backend/libpq/pqsignal.c
index 1ab34c5214..718043a39d 100644
--- a/src/backend/libpq/pqsignal.c
+++ b/src/backend/libpq/pqsignal.c
@@ -97,43 +97,3 @@ pqinitmask(void)
 	sigdelset(&StartupBlockSig, SIGALRM);
 #endif
 }
-
-/*
- * Set up a postmaster signal handler for signal "signo"
- *
- * Returns the previous handler.
- *
- * This is used only in the postmaster, which has its own odd approach to
- * signal handling.  For signals with handlers, we block all signals for the
- * duration of signal handler execution.  We also do not set the SA_RESTART
- * flag; this should be safe given the tiny range of code in which the
- * postmaster ever unblocks signals.
- *
- * pqinitmask() must have been invoked previously.
- */
-pqsigfunc
-pqsignal_pm(int signo, pqsigfunc func)
-{
-	struct sigaction act,
-				oact;
-
-	act.sa_handler = f
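The handler split described in the commit message (handle_XXX sets a flag and the latch; process_XXX does the work in the main loop) follows the standard latch pattern used elsewhere in the tree. A minimal sketch, with simplified names rather than the patch's actual code:

```
#include "postgres.h"
#include "storage/latch.h"

/* set from signal context; sig_atomic_t writes are async-signal-safe */
static volatile sig_atomic_t pending_pmsignal = false;

/* handle_XXX: runs in signal context, only records the event */
static void
handle_pmsignal(SIGNAL_ARGS)
{
	pending_pmsignal = true;
	SetLatch(MyLatch);			/* wake the event loop (the postmaster's
								 * own local latch, in this patch) */
}

/* process_XXX: runs in the main event loop, does the real work */
static void
process_pmsignal(void)
{
	if (!pending_pmsignal)
		return;
	pending_pmsignal = false;
	/* ... perform the actual state-machine work here ... */
}
```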
Re: doc: add missing "id" attributes to extension packaging page
On 2022-Dec-05, Ian Lawrence Barwick wrote:

> On this page:
>
>     https://www.postgresql.org/docs/current/extend-extensions.html
>
> three of the sections are missing an "id" attribute; patch adds
> these. Noticed when trying to create a stable link to one of the affected
> sections.

Hm, I was reminded of this patch here that adds IDs in a lot of places
https://postgr.es/m/3bac458c-b121-1b20-8dea-0665986fa...@gmx.de

and this other one
https://postgr.es/m/76287ac6-f415-8562-fdaa-5876380c0...@gmx.de
which adds XSL stuff for adding selectable anchors next to each id-carrying item.

--
Álvaro Herrera               Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Avoid streaming the transaction which are skipped (in corner cases)
On Mon, Dec 5, 2022 at 9:21 AM Dilip Kumar wrote:
>
> On Mon, Dec 5, 2022 at 8:59 AM Amit Kapila wrote:
> >
> > On Sun, Dec 4, 2022 at 5:14 PM houzj.f...@fujitsu.com wrote:
> > >
> > > On Saturday, December 3, 2022 7:37 PM Amit Kapila wrote:
> > > > Apart from the above, I have slightly adjusted the comments in the
> > > > attached. Do let me know what you think of the attached.
> > >
> > > Thanks for updating the patch. It looks good to me.
> > >
> >
> > I feel the function name ReorderBufferLargestTopTXN() is slightly
> > misleading because it also checks some of the streaming properties
> > (like whether the TXN has partial changes and whether it contains any
> > streamable change). Shall we rename it to
> > ReorderBufferLargestStreamableTopTXN() or something like that?
>
> Yes that makes sense

I have done this change in the attached patch.

> > The other point to consider is whether we need to have a test case for
> > this patch. I think before this patch if the size of DDL changes in a
> > transaction exceeds logical_decoding_work_mem, the empty streams will
> > be output in the plugin but after this patch, there won't be any such
> > stream.

I tried this test, but I think generating 64k of data with just CID messages will make the test case really big. I tried using multiple sessions such that one session makes the reorder buffer full but contains partial changes so that we try to stream another transaction, but that is not possible in an automated test to consistently generate the partial change. I think we need something like this [1] so that we can better control the streaming.

[1] https://www.postgresql.org/message-id/OSZPR01MB631042582805A8E8615BC413FD329%40OSZPR01MB6310.jpnprd01.prod.outlook.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

From d15b33d74952a78c4050ed7f1235bbb675643478 Mon Sep 17 00:00:00 2001
From: Amit Kapila
Date: Sat, 3 Dec 2022 12:11:29 +0530
Subject: [PATCH v5] Avoid unnecessary streaming of transactions during
 logical replication.

After restart, we don't perform streaming of an in-progress transaction if it was previously decoded and confirmed by the client. To achieve that we were comparing the END location of the WAL record being decoded with the WAL location we have already decoded and confirmed by the client. While decoding the commit record, to decide whether to process and send the complete transaction, we compare its START location with the WAL location we have already decoded and confirmed by the client.

Now, if we need to queue some change in the transaction while decoding the commit record (e.g. snapshot), it is possible that we decide to stream the transaction but later commit processing decides to skip it. In such a case, we would needlessly send the changes and later, when we decide to skip it, we will send a stream abort.

We also sometimes decide to stream the changes when we actually just need to process them locally, like a change for invalidations. This will lead us to send empty streams. To avoid this, while queuing each change for decoding, we remember whether the transaction has any change that actually needs to be sent downstream and use that information later to decide whether to stream the transaction or not.
Author: Dilip Kumar
Reviewed-by: Hou Zhijie, Ashutosh Bapat, Shi yu, Amit Kapila
Discussion: https://postgr.es/m/CAFiTN-tHK=7lzfrps8fbt2ksrojgqbzywcgxst2bm9-rjja...@mail.gmail.com
---
 src/backend/replication/logical/reorderbuffer.c | 51 ++---
 src/include/replication/reorderbuffer.h         | 23 +++
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 31f7381..6e11056 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -695,9 +695,9 @@ ReorderBufferTXNByXid(ReorderBuffer *rb, TransactionId xid, bool create,
  * Record the partial change for the streaming of in-progress transactions. We
  * can stream only complete changes so if we have a partial change like toast
  * table insert or speculative insert then we mark such a 'txn' so that it
- * can't be streamed. We also ensure that if the changes in such a 'txn' are
- * above logical_decoding_work_mem threshold then we stream them as soon as we
- * have a complete change.
+ * can't be streamed. We also ensure that if the changes in such a 'txn' can
+ * be streamed and are above logical_decoding_work_mem threshold then we stream
+ * them as soon as we have a complete change.
  */
 static void
 ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
@@ -762,7 +762,8 @@ ReorderBufferProcessPartialChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	 */
 	if (ReorderBufferCanStartStreaming(rb) &&
 		!(rbtxn_has_partial_change(toptxn)) &&
-		rbtxn_is_serialized(txn))
+		rbtxn_is_serialized(txn) &&
+		rbtxn_has_streamable_change(to
Re: slab allocator performance issues
On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to.

The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.

--
John Naylor
EDB: http://www.enterprisedb.com
MemoizePath fails to work for partitionwise join
I happened to notice that currently MemoizePath cannot be generated for partitionwise join. I traced that and found the problem.

One condition we need to meet to generate MemoizePath is that the inner path's ppi_clauses should have the form "outer op inner" or "inner op outer", as checked by clause_sides_match_join in paraminfo_get_equal_hashops. Note that when we are at this check, the inner path has not yet had the chance to be re-parameterized (that is done later in try_nestloop_path), so if it is parameterized, it is parameterized by the topmost parent of the outer rel, not the outer rel itself. Thus this check performed by clause_sides_match_join cannot succeed if we are joining two child rels.

The fix is straightforward: just use outerrel->top_parent if it is not null and leave the reparameterization work to try_nestloop_path. In addition, I believe when reparameterizing MemoizePath we need to adjust its param_exprs too.

Attached is the patch for the fix.

Thanks
Richard

v1-0001-Fix-MemoizePath-for-partitionwise-join.patch
Description: Binary data
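To illustrate the kind of plan shape at stake, here is a hypothetical setup (table and column names invented for this example, not taken from the patch's regression tests) where partitionwise child joins are planned separately, and where the fix would let a Memoize node be considered above a parameterized scan of the inner child. Depending on statistics and settings the planner may of course still prefer another join method:

```
-- Two hash-partitioned tables joined on the partition key.
CREATE TABLE prt1 (a int, b int) PARTITION BY HASH (a);
CREATE TABLE prt1_p0 PARTITION OF prt1 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE prt2 (a int, c int) PARTITION BY HASH (a);
CREATE TABLE prt2_p0 PARTITION OF prt2 FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES WITH (MODULUS 2, REMAINDER 1);

SET enable_partitionwise_join = on;

-- Each child join (prt1_p0/prt2_p0 and prt1_p1/prt2_p1) is planned on
-- its own; with the fix, a nested loop over a child pair may use Memoize.
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a = t2.a;
```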
Re: Avoid streaming the transaction which are skipped (in corner cases)
On Mon, Dec 5, 2022 at 3:41 PM Dilip Kumar wrote:
>
> On Mon, Dec 5, 2022 at 9:21 AM Dilip Kumar wrote:
> >
> > On Mon, Dec 5, 2022 at 8:59 AM Amit Kapila wrote:
> > >
> > > On Sun, Dec 4, 2022 at 5:14 PM houzj.f...@fujitsu.com wrote:
> > > >
> > > > On Saturday, December 3, 2022 7:37 PM Amit Kapila wrote:
> > > > > Apart from the above, I have slightly adjusted the comments in the
> > > > > attached. Do let me know what you think of the attached.
> > > >
> > > > Thanks for updating the patch. It looks good to me.
> > > >
> > >
> > > I feel the function name ReorderBufferLargestTopTXN() is slightly
> > > misleading because it also checks some of the streaming properties
> > > (like whether the TXN has partial changes and whether it contains any
> > > streamable change). Shall we rename it to
> > > ReorderBufferLargestStreamableTopTXN() or something like that?
> >
> > Yes that makes sense
>
> I have done this change in the attached patch.
>
> > > The other point to consider is whether we need to have a test case for
> > > this patch. I think before this patch if the size of DDL changes in a
> > > transaction exceeds logical_decoding_work_mem, the empty streams will
> > > be output in the plugin but after this patch, there won't be any such
> > > stream.
>
> I tried this test, but I think generating 64k data with just CID
> messages will make the test case really big. I tried using multiple
> sessions such that one session makes the reorder buffer full but
> contains partial changes so that we try to stream another transaction
> but that is not possible in an automated test to consistently generate
> the partial change.
>

I also don't see a way to achieve it in an automated way because both toast and speculative inserts are part of one statement, so we need a real concurrent test to make it happen. Can anyone else think of a way to achieve it?

> I think we need something like this [1] so that we can better control
> the streaming.
>

+1. The additional advantage would be that we can generate parallel apply and new streaming tests with much less data. Shi-San, can you please start a new thread for the GUC patch proposed by you, as indicated by Dilip?

--
With Regards,
Amit Kapila.
Re: Missing MaterialPath support in reparameterize_path_by_child
On Fri, Dec 2, 2022 at 9:13 PM Tom Lane wrote:
>
> Ashutosh Bapat writes:
> > On Fri, Dec 2, 2022 at 8:25 AM Tom Lane wrote:
> >> Unfortunately, I don't have an example that produces such a
> >> failure against HEAD.  It seems certain to me that such cases
> >> exist, though, so I'd like to apply and back-patch the attached.
>
> > From this comment, that I wrote back when I implemented that function,
> > I wonder if we thought MaterialPath wouldn't appear on the inner side
> > of nestloop join. But that can't be the case. Or probably we didn't
> > find MaterialPath being there from our tests.
>
> >  * This function is currently only applied to the inner side of a nestloop
> >  * join that is being partitioned by the partitionwise-join code.  Hence,
> >  * we need only support path types that plausibly arise in that context.
>
> > But I think it's good to have MaterialPath there.
>
> So thinking about this a bit: the reason it is okay if reparameterize_path
> fails is that it's not fatal.  We just go on our way without making
> a parameterized path for that appendrel.  However, if
> reparameterize_path_by_child fails for every available child path,
> we end up with "could not devise a query plan", because the
> partitionwise-join code is brittle and won't tolerate failure
> to build a parent-join path.  Seems like we should be willing to
> fall back to a non-partitionwise join in that case.
>
>                         regards, tom lane

Partitionwise join should be willing to fall back to a non-partitionwise join in such a case. After spending a few minutes with the code, I think generate_partitionwise_join_paths() should not call set_cheapest() if the pathlist of the child is NULL, and should just wind up and avoid adding any path.

--
Best Wishes,
Ashutosh Bapat
Re: Allow placeholders in ALTER ROLE w/o superuser
> On Thu, Dec 1, 2022 at 6:14 AM Alexander Korotkov wrote:
> > On Wed, Nov 23, 2022 at 1:53 AM Steve Chavez wrote:
> > > So from my side this all looks good!
> >
> > Thank you for your feedback.
> >
> > The next revision of the patch is attached.  It contains code
> > improvements, comments and documentation.  I'm going to also write
> > some tests.  pg_db_role_setting doesn't seem to be well-covered with
> > tests.  I will probably need to write a new module into
> > src/tests/modules to check how placeholders interact with dynamically
> > defined GUCs.
>
> Another revision of the patch is attached.  It's fixed now that USER SET
> values can't be used for PGC_SUSET parameters.  Tests are added.  That
> required a new module test_pg_db_role_setting to check dynamically
> defined GUCs.

I've looked through the last version of the patch. The tests in v3 failed due to naming mismatches. I fixed this in v4 (PFA).

The other thing that may seem unexpected is whether the value should apply to the ordinary user only, encoded in the parameter name. The pro of this is that it doesn't break catalog compatibility by adding a separate field for GUC permissions, a concept that doesn't exist today (and maybe is not needed at all). Also, it makes the patch more minimalistic in the code. This is also fully compatible with the previous parameter naming, due to parentheses being an unsupported symbol for a parameter name.

I've also tried to revise the comments and docs a little bit to reflect the changes. The CI-enabled build of patch v4 for reference is at https://github.com/pashkinelfe/postgres/tree/placeholders-in-alter-role-v4

Overall the patch looks useful and good enough to be committed.

Kind regards,
Pavel Borisov, Supabase

v4-0001-USER-SET-parameters-for-pg_db_role_setting.patch
Description: Binary data
Re: Allow placeholders in ALTER ROLE w/o superuser
After posting the patch I've found my own typo in the docs, so I corrected it in v5 (PFA).

Regards,
Pavel.

v5-0001-USER-SET-parameters-for-pg_db_role_setting.patch
Description: Binary data
move some bitmapset.c macros to bitmapset.h
Over in [1], Masahiko and I found that using some bitmapset logic yields a useful speedup in one part of the proposed radix tree patch. In addition to what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own copies of the first two, and we could adapt that strategy again. But instead of three files defining these, it seems like it's time to consider moving them somewhere more central.

Attached is a simple form of this concept, including moving HAS_MULTIPLE_ONES just for consistency. One possible objection is the possibility of namespace clash. Thoughts?

I also considered putting the macros and typedefs in pg_bitutils.h. One organizational advantage is that pg_bitutils.h already offers convenience function symbols where the parameter depends on SIZEOF_SIZE_T, so putting the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a clear win, so I didn't go that far.

[1] https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com

 src/backend/nodes/bitmapset.c | 24 ------------------------
 src/backend/nodes/tidbitmap.c |  5 -----
 src/include/nodes/bitmapset.h | 24 ++++++++++++++++++++++++
 3 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3204b49738 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -26,33 +26,9 @@
 #include "port/pg_bitutils.h"
 
 
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*--
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *		xxxxxx10000
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *		yyyyyy01111
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *		yyyyyy10000
- * and then ANDing with the original value gives
- *		00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *--
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
 /* Select appropriate bit-twiddling functions for bitmap word size */
 #if BITS_PER_BITMAPWORD == 32
 #define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)

diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index a7a6b26668..8a1fd1d205 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -72,11 +72,6 @@
  */
 #define PAGES_PER_CHUNK (BLCKSZ / 32)
 
-/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */
-
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 /* number of active words for an exact page: */
 #define WORDS_PER_PAGE	((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
 /* number of active words for a lossy chunk: */

diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..8462282b9e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
+#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
+
+/*--
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *		xxxxxx10000
+ * where x's are unspecified bits.  The two's complement negative is formed
+ * by inverting all the bits and adding one.  Inversion gives
+ *		yyyyyy01111
+ * where each y is the inverse of the corresponding x.  Incrementing gives
+ *		yyyyyy10000
+ * and then ANDing with the original value gives
+ *		00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *--
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
+
 typedef struct Bitmapset
 {
 	pg_node_attr(custom_copy_equal, special_read_write)
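As a concrete check of the trick in that comment, here is a tiny standalone example in plain C (independent of the PostgreSQL typedefs; int32_t stands in for signedbitmapword):

```
#include <assert.h>
#include <stdint.h>

/* Isolate the rightmost one-bit: x & -x, assuming two's complement. */
#define RIGHTMOST_ONE(x) ((int32_t) (x) & -((int32_t) (x)))

int
main(void)
{
	assert(RIGHTMOST_ONE(0x2C) == 0x04);	/* 0b101100 -> 0b000100 */
	assert(RIGHTMOST_ONE(0x01) == 0x01);	/* already the rightmost bit */
	assert(RIGHTMOST_ONE(0) == 0);			/* the documented zero case */
	return 0;
}
```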
Re: move some bitmapset.c macros to bitmapset.h
On 2022-Dec-05, John Naylor wrote:

> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index b7b274aeff..3204b49738 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -26,33 +26,9 @@
>  #include "port/pg_bitutils.h"
>
>
> -#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
> -#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

In this location, nobody can complain about the naming of these macros, since they're just used to implement other bitmapset.c code. However, if you move them to the .h file, ISTM you should give them more meaningful names.

--
Álvaro Herrera         48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Oh, great altar of passive entertainment, bestow upon me thy discordant images at such speed as to render linear thought impossible" (Calvin a la TV)
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi hackers,

I've been working on (and struggling with) this patch for a while, but I haven't updated this thread regularly, so here is what I did with this patch so far.

> On Sat, 6 Aug 2022 at 16:01, Amit Kapila wrote:
> >
> > I think there is some basic flaw in slot reuse design. Currently, we
> > copy the table by starting a repeatable read transaction (BEGIN READ
> > ONLY ISOLATION LEVEL REPEATABLE READ) and create a slot that
> > establishes a snapshot which is first used for copy and then LSN
> > returned by it is used in the catchup phase after the copy is done.
> > The patch won't establish such a snapshot before a table copy as it
> > won't create a slot each time. If this understanding is correct, I
> > think we need to use ExportSnapshot/ImportSnapshot functionality to
> > achieve it or do something else to avoid the problem mentioned.

This issue that Amit mentioned causes some problems such as duplicated rows on the subscriber.

Basically, with this patch, the tablesync worker creates a replication slot only in its first run. To ensure table copy and sync are consistent with each other, the worker needs the correct snapshot and LSN, which are both returned by the slot create operation. Since this patch does not create a replication slot for each table copy and instead reuses the one created in the beginning, we do not get a new snapshot and LSN for each table anymore. The snapshot gets lost right after the transaction is committed, but the patch continues to use the same LSN for the next tables without the proper snapshot. In the end, for example, the worker might first copy some rows, then apply changes from the replication slot and insert those rows again for the tables in later iterations.

I discussed some possible ways to resolve this with Amit offline:

1- Copy all tables in one transaction so that we wouldn't need to deal with snapshots. Not easy to keep track of the progress. If the transaction fails, we would need to start all over again.

2- Don't lose the first snapshot (by keeping a transaction open with the snapshot imported, or some other way) and use the same snapshot and LSN for all tables. I'm not sure about the side effects of keeping a transaction open that long or using a snapshot that might be too old after some time. Still seems like it might work.

3- For each table, get a new snapshot and LSN by using an existing replication slot. Even though this approach wouldn't create a new replication slot, preparing the slot for a snapshot and then taking the snapshot may be costly. Need some numbers here to see how much this approach would improve the performance.

I decided to go with approach 3 (creating a new snapshot with an existing replication slot) for now since it would require less change in the tablesync worker logic than the other options would. To achieve this, this patch introduces a new command for the Streaming Replication Protocol. The new REPLICATION_SLOT_SNAPSHOT command basically mimics how CREATE_REPLICATION_SLOT creates a snapshot, but without actually creating a new replication slot. The tablesync worker calls this command if it decides not to create a new replication slot, and uses the snapshot created and LSN returned by the command.

Also: after the changes discussed here [1], concurrent replication origin drops by the apply worker and tablesync workers may hold each other on wait due to locks taken by replorigin_drop_by_name. I see that this harms the performance of logical replication quite a bit in terms of speed.

Even though reusing replication origins was discussed in this thread before, the patch didn't include any change to do so. The updated version of the patch now also reuses replication origins. It seems that even the changes to reuse origins by themselves improve the performance significantly.

Attached two patches:
0001: adds the REPLICATION_SLOT_SNAPSHOT command for the replication protocol.
0002: reuses workers/replication slots and origins for tablesync.

I would appreciate any feedback/review/thoughts on the approach and both patches. I will also share some numbers comparing the performance of the patch and the master branch soon.

[1] https://www.postgresql.org/message-id/flat/20220714115155.GA5439%40depesz.com

Best,
--
Melih Mutlu
Microsoft

0001-Add-replication-protocol-cmd-to-create-a-snapshot.patch
Description: Binary data

v4-0002-Reuse-Logical-Replication-Background-worker.patch
Description: Binary data
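To make the flow concrete, here is a sketch of how a tablesync worker's walsender session might use the new command. The exact REPLICATION_SLOT_SNAPSHOT grammar and result columns are defined by the 0001 patch; the invocation below is an assumed shape, shown only to illustrate the flow:

```
-- First table: create the slot once; this also builds a snapshot and
-- returns a consistent point LSN for the catch-up phase.
CREATE_REPLICATION_SLOT tablesync_slot_1 LOGICAL pgoutput (SNAPSHOT 'use');

-- Later tables: reuse the existing slot, asking only for a fresh
-- snapshot and LSN (hypothetical syntax; see the 0001 patch).
REPLICATION_SLOT_SNAPSHOT tablesync_slot_1;
```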
RE: suppressing useless wakeups in logical/worker.c
Dear Nathan,

Thank you for making the patch! I tested your patch, and it basically worked well.

About the following part:

```
		ConfigReloadPending = false;
		ProcessConfigFile(PGC_SIGHUP);

+		now = GetCurrentTimestamp();
+		for (int i = 0; i < NUM_LRW_WAKEUPS; i++)
+			LogRepWorkerComputeNextWakeup(i, now);
+
+		/*
+		 * If a wakeup time for starting sync workers was set, just set it
+		 * to right now. It will be recalculated as needed.
+		 */
+		if (next_sync_start != PG_INT64_MAX)
+			next_sync_start = now;
	}
```

Do we have to recalculate the NextWakeup when the subscriber receives a SIGHUP signal? I think this may cause an unexpected change like the following.

Assume that wal_receiver_timeout is 60s, wal_sender_timeout on the publisher is 0s (or the network between nodes is disconnected), and we send a SIGHUP signal to the subscriber's postmaster every 20s.

Currently the last_recv_time is calculated when the worker accepts messages, and the value is used for deciding to send a ping. The worker will exit if the walsender does not reply. But with your patch, the apply worker recalculates wakeup[LRW_WAKEUP_PING] and wakeup[LRW_WAKEUP_TERMINATE] when it gets SIGHUP, so the worker never sends a ping with requestReply = true, and never exits due to the timeout.

My case may seem crazy, but there may be other issues if this remains.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: HOT chain validation in verify_heapam()
On Fri, Dec 2, 2022 at 5:43 AM Andres Freund wrote:
>
> > +	/* Loop over offsets and validate the data in the predecessor array. */
> > +	for (OffsetNumber currentoffnum = FirstOffsetNumber; currentoffnum <= maxoff;
> > +		 currentoffnum = OffsetNumberNext(currentoffnum))
> > +	{
> > +		HeapTupleHeader pred_htup;
> > +		HeapTupleHeader curr_htup;
> > +		TransactionId pred_xmin;
> > +		TransactionId curr_xmin;
> > +		ItemId		pred_lp;
> > +		ItemId		curr_lp;
> > +		bool		pred_in_progress;
> > +		XidCommitStatus xid_commit_status;
> > +		XidBoundsViolation xid_status;
> > +
> > +		ctx.offnum = predecessor[currentoffnum];
> > +		ctx.attnum = -1;
> > +		curr_lp = PageGetItemId(ctx.page, currentoffnum);
> > +		if (!lp_valid[currentoffnum] || ItemIdIsRedirected(curr_lp))
> > +			continue;
>
> I don't think we should do PageGetItemId(ctx.page, currentoffnum); if
> !lp_valid[currentoffnum].
>

Fixed.

> > +		if (ctx.offnum == 0)
>
> For one, I think it'd be better to use InvalidOffsetNumber here. But more
> generally, storing the predecessor in ctx.offnum seems quite confusing.
>

Changed all relevant places to use InvalidOffsetNumber.

> > +		{
> > +			/*
> > +			 * No harm in overriding value of ctx.offnum as we will always
> > +			 * continue if we are here.
> > +			 */
> > +			ctx.offnum = currentoffnum;
> > +			if (TransactionIdIsInProgress(curr_xmin) || TransactionIdDidCommit(curr_xmin))
>
> Is it actually ok to call TransactionIdDidCommit() here? There's a reason
> get_xid_status() exists after all. And we do have the xid status for xmin
> already, so this could just check xid_commit_status, no?
>

I think it will be good to pass NULL to get_xid_status, like "get_xid_status(curr_xmin, &ctx, NULL);", so that we only check the xid status at the time when it is actually required. This way we can avoid checking the xid status in cases when we simply 'continue' due to some check.
--
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com

From 02f0d19bd82f390df8e3d732d60cf3eb0ae1dc97 Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya
Date: Mon, 5 Dec 2022 18:28:33 +0530
Subject: [PATCH v8] Implement HOT chain validation in verify_heapam()

Himanshu Upadhyaya, reviewed by Robert Haas, Aleksander Alekseev, Andres Freund
---
 contrib/amcheck/verify_heapam.c           | 278 ++++++++++++++++++++++++++++++
 src/bin/pg_amcheck/t/004_verify_heapam.pl | 193 ++++++++++++++++++++-
 2 files changed, 460 insertions(+), 11 deletions(-)

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index b72a5c96d1..ebad73d64b 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -399,6 +399,9 @@ verify_heapam(PG_FUNCTION_ARGS)
 	for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)
 	{
 		OffsetNumber maxoff;
+		OffsetNumber predecessor[MaxOffsetNumber] = {InvalidOffsetNumber};
+		OffsetNumber successor[MaxOffsetNumber] = {InvalidOffsetNumber};
+		bool		lp_valid[MaxOffsetNumber] = {false};
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -433,6 +436,8 @@ verify_heapam(PG_FUNCTION_ARGS)
 		for (ctx.offnum = FirstOffsetNumber; ctx.offnum <= maxoff;
 			 ctx.offnum = OffsetNumberNext(ctx.offnum))
 		{
+			OffsetNumber nextoffnum;
+
 			ctx.itemid = PageGetItemId(ctx.page, ctx.offnum);
 
 			/* Skip over unused/dead line pointers */
@@ -469,6 +474,13 @@ verify_heapam(PG_FUNCTION_ARGS)
 					report_corruption(&ctx,
 									  psprintf("line pointer redirection to unused item at offset %u",
											   (unsigned) rdoffnum));
+
+				/*
+				 * make entry in successor array, redirected tuple will be
+				 * validated at the time when we loop over successor array
+				 */
+				successor[ctx.offnum] = rdoffnum;
+				lp_valid[ctx.offnum] = true;
 				continue;
 			}
 
@@ -504,9 +516,268 @@ verify_heapam(PG_FUNCTION_ARGS)
 			/* It should be safe to examine the tuple's header, at least */
 			ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
 			ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
+			lp_valid[ctx.offnum] = true;
 
 			/* Ok, ready to check this next tuple */
 			check_tuple(&ctx);
+
+			/*
+			 * Add the data to the successor array if next updated tuple is in
+			 * the same page. It will be used later to generate the
+			 * predecessor array.
+			 *
+			 * We need to access the tuple's header to populate the
+			 * predecessor array. However the tuple is not necessarily sanity
+			 * checked yet so
Re: daitch_mokotoff module
Hi Ian,

Ian Lawrence Barwick writes:

> Hi Dag
>
> On Thu, 3 Feb 2022 at 23:27, Dag Lem wrote:
> >
> > Hi,
> >
> > Just some minor adjustments to the patch:
> >
> > * Removed call to locale-dependent toupper()
> > * Cleaned up input normalization
>
> This patch was marked as "Waiting on Author" in the CommitFest entry [1]
> but I see you provided an updated version which hasn't received any feedback,
> so I've moved this to the next CommitFest [2] and set it to "Needs Review".
>
> [1] https://commitfest.postgresql.org/40/3451/
> [2] https://commitfest.postgresql.org/41/3451/
>
> > I have been asked to sign up to review a commitfest patch or patches -
> > unfortunately I've been ill with COVID-19 and it's not until now that
> > I feel well enough to have a look.
> >
> > Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
> > as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
> > have been reviewed now).
> >
> > If there are other suggestions for a patch or patches to review for
> > someone new to PostgreSQL internals, I'd be grateful for that.
>
> I see you provided some feedback on https://commitfest.postgresql.org/36/3468/,
> though the patch seems to have not been accepted (but not conclusively rejected
> either). If you still have the chance to review another patch (or more) it would
> be much appreciated, as there's quite a few piling up. Things like documentation
> or small improvements to client applications are always a good place to start.
> Reviews can be provided at any time, there's no need to wait for the next
> CommitFest.

OK, I'll try to find another patch to review.

Regards

Dag Lem
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Mon, Dec 5, 2022 at 6:30 PM Melih Mutlu wrote:
>
> Attached two patches:
> 0001: adds the REPLICATION_SLOT_SNAPSHOT command for the replication protocol.
> 0002: reuses workers/replication slots and origins for tablesync.
>
> I would appreciate any feedback/review/thoughts on the approach and both
> patches. I will also share some numbers comparing the performance of the
> patch and the master branch soon.
>

It would be interesting to see the numbers separately for the reuse of replication slots and origins. This will let us know how much each of those optimizations helps beyond the reuse of workers.

--
With Regards,
Amit Kapila.
Re: Allow placeholders in ALTER ROLE w/o superuser
On Mon, Dec 5, 2022 at 2:27 PM Pavel Borisov wrote:
> After posting the patch I've found my own typo in docs. So corrected
> it in v5 (PFA).

The new revision of the patch is attached.

I've removed the mention of the "(s)" suffix from the "Server Configuration" docs section. I think it might be confusing since this suffix isn't a part of the variable name; it is only used for storage. Instead, I've added the description of this suffix to the catalog structure description and the psql documentation.

Also, I've added psql tab completion for the USER SET flag, and made some enhancements to comments, tests, and the commit message.

--
Regards,
Alexander Korotkov

0001-Add-USER-SET-parameter-values-for-pg_db_role_sett-v6.patch
Description: Binary data
Re: move some bitmapset.c macros to bitmapset.h
Alvaro Herrera writes:
> On 2022-Dec-05, John Naylor wrote:
> > -#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
> > -#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

> In this location, nobody can complain about the naming of these macros,
> since they're just used to implement other bitmapset.c code. However,
> if you move them to the .h file, ISTM you should give them more
> meaningful names.

IMV these are absolutely private to bitmapset.c. I reject the idea that they should be exposed publicly, under these names or any others.

Maybe we need some more bitmapset primitive functions? What is it you actually want to accomplish in the end?

			regards, tom lane
Re: Missing MaterialPath support in reparameterize_path_by_child
Ashutosh Bapat writes:
> partition-wise join should be willing to fallback to non-partitionwise
> join in such a case. After spending a few minutes with the code, I
> think generate_partitionwise_join_paths() should not call
> set_cheapest() is the pathlist of the child is NULL and should just
> wind up and avoid adding any path.

We clearly need to not call set_cheapest(), but that's not sufficient; we still fail at higher levels, as you'll see if you try the example Richard found. I ended up making fe12f2f8f to fix this.

I don't especially like "rel->nparts = 0" as a way of disabling partitionwise join; ISTM it'd be clearer and more flexible to reset consider_partitionwise_join instead of destroying the data structure. But that's the way it's being done elsewhere, and I didn't want to tamper with it in a bug fix. I see various assertions about parent and child consider_partitionwise_join flags being equal, which we might have to revisit if we try to make it work that way.

			regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
Hi, Alexander! On Mon, 5 Dec 2022 at 17:51, Alexander Korotkov wrote: > > On Mon, Dec 5, 2022 at 2:27 PM Pavel Borisov wrote: > > After posting the patch I've found my own typo in docs. So corrected > > it in v5 (PFA). > > The new revision of the patch is attached. > > I've removed the mention of "(s)" suffix from the "Server > Configuration" docs section. I think it might be confusing since this > suffix isn't a part of the variable name. It is only used for storage. > Instead, I've added the description of this suffix to the catalog > structure description and psql documentation. > > Also, I've added psql tab completion for the USER SET flag, and made > some enhancements to comments, tests, and commit message. The changes in expected test results were somehow lost in v6, so I've corrected them in v7. Otherwise, I've looked through the updated patch and it is good. Regards, Pavel. v7-0001-Add-USER-SET-parameter-values-for-pg_db_role_sett.patch Description: Binary data
ANY_VALUE aggregate
The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. PFA an implementation of this aggregate. Ideally, the transition function would stop being called after the first non-null was found, and then the entire aggregation would stop when all functions say they are finished[*], but this patch does not go anywhere near that far. This patch is based off of commit fb958b5da8. [*] I can imagine something like array_agg(c ORDER BY x LIMIT 5) to get the top five of something without going through a LATERAL subquery. -- Vik FearingFrom 7465fac12fc636ff26088ae31de2937f7c3a459f Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 9 Apr 2022 00:07:38 +0200 Subject: [PATCH] Implement ANY_VALUE aggregate SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an implementation-dependent (i.e. non-deterministic) value from the aggregated rows. --- doc/src/sgml/func.sgml | 14 ++ src/backend/utils/adt/misc.c | 12 src/include/catalog/pg_aggregate.dat | 4 src/include/catalog/pg_proc.dat | 8 src/test/regress/expected/aggregates.out | 18 ++ src/test/regress/sql/aggregates.sql | 5 + 6 files changed, 61 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2052d3c844..1823ee71d7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19706,16 +19706,30 @@ SELECT NULLIF(value, '(none)') ... Description Partial Mode + + + + any_value + +any_value ( "any" ) +same as input type + + +Chooses a non-deterministic value from the non-null input values. + + Yes + + array_agg array_agg ( anynonarray ) anyarray diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 9c13251231..94c92de06d 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -928,8 +928,20 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS) idxoid = RelationGetReplicaIndex(rel); table_close(rel, AccessShareLock); if (OidIsValid(idxoid)) PG_RETURN_OID(idxoid); else PG_RETURN_NULL(); } + +Datum +any_value_trans(PG_FUNCTION_ARGS) +{ + /* Return the first non-null argument */ + if (!PG_ARGISNULL(0)) + PG_RETURN_DATUM(PG_GETARG_DATUM(0)); + if (!PG_ARGISNULL(1)) + PG_RETURN_DATUM(PG_GETARG_DATUM(1)); + PG_RETURN_NULL(); +} + diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index b9110a5298..37626d6f0c 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -620,9 +620,13 @@ aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'cume_dist_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, { aggfnoid => 'dense_rank(any)', aggkind => 'h', aggnumdirectargs => '1', aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, +# any_value +{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans', + aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' }, + ] diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index f9301b2627..2ee4797559 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11849,9 +11849,17 @@ proname => 'brin_minmax_multi_summary_recv', provolatile => 's', prorettype => 'pg_brin_minmax_multi_summary', proargtypes => 'internal', prosrc => 'brin_minmax_multi_summary_recv' }, { oid => '4641', descr => 'I/O', 
proname => 'brin_minmax_multi_summary_send', provolatile => 's', prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary', prosrc => 'brin_minmax_multi_summary_send' }, +{ oid => '8981', descr => 'arbitrary value from among input values', + proname => 'any_value', prokind => 'a', proisstrict => 'f', + prorettype => 'anyelement', proargtypes => 'anyelement', + prosrc => 'aggregate_dummy' }, +{ oid => '8982', descr => 'any_value transition function', + proname => 'any_value_trans', prorettype => 'anyelement', proargtypes => 'anyelement anyelement', + prosrc => 'any_value_trans' }, + ] diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index fc2bd40be2..fb87b9abf1 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -20,16 +20,28 @@ SELECT avg(four) AS avg_1 FROM onek; (1 row) SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100; avg_32 - 32.666
Re: pg_basebackup: add test about zstd compress option
On Fri, Dec 2, 2022 at 11:29 PM Ian Lawrence Barwick wrote: > Though on reflection maybe it's overkill and the existing tests > suffice. Anyway leaving the patch here in the interests of pushing > this forward in some direction. Do you think that there is a scenario where 008_untar.pl and 009_extract.pl pass but this test fails, alerting us to a problem that would otherwise have gone undetected? If so, what is that scenario? The only thing that I can think of would be if $decompress_program --test were failing, but actually trying to decompress succeeded. I would be inclined to dismiss that particular scenario as not important enough to be worth the additional CPU cycles. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_dump: Remove "blob" terminology
On Sat, Dec 3, 2022 at 11:07 AM Andrew Dunstan wrote: > > Well, what this would lose is the ability to selectively restore > > individual large objects using "pg_restore -L". I'm not sure who > > out there might be depending on that, but if we assume that's the > > empty set I fear we'll find out it's not. So a workaround switch > > seemed possibly worth the trouble. I don't have a position yet > > on which way ought to be the default. > > OK, fair point. I suspect making the batched mode the default would gain > more friends than enemies. A lot of people probably don't know that selective restore even exists but it is an AWESOME feature and I'd hate to see us break it, or even just degrade it. I wonder if we can't find a better solution than bunching TOC entries together. Perhaps we don't need every TOC entry in memory simultaneously, for instance, especially things like LOBs that don't have dependencies. -- Robert Haas EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Fri, Sep 10, 2021 at 5:07 PM Tomas Vondra wrote: > Turns out it's pretty difficult to benchmark this, because the results > strongly depend on what the backend did before. What you report here seems to be mostly cold-cache effects, with which I don't think we need to be overly concerned. We don't want big regressions in the cold-cache case, but there is always going to be some overhead when a new backend starts up, because you've got to fault some pages into the heap/malloc arena/whatever before they can be efficiently accessed. What would be more concerning is if we found out that the performance depended heavily on the internal state of the allocator. For example, suppose you have two warmup routines W1 and W2, each of which touches the same amount of total memory, but with different details. Then you have a benchmark B. If you do W1-B and W2-B and the time for B varies dramatically between them, then you've maybe got an issue. For instance, it could indicate that the allocator has issues when the old and new allocations are very different sizes, or something like that. -- Robert Haas EDB: http://www.enterprisedb.com
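A minimal harness for the W1-B/W2-B experiment described above might look like this sketch. The warmup patterns, sizes, and the workload B are all invented stand-ins, and error checking is omitted:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Touch `total` bytes in chunks of `chunk` bytes; the memory is leaked
 * on purpose, since the point is to leave the allocator in a given
 * internal state before B runs. */
static void
warmup(size_t chunk, size_t total)
{
	for (size_t done = 0; done < total; done += chunk)
		memset(malloc(chunk), 0x7f, chunk);
}

/* The benchmark B: an arbitrary allocation-heavy loop. */
static double
run_benchmark(void)
{
	clock_t		start = clock();

	for (int i = 0; i < 1000000; i++)
		free(malloc(64));

	return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int
main(int argc, char **argv)
{
	/* Run the binary twice, once per warmup, and compare B's time. */
	if (argc > 1 && strcmp(argv[1], "w2") == 0)
		warmup(64, 64 * 1024 * 1024);	/* W2: many small allocations */
	else
		warmup(8 * 1024 * 1024, 64 * 1024 * 1024);	/* W1: a few large ones */

	printf("B took %.3f seconds\n", run_benchmark());
	return 0;
}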
Re: [PoC] Reducing planning time when tables have many partitions
On Sun, 4 Dec 2022 at 00:35, David Rowley wrote: > > On Tue, 29 Nov 2022 at 21:59, Yuya Watari wrote: > > Thank you for testing the patch with an actual query. This speedup is > > very impressive. When I used an original query with 1024 partitions, > > its planning time was about 200ms. Given that each partition is also > > partitioned in your workload, I think the result of 1415ms is > > reasonable. > > I was looking again at the v9-0001 patch and I think we can do a > little better when building the Bitmapset of matching EMs. For > example, in the v9 patch, the code for get_ecmember_indexes_strict() > is doing: > > + if (!with_children) > + matching_ems = bms_copy(ec->ec_nonchild_indexes); > + else > + matching_ems = bms_copy(ec->ec_member_indexes); > + > + i = -1; > + while ((i = bms_next_member(relids, i)) >= 0) > + { > + RelOptInfo *rel = root->simple_rel_array[i]; > + > + matching_ems = bms_int_members(matching_ems, > rel->eclass_member_indexes); > + } > > It seems reasonable that if there are a large number of partitions > then ec_member_indexes will have a large number of Bitmapwords. When > we do bms_int_members() on that, we're going to probably end up with a > bunch of trailing zero words in the set. In the v10 patch, I've > changed this to become: > > +inti = bms_next_member(relids, -1); > + > +if (i >= 0) > +{ > +RelOptInfo *rel = root->simple_rel_array[i]; > + > +/* > + * bms_intersect to the first relation to try to keep the resulting > + * Bitmapset as small as possible. This saves having to make a > + * complete bms_copy() of one of them. One may contain significantly > + * more words than the other. > + */ > +if (!with_children) > +matching_ems = bms_intersect(rel->eclass_member_indexes, > + ec->ec_nonchild_indexes); > +else > +matching_ems = bms_intersect(rel->eclass_member_indexes, > + ec->ec_member_indexes); > + > +while ((i = bms_next_member(relids, i)) >= 0) > +{ > +rel = root->simple_rel_array[i]; > +matching_ems = bms_int_members(matching_ems, > + rel->eclass_member_indexes); > +} > +} > > so, effectively we first bms_intersect to the first member of relids > before masking out the bits for the remaining ones. This should mean > we'll have a Bitmapset with fewer words in many complex planning > problems. There's no longer the dilemma of having to decide if we > should start with RelOptInfo's eclass_member_indexes or the > EquivalenceClass's member indexes. When using bms_int_member, we > really want to start with the smallest of those so we get the smallest > resulting set. With bms_intersect(), it will always make a copy of > the smallest set. v10 does that instead of bms_copy()ing the > EquivalenceClass's member's Bitmapset. > > I also wondered how much we're losing to the fact that > bms_int_members() zeros the trailing words and does not trim the > Bitmapset down. > > The problem there is 2-fold; > 1) we have to zero the trailing words on the left input. That'll > pollute the CPU cache a bit as it may have to fetch a bunch of extra > cache lines, and; > 2) subsequent bms_int_members() done afterwards may have to mask out > additional words. If we can make the shortest input really short, then > subsequent bms_int_members() are going to be very fast. 
> > You might argue there that setting nwords to the shortest length may > cause us to have to repalloc the Bitmapset if we need to later add > more members again, but if you look at the repalloc() code, it's > effectively a no-op when the allocated size >= the requested size, so > repalloc() should be very fast in this case. So, worst case, there's > an additional "no-op" repalloc() (which should be very fast) followed > by maybe a bms_add_members() which has to zero the words instead of > bms_int_members(). I changed this in the v10-0002 patch. I'm not sure > if we should do this or not. > > I also changed v10-0001 so that we still store the EquivalenceClass's > members list. There were a few places where the code just wanted to > get the first member and having to look at the Bitmapset index and > fetch the first match from PlannerInfo seemed convoluted. If the > query is simple, it seems like it's not going to be very expensive to > add a few EquivalenceMembers to this list. When planning more complex > problems, there's probably enough other extra overhead that we're > unlikely to notice the extra lappend()s. This also allows v10-0003 to > work, see below. > > In v10-0003, I experimented with the iterator concept that I mentioned > earlier. Since v10-0001 is now storing the EquivalenceMember list in > EquivalenceClass again, it's now quite simple to have the iterator > decide if it should be scanning the inde
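For readers following along, the v10-0002 idea of shrinking nwords after an intersection can be sketched as below. The Bitmapset fields (nwords, words[]) are real; the helper itself is illustrative, not the patch:

/*
 * Drop trailing zero words so that later bms_int_members() calls have
 * fewer words to visit and to zero. A later bms_add_member() past the
 * trimmed end just repalloc()s, which is nearly a no-op when the
 * underlying allocation is already big enough.
 */
static void
bms_trim_trailing_zeroes(Bitmapset *a)
{
	int			last = a->nwords - 1;

	while (last > 0 && a->words[last] == 0)
		last--;
	a->nwords = last + 1;
}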
Re: Collation version tracking for macOS
On Sun, Dec 4, 2022 at 10:12 PM Thomas Munro wrote: > My tentative votes are: > > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while still > making it super clear what is going on. This would include adding DDL > commands so you can do ALTER DATABASE/COLLATION ... PROVIDER = ICU63 > to make warnings go away. +1. I wouldn't lose any sleep if we picked a different non-magical option, but I think this is probably my favorite of the explicit-library-version options (though it is close) and I like it better than search-by-collversion. (It's possible that I'm wrong to like it better, but I do.) > 2. I think we should ignore minor versions for now (other than > reporting them in the relevant introspection functions), but not make > any choices that would prevent us from changing our mind about that in > a later release. For example, having two levels of specificity ICU > and ICU68 in the libver-in-provider-name design wouldn't preclude us > from adding support for ICU68_2 later +1. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback Oh, I like that, especially errfeedback. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pg_dump: Remove "blob" terminology
Robert Haas writes: > I wonder if we can't find a better solution than bunching TOC entries > together. Perhaps we don't need every TOC entry in memory > simultaneously, for instance, especially things like LOBs that don't > have dependencies. Interesting idea. We'd have to either read the TOC multiple times, or shove the LOB TOC entries into a temporary file, either of which has downsides. regards, tom lane
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 3:42 AM Fabien COELHO wrote: > > I had noticed that most getopt() or getopt_long() calls had their letter > > lists in pretty crazy orders. There might have been occasional attempts > > at grouping, but those then haven't been maintained as new options were > > added. To restore some sanity to this, I went through and ordered them > > alphabetically. > > I agree that a more or less random historical order does not make much > sense. > > For pgbench, ISTM that sorting per functionality then alphabetical would > be better than pure alphabetical because it has 2 modes. Such sections > might be (1) general (2) connection (3) common/shared (4) initialization > and (5) benchmarking, with some comments on each. I don't see the value in this. Grouping related options often makes sense, but it seems more confusing than helpful in the case of a getopt string. +1 for Peter's proposal to just alphabetize. That's easy to maintain, at least in theory. -- Robert Haas EDB: http://www.enterprisedb.com
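As a trivial illustration of the alphabetical convention (an invented option set, not any real PG tool):

#include <getopt.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	int			c;

	/* Letters stay strictly alphabetical; each ':' stays attached to
	 * the option it modifies, so duplicates and gaps are easy to spot. */
	while ((c = getopt(argc, argv, "ac:dF:hp:qU:v")) != -1)
	{
		switch (c)
		{
			case 'c':
			case 'F':
			case 'p':
			case 'U':
				printf("option -%c = %s\n", c, optarg);
				break;
			case '?':
				return 1;
			default:
				printf("option -%c\n", c);
				break;
		}
	}
	return 0;
}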
Re: Error-safe user functions
Robert Haas writes: > On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: >> 2. ereturn_* => errfeedback / error_feedback / feedback > Oh, I like that, especially errfeedback. efeedback? But TBH I do not think any of these are better than ereturn. Whether or not you agree with my position that it'd be best if the new macro name is the same length as "ereport", I hope we can all agree that it had better be short. ereport call nests already tend to contain quite long lines. We don't need to add another couple tab-stops worth of indentation there. As for it being the same length: if you take a close look at my 0002 patch, you will realize that replacing ereport with a different-length name will double or triple the number of lines that need to be touched in many input functions. Yeah, we could sweep that under the rug to some extent by submitting non-pgindent'd patches and running a separate pgindent commit later, but that is not without pain. I don't want to go there for the sake of a name that isn't really compellingly The Right Word, and "feedback" just isn't very compelling IMO. regards, tom lane
Re: Order getopt arguments
Robert Haas writes: > +1 for Peter's proposal to just alphabetize. That's easy to maintain, > at least in theory. Agreed for single-letter options. Long options complicate matters: are we going to order their code stanzas by the actual long name, or by the character/number returned by getopt? Or are we going to be willing to repeatedly renumber the assigned codes to keep those the same? I don't think I want to go that far. regards, tom lane
Re: pg_dump: Remove "blob" terminology
On Mon, Dec 5, 2022 at 10:56 AM Tom Lane wrote: > Interesting idea. We'd have to either read the TOC multiple times, > or shove the LOB TOC entries into a temporary file, either of which > has downsides. I wonder if we'd need to read the TOC entries repeatedly, or just incrementally. I haven't studied the pg_dump format much, but I wonder if we could arrange things so that the "boring" entries without dependencies, or maybe the LOB entries specifically, are grouped together in some way where we know the byte-length of that section and can just skip over it. Then when we need them we go back and read 'em one by one. (Of course this doesn't work if reading from standard input or if multiple phases of processing need them or whatever. Not trying to say I've solved the problem. But in general I think we need to give more thought to the possibility that "just read all the data into memory" is not an adequate solution to $PROBLEM.) -- Robert Haas EDB: http://www.enterprisedb.com
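A sketch of the seek-past idea: a length-prefixed section of "boring" entries that a reader can skip and revisit later. The format details and function names are invented, not pg_dump's actual archive format:

#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>

/*
 * Writer side: reserve space for a section length, emit the
 * dependency-free entries, then backpatch the real length. A reader
 * that doesn't need the entries yet can fread() the length and
 * fseeko() past the whole section, coming back later to read them
 * one by one.
 */
static void
write_skippable_section(FILE *fp, void (*emit_entries) (FILE *))
{
	uint64_t	len = 0;
	off_t		len_pos = ftello(fp);
	off_t		end_pos;

	fwrite(&len, sizeof(len), 1, fp);	/* placeholder */
	emit_entries(fp);
	end_pos = ftello(fp);

	len = (uint64_t) (end_pos - len_pos - (off_t) sizeof(len));
	fseeko(fp, len_pos, SEEK_SET);
	fwrite(&len, sizeof(len), 1, fp);	/* backpatch the real length */
	fseeko(fp, end_pos, SEEK_SET);
}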
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > Robert Haas writes: > > On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > > wrote: > >> 2. ereturn_* => errfeedback / error_feedback / feedback > > > Oh, I like that, especially errfeedback. > > efeedback? But TBH I do not think any of these are better than ereturn. I do. Having a macro name that is "return" plus one character is going to make people think that it returns. I predict that if you insist on using that name people are still going to be making mistakes based on that confusion 10 years from now. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 11:14 AM Tom Lane wrote: > Robert Haas writes: > > +1 for Peter's proposal to just alphabetize. That's easy to maintain, > > at least in theory. > > Agreed for single-letter options. Long options complicate matters: > are we going to order their code stanzas by the actual long name, or > by the character/number returned by getopt? Or are we going to be > willing to repeatedly renumber the assigned codes to keep those the > same? I don't think I want to go that far. I was only talking about the actual argument to getopt(), not the order of the code stanzas. I'm not sure what we ought to do about the latter. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 11:20, Robert Haas wrote: > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: >> Robert Haas writes: >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker >>> wrote: 2. ereturn_* => errfeedback / error_feedback / feedback >>> Oh, I like that, especially errfeedback. >> efeedback? But TBH I do not think any of these are better than ereturn. > I do. Having a macro name that is "return" plus one character is going > to make people think that it returns. I predict that if you insist on > using that name people are still going to be making mistakes based on > that confusion 10 years from now. > OK, I take both this point and Tom's about trying to keep it the same length. So we need something that's 7 letters, doesn't say 'return' and preferably begins with 'e'. I modestly suggest 'eseterr', or if we like the 'feedback' idea 'efeedbk'. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Order getopt arguments
Robert Haas writes: > I was only talking about the actual argument to getopt(), not the > order of the code stanzas. I'm not sure what we ought to do about the > latter. 100% agreed that the getopt argument should just be alphabetical. But the bulk of Peter's patch is rearranging switch cases to agree with that, and if you want to do that then you have to also think about long options, which are not in the getopt argument. I'm not entirely convinced that reordering the switch cases is worth troubling over. regards, tom lane
Re: Error-safe user functions
On 12/5/22 11:36, Andrew Dunstan wrote: On 2022-12-05 Mo 11:20, Robert Haas wrote: On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: Robert Haas writes: On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker wrote: 2. ereturn_* => errfeedback / error_feedback / feedback Oh, I like that, especially errfeedback. efeedback? But TBH I do not think any of these are better than ereturn. I do. Having a macro name that is "return" plus one character is going to make people think that it returns. I predict that if you insist on using that name people are still going to be making mistakes based on that confusion 10 years from now. OK, I take both this point and Tom's about trying to keep it the same length. So we need something that's 7 letters, doesn't say 'return' and preferably begins with 'e'. I modestly suggest 'eseterr', or if we like the 'feedback' idea 'efeedbk'. Maybe eretort? -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Order getopt arguments
On Mon, Dec 5, 2022 at 11:51 AM Tom Lane wrote: > Robert Haas writes: > > I was only talking about the actual argument to getopt(), not the > > order of the code stanzas. I'm not sure what we ought to do about the > > latter. > > 100% agreed that the getopt argument should just be alphabetical. > But the bulk of Peter's patch is rearranging switch cases to agree > with that, and if you want to do that then you have to also think > about long options, which are not in the getopt argument. I'm > not entirely convinced that reordering the switch cases is worth > troubling over. I'm not particularly sold on that either, but neither am I particularly opposed to it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
Robert Haas writes: > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: >> efeedback? But TBH I do not think any of these are better than ereturn. > I do. Having a macro name that is "return" plus one character is going > to make people think that it returns. But it does return, or at least you need to code on the assumption that it will. (The cases where it doesn't aren't much different from any situation where a called subroutine unexpectedly throws an error. Callers typically don't have to consider that.) regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
I couldn't find any discussion of the idea of adding "(s)" to the variable name in order to mark the variable userset in the catalog, and I have to admit I find it a bit strange. Are we really agreed that that's the way to proceed? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
Re: Error-safe user functions
Joe Conway writes: > On 12/5/22 11:36, Andrew Dunstan wrote: >> OK, I take both this point and Tom's about trying to keep it the same >> length. So we need something that's 7 letters, doesn't say 'return' and >> preferably begins with 'e'. I modestly suggest 'eseterr', or if we like >> the 'feedback' idea 'efeedbk'. > Maybe eretort? Nah, it's so close to ereport that it looks like a typo. eseterr isn't awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. regards, tom lane
Re: Allow placeholders in ALTER ROLE w/o superuser
Alvaro Herrera writes: > I couldn't find any discussion of the idea of adding "(s)" to the > variable name in order to mark the variable userset in the catalog, and > I have to admit I find it a bit strange. Are we really agreed that > that's the way to proceed? I hadn't been paying close attention to this thread, sorry. I agree that that seems like a very regrettable choice, especially if you anticipate having to bump catversion anyway. Better to add a bool column to the catalog. regards, tom lane
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:09 PM Tom Lane wrote: > But it does return, or at least you need to code on the assumption > that it will. (The cases where it doesn't aren't much different > from any situation where a called subroutine unexpectedly throws > an error. Callers typically don't have to consider that.) Are you just trolling me here? AIUI, the macro never returns in the sense of using the return statement, unlike PG_RETURN_WHATEVER(), which do. It possibly transfers control by throwing an error. But that is also true of just about everything you do in PostgreSQL code, because errors can get thrown from almost anywhere. So clearly the possibility of a non-local transfer of control is not the issue here. The issue is the possibility that there will be NO transfer of control. That is, you are compelled to write ereturn() and then afterwards you still need a return statement. I do not understand how it is possible to sensibly argue that someone won't see a macro called ereturn() and perhaps come to the false conclusion that it will always return. -- Robert Haas EDB: http://www.enterprisedb.com
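To spell out the hazard as Robert reads it, here is a sketch in the style of the patch under discussion. None of this is committed API, and the validity check and parser are invented:

/*
 * If ereturn() only records a soft error and falls through, the caller
 * still needs its own return statement afterwards, which is what makes
 * the name feel misleading.
 */
Datum
widget_in(PG_FUNCTION_ARGS)
{
	char	   *str = PG_GETARG_CSTRING(0);

	if (!widget_syntax_ok(str))		/* hypothetical validity check */
	{
		ereturn(fcinfo->context,
				(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
				 errmsg("invalid widget: \"%s\"", str)));
		PG_RETURN_NULL();			/* still reached in the soft-error case */
	}

	PG_RETURN_POINTER(widget_parse(str));	/* hypothetical parser */
}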
Re: Error-safe user functions
I wrote: > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. ... "errsave", maybe? regards, tom lane
Re: suppressing useless wakeups in logical/worker.c
On Mon, Dec 05, 2022 at 01:00:19PM +0000, Hayato Kuroda (Fujitsu) wrote: > But in your patch, the apply worker calculates wakeup[LRW_WAKEUP_PING] and > wakeup[LRW_WAKEUP_TERMINATE] again when it gets SIGHUP, so the worker never > sends > ping with requestReply = true, and never exits due to the timeout. This is the case for the walreceiver patch, too. If a SIGHUP arrives just before we are due to ping the server, the ping wakeup time will be pushed back. To me, this seems unlikely to cause any issues in practice unless the server is being constantly SIGHUP'd. If we wanted to fix this, we'd probably need to recompute the wakeup times using the values currently set. I haven't looked into this too closely, but it doesn't sound tremendously difficult. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
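Roughly what "recompute using the values currently set" could look like in the worker's SIGHUP handling. The wakeup[] array and enum names follow the patch under discussion, last_recv_time is an assumed variable, and only wal_receiver_timeout, ConfigReloadPending, and the two helper calls are real:

/*
 * Sketch: rebase the deadlines on the last successful receipt time
 * rather than "now", so a reload cannot keep pushing them into the
 * future.
 */
if (ConfigReloadPending)
{
	ConfigReloadPending = false;
	ProcessConfigFile(PGC_SIGHUP);

	wakeup[LRW_WAKEUP_PING] =
		TimestampTzPlusMilliseconds(last_recv_time, wal_receiver_timeout / 2);
	wakeup[LRW_WAKEUP_TERMINATE] =
		TimestampTzPlusMilliseconds(last_recv_time, wal_receiver_timeout);
}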
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:27 PM Tom Lane wrote: > I wrote: > > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. > > ... "errsave", maybe? eseterr or errsave seem totally fine to me, FWIW. I would probably choose a more verbose name if I were doing it, but I do get the point that keeping line lengths reasonable is important, and if someone were to accuse me of excessive prolixity, I would be unable to mount much of a defense. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 12/5/22 12:35, Robert Haas wrote: On Mon, Dec 5, 2022 at 12:27 PM Tom Lane wrote: I wrote: > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. ... "errsave", maybe? eseterr or errsave seem totally fine to me, FWIW. +1 I would probably choose a more verbose name if I were doing it, but I do get the point that keeping line lengths reasonable is important, and if someone were to accuse me of excessive prolixity, I would be unable to mount much of a defense. prolixity -- nice word! I won't comment on its applicability to you in particular ;-P -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Request to modify view_table_usage to include materialized views
Hello, I think this is the correct mailing list for feature/modification requests. If not please let me know which mailing list I should use.

Would it be possible to modify the information_schema.view_table_usage (VTU) view to include materialized views? (https://www.postgresql.org/docs/current/infoschema-view-table-usage.html)

Currently when querying VTU, if the view you're interested in queries a materialized view, then it doesn't show up in VTU. For example, I was trying to determine which tables/views made up a particular view:

-- View is present in pg_views
drps=> select schemaname, viewname, viewowner
drps->   from pg_views
drps->  where viewname = 'platform_version_v';
 schemaname |      viewname      | viewowner
------------+--------------------+-----------
 event      | platform_version_v | drps

-- Check view_table_usage for objects that are queried by the
-- platform_version_v view, but it doesn't find any:
drps=> select *
drps->   from information_schema.view_table_usage
drps->  where view_name = 'platform_version_v';
 view_catalog | view_schema | view_name | table_catalog | table_schema | table_name
--------------+-------------+-----------+---------------+--------------+------------
(0 rows)

I looked at the pg_views.definition column for platform_version_v, and it is querying a materialized view. The source code for the information_schema.view_table_usage view is at https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/catalog/information_schema.sql;h=18725a02d1fb6ffda3d218033b972a0ff23aac3b;hb=HEAD#l2605

If I change lines 2605 and 2616 to:

2605: AND v.relkind in ('v','m')
2616: AND t.relkind IN ('r', 'v', 'f', 'p','m')

and compile the modified version of VTU in my test schema, then I see the MV that is used in the query of the platform_version_v view:

drps=> select *
drps->   from test.view_table_usage
drps->  where view_name = 'platform_version_v';
 view_catalog | view_schema |     view_name      | table_catalog | table_schema |      table_name
--------------+-------------+--------------------+---------------+--------------+----------------------
 drps         | event       | platform_version_v | drps          | event        | platform_version_mv

My method of changing those 2 lines of code may not be the best or correct solution, it's just to illustrate what I'm looking for. Thanks! Jon
Re: Failed Assert while pgstat_unlink_relation
Hi, On 2022-12-05 15:20:55 +0900, Kyotaro Horiguchi wrote: > The in-xact created relation t1 happened to be scanned during the > CREATE RULE and a stats entry is attached. So the stats entry loses t1 > at roll-back, then crashes. Thus, if I understand it correctly, it > seems to me that just unlinking the stats from t1 (when relkind is > changed) works. > > But the fix doesn't change the behavior in relkind-not-changing > cases. If an in-xact-created table gets a stats entry, the > relcache entry for t1 is refreshed to a table relation again, and then the > transaction rolls back, a crash will happen for the same reason. I'm not > sure if there is such a case actually. We unlink the stats in that case already. See RelationDestroyRelation(). > When I tried to check that behavior further, I found that > CREATE ROLE is no longer allowed... I assume you mean RULE, not ROLE? It should still work in 15. Greetings, Andres Freund
Re: Collation version tracking for macOS
On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > 1. I think we should seriously consider provider = ICU63. I still > think search-by-collversion is a little too magical, even though it > clearly can be made to work. Of the non-magical systems, I think > encoding the choice of library into the provider name would avoid the > need to add a second confusing "X_version" concept alongside our > existing "X_version" columns in catalogues and DDL syntax, while > still > making it super clear what is going on. As I understand it, this is #2 in your previous list? Can we put the naming of the provider into the hands of the user, e.g.: CREATE COLLATION PROVIDER icu63 TYPE icu AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; In this model, icu would be a "provider kind" and icu63 would be the specific provider, which is named by the user. That seems like the least magical approach, to me. We need an ICU library; the administrator gives us one that looks like ICU; and we're happy. It avoids a lot of the annoyances we're discussing, and puts the power in the hands of the admin. If they want to allow minor version updates, they specify the library with .so.63, and let the symlinking handle it. Of course, we can still do some sanity checks (WARNINGs or ERRORs) when we think something is going wrong; like the version of ICU is too new, or the reported version (ucol_getVersion()) doesn't match what's in collversion. But we basically get out of the business of understanding ICU versioning and leave that up to the administrator. It's easier to document, and would require fewer GUCs (if any). And it avoids mixing version information from another project into our data model. -- Jeff Davis PostgreSQL Contributor Team - AWS
Re: Error-safe user functions
On 2022-Dec-05, Tom Lane wrote: > I wrote: > > Nah, it's so close to ereport that it looks like a typo. eseterr isn't > > awful, perhaps. Or maybe err<something>, but I've not thought of suitable <something>. > > ... "errsave", maybe? IMO eseterr is quite awful while errsave is not, so here goes my vote for the latter. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Tom: There seems to be something broken here. Teodor: I'm in sackcloth and ashes... Fixed. http://archives.postgresql.org/message-id/482d1632.8010...@sigaev.ru
Re: MemoizePath fails to work for partitionwise join
Richard Guo writes: > The fix is straightforward, just to use outerrel->top_parent if it is > not null and leave the reparameterization work to try_nestloop_path. In > addition, I believe when reparameterizing MemoizePath we need to adjust > its param_exprs too. Right you are. I'd noticed the apparent omission in reparameterize_path_by_child, but figured that we'd need a test case to find any other holes before it'd be worth fixing. Thanks for finding a usable test case. One small problem is that top_parent doesn't exist in the back branches, so I had to substitute a much uglier lookup in order to make this work there. I'm surprised that we got away without top_parent for this long TBH, but anyway this fix validates the wisdom of 2f17b5701. So, pushed with some cosmetic adjustments and the modified back-branch code. regards, tom lane
Re: Error-safe user functions
Robert Haas writes: > AIUI, the macro never returns in the sense of using the return > statement, unlike PG_RETURN_WHATEVER(), which do. Oh! Now I see what you don't like about it. I thought you meant "return to the call site", not "return to the call site's caller". Agreed that that could be confusing. regards, tom lane
Re: Collation version tracking for macOS
On 12/5/22 12:41, Jeff Davis wrote: On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: 1. I think we should seriously consider provider = ICU63. I still think search-by-collversion is a little too magical, even though it clearly can be made to work. Of the non-magical systems, I think encoding the choice of library into the provider name would avoid the need to add a second confusing "X_version" concept alongside our existing "X_version" columns in catalogues and DDL syntax, while still making it super clear what is going on. As I understand it, this is #2 in your previous list? Can we put the naming of the provider into the hands of the user, e.g.: CREATE COLLATION PROVIDER icu63 TYPE icu AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; In this model, icu would be a "provider kind" and icu63 would be the specific provider, which is named by the user. That seems like the least magical approach, to me. We need an ICU library; the administrator gives us one that looks like ICU; and we're happy. +1 I like this. The provider kind defines which path we take in our code, and the specific library unambiguously defines a specific collation behavior (I think, ignoring bugs?) -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 12:44 PM Tom Lane wrote: > Robert Haas writes: > > AIUI, the macro never returns in the sense of using the return > > statement, unlike PG_RETURN_WHATEVER(), which do. > > Oh! Now I see what you don't like about it. I thought you > meant "return to the call site", not "return to the call site's > caller". Agreed that that could be confusing. OK, good. I couldn't figure out what in the world we were arguing about... apparently I wasn't being as clear as I thought I was. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 12:42, Alvaro Herrera wrote: > On 2022-Dec-05, Tom Lane wrote: > >> I wrote: >>> Nah, it's so close to ereport that it looks like a typo. eseterr isn't >>> awful, perhaps. Or maybe err, but I've not thought of suitable . >> ... "errsave", maybe? > IMO eseterr is quite awful while errsave is not, so here goes my vote > for the latter. Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Request to modify view_table_usage to include materialized views
Jonathan Lemig writes: > Would it be possible to modify the information_schema.view_table_usage > (VTU) to include materialized views? Is it physically possible? Sure, it'd just take adjustment of some relkind checks. However, it's against project policy. We consider that because the information_schema views are defined by the SQL standard, they should only show standardized properties of standardized objects. If the standard ever gains materialized views, we'd adjust those views to show them. In the meantime, they aren't there. It would make little sense in any case to adjust only this one view. But if we were to revisit that policy, there are a lot of corner cases that would have to be thought through --- things that almost fit into the views, or that might appear in a very misleading way, etc. regards, tom lane
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 7:57 AM Vik Fearing wrote: > The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It > returns an implementation-dependent (i.e. non-deterministic) value from > the rows in its group. > > PFA an implementation of this aggregate. > > Can we please add "first_value" and "last_value" if we are going to add "some_random_value" to our library of aggregates? Also, maybe we should have any_value do something like compute a 50/50 chance that any new value seen replaces the existing chosen value, instead of simply returning the first value all the time. Maybe even prohibit the first value from being chosen so long as a second value appears. David J.
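For reference, the textbook way to get a uniformly random pick from a stream is reservoir sampling: replace the kept value with probability 1/n for the n-th non-null input, not 50/50. A sketch of such a transition function follows; the state struct and function are invented, the allocation should really happen in the aggregate memory context, and datumCopy() for pass-by-reference types is omitted:

#include "postgres.h"
#include "fmgr.h"
#include "common/pg_prng.h"

typedef struct AnyValueState	/* hypothetical state type */
{
	Datum		value;
	bool		isnull;
	int64		nseen;			/* non-null inputs seen so far */
} AnyValueState;

Datum
any_value_reservoir_trans(PG_FUNCTION_ARGS)
{
	AnyValueState *state;

	if (PG_ARGISNULL(0))
	{
		/* first call: allocate state (agg-context handling omitted) */
		state = (AnyValueState *) palloc0(sizeof(AnyValueState));
		state->isnull = true;
	}
	else
		state = (AnyValueState *) PG_GETARG_POINTER(0);

	if (!PG_ARGISNULL(1))
	{
		state->nseen++;
		/* keep the n-th input with probability 1/n => uniform choice */
		if (pg_prng_uint64_range(&pg_global_prng_state, 1, state->nseen) == 1)
		{
			state->value = PG_GETARG_DATUM(1);	/* datumCopy() omitted */
			state->isnull = false;
		}
	}
	PG_RETURN_POINTER(state);
}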
Re: Error-safe user functions
Andrew Dunstan writes: > Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. Seems like everybody's okay with errsave. I'll make a v2 in a little bit. I'd like to try updating array_in and/or record_in just to verify that indirection cases work okay, before we consider the design to be set. regards, tom lane
Re: ANY_VALUE aggregate
"David G. Johnston" writes: > Can we please add "first_value" and "last_value" if we are going to add > "some_random_value" to our library of aggregates? First and last according to what ordering? We have those in the window-aggregate case, and I don't think we want to encourage people to believe that "first" and "last" are meaningful otherwise. ANY_VALUE at least makes it clear that you're getting an unspecified one of the inputs. regards, tom lane
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 1:04 PM Tom Lane wrote: > "David G. Johnston" writes: > > Can we please add "first_value" and "last_value" if we are going to add > > "some_random_value" to our library of aggregates? > > First and last according to what ordering? We have those in the > window-aggregate case, and I don't think we want to encourage people > to believe that "first" and "last" are meaningful otherwise. > > ANY_VALUE at least makes it clear that you're getting an unspecified > one of the inputs. I have personally implemented first_value() and last_value() in the past in cases where I had guaranteed the ordering myself, or didn't care what ordering was used. I think they're perfectly sensible. But if we don't add them to core, at least they're easy to add in user-space. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Using WaitEventSet in the postmaster
Hi, On 2022-12-05 22:45:57 +1300, Thomas Munro wrote: > On Sat, Dec 3, 2022 at 10:41 AM Thomas Munro wrote: > > Here's an iteration like that. Still WIP grade. It passes, but there > > must be something I don't understand about this computer program yet, > > because if I move the "if (pending_..." section up into the block > > where WL_LATCH_SET has arrived (instead of testing those variables > > every time through the loop), a couple of tests leave zombie > > (unreaped) processes behind, indicating that something funky happened > > to the state machine that I haven't yet grokked. Will look more next > > week. > > Duh. The reason for that was the pre-existing special case for > PM_WAIT_DEAD_END, which used a sleep(100ms) loop to wait for children > to exit, which I needed to change to a latch wait. Fixed in the next > iteration, attached. > > The reason for the existing sleep-based approach was that we didn't > want to accept any more connections (or spin furiously because the > listen queue was non-empty). So in this version I invented a way to > suppress socket events temporarily with WL_SOCKET_IGNORE, and then > reactivate them after crash reinit. That seems like an odd special flag. Why do we need it? Is it just because we want to have assertions ensuring that something is being queried? > * WL_SOCKET_ACCEPT is a new event for an incoming connection (on Unix, >this is just another name for WL_SOCKET_READABLE, but Windows has a >different underlying event; this mirrors WL_SOCKET_CONNECTED on the >other end of a connection) Perhaps worth committing separately and soon? Seems pretty uncontroversial from here. > +/* > + * Object representing the state of a postmaster. > + * > + * XXX Lots of global variables could move in here. > + */ > +typedef struct > +{ > + WaitEventSet*wes; > +} Postmaster; > + Seems weird to introduce this but then basically have it be unused. I'd say either have a preceding patch move at least a few members into it, or just omit it for now. > + /* This may configure SIGURG, depending on platform. */ > + InitializeLatchSupport(); > + InitLocalLatch(); I'm mildly preferring InitProcessLocalLatch(), but not sure why - there's not really a conflicting meaning of "local" here. > +/* > + * Initialize the WaitEventSet we'll use in our main event loop. > + */ > +static void > +InitializeWaitSet(Postmaster *postmaster) > +{ > + /* Set up a WaitEventSet for our latch and listening sockets. */ > + postmaster->wes = CreateWaitEventSet(CurrentMemoryContext, 1 + > MAXLISTEN); > + AddWaitEventToSet(postmaster->wes, WL_LATCH_SET, PGINVALID_SOCKET, > MyLatch, NULL); > + for (int i = 0; i < MAXLISTEN; i++) > + { > + int fd = ListenSocket[i]; > + > + if (fd == PGINVALID_SOCKET) > + break; > + AddWaitEventToSet(postmaster->wes, WL_SOCKET_ACCEPT, fd, NULL, > NULL); > + } > +} The naming seems like it could be confused with latch.h infrastructure. InitPostmasterWaitSet()? > +/* > + * Activate or deactivate the server socket events. > + */ > +static void > +AdjustServerSocketEvents(Postmaster *postmaster, bool active) > +{ > + for (int pos = 1; pos < GetNumRegisteredWaitEvents(postmaster->wes); > ++pos) > + ModifyWaitEvent(postmaster->wes, > + pos, active ? WL_SOCKET_ACCEPT > : WL_SOCKET_IGNORE, > + NULL); > +} This seems to hardcode the specific wait events we're waiting for based on latch.c infrastructure. Not really convinced that's a good idea. > + /* > + * Latch set by signal handler, or new connection pending on > any of our > + * sockets? 
If the latter, fork a child process to deal with it. > + */ > + for (int i = 0; i < nevents; i++) > { > - if (errno != EINTR && errno != EWOULDBLOCK) > + if (events[i].events & WL_LATCH_SET) > { > - ereport(LOG, > - (errcode_for_socket_access(), > - errmsg("select() failed in > postmaster: %m"))); > - return STATUS_ERROR; > + ResetLatch(MyLatch); > + > + /* Process work scheduled by signal handlers. */ > + if (pending_action_request) > + process_action_request(postmaster); > + if (pending_child_exit) > + process_child_exit(postmaster); > + if (pending_reload_request) > + process_reload_request(); > + if (pending_shutdown_request) > +
Re: [PATCH] Add native windows on arm64 support
Hi, On 2022-12-05 14:12:41 +0900, Michael Paquier wrote: > With meson gaining in maturity, perhaps that's not the most urgent > thing as we will likely remove src/tools/msvc/ soon but I'd rather do > that right anyway as much as I can to avoid an incorrect state in the > tree at any time in its history. I'd actually argue that we should just not add arm64 support to src/tools/msvc/. > --- a/src/include/storage/s_lock.h > +++ b/src/include/storage/s_lock.h > @@ -708,13 +708,21 @@ typedef LONG slock_t; > #define SPIN_DELAY() spin_delay() > > /* If using Visual C++ on Win64, inline assembly is unavailable. > - * Use a _mm_pause intrinsic instead of rep nop. > + * Use _mm_pause (x64) or __isb (arm64) intrinsic instead of rep nop. > */ > #if defined(_WIN64) > static __forceinline void > spin_delay(void) > { > +#ifdef _M_ARM64 > + /* > + * See spin_delay aarch64 inline assembly definition above for details > + * ref: > https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions > + */ > + __isb(_ARM64_BARRIER_SY); > +#else > _mm_pause(); > +#endif > } > #else > static __forceinline void This looks somewhat wrong to me. We end up with some ifdefs on the function definition level, and some others inside the function body. I think it should be either or. > diff --git a/meson.build b/meson.build > index 725e10d815..e354ad7650 100644 > --- a/meson.build > +++ b/meson.build > @@ -1944,7 +1944,13 @@ int main(void) > > elif host_cpu == 'arm' or host_cpu == 'aarch64' > > - prog = ''' > + if cc.get_id() == 'msvc' > +cdata.set('USE_ARMV8_CRC32C', false) > +cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > +have_optimized_crc = true > + else > + > +prog = ''' > #include > > int main(void) Why does this need to be hardcoded? The compiler probe should just work for msvc. > @@ -1960,18 +1966,19 @@ int main(void) > } > ''' > > - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > without -march=armv8-a+crc', > - args: test_c_args) > -# Use ARM CRC Extension unconditionally > -cdata.set('USE_ARMV8_CRC32C', 1) > -have_optimized_crc = true > - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > with -march=armv8-a+crc', > - args: test_c_args + ['-march=armv8-a+crc']) > -# Use ARM CRC Extension, with runtime check > -cflags_crc += '-march=armv8-a+crc' > -cdata.set('USE_ARMV8_CRC32C', false) > -cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > -have_optimized_crc = true > +if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd > without -march=armv8-a+crc', > +args: test_c_args) > + # Use ARM CRC Extension unconditionally > + cdata.set('USE_ARMV8_CRC32C', 1) > + have_optimized_crc = true > +elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and > __crc32cd with -march=armv8-a+crc', > +args: test_c_args + ['-march=armv8-a+crc']) > + # Use ARM CRC Extension, with runtime check > + cflags_crc += '-march=armv8-a+crc' > + cdata.set('USE_ARMV8_CRC32C', false) > + cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1) > + have_optimized_crc = true > +endif >endif > endif And then this reindent wouldn't be needed. Greetings, Andres Freund
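For what it's worth, keeping the conditionals at the function-definition level as suggested could look like this (a reshuffle of the patch's own intrinsics, untested):

#if defined(_WIN64) && defined(_M_ARM64)
/* MSVC on ARM64: inline assembly is unavailable, use the ISB intrinsic */
static __forceinline void
spin_delay(void)
{
	__isb(_ARM64_BARRIER_SY);
}
#elif defined(_WIN64)
/* MSVC on x64: inline assembly is unavailable, use the pause intrinsic */
static __forceinline void
spin_delay(void)
{
	_mm_pause();
}
#endif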
Re: Request to modify view_table_usage to include materialized views
Hey Tom, Thanks for the info. I'll submit a document change request instead. Thanks! Jon On Mon, Dec 5, 2022 at 11:53 AM Tom Lane wrote: > Jonathan Lemig writes: > > Would it be possible to modify the information_schema.view_table_usage > > (VTU) to include materialized views? > > Is it physically possible? Sure, it'd just take adjustment of some > relkind checks. > > However, it's against project policy. We consider that because the > information_schema views are defined by the SQL standard, they should > only show standardized properties of standardized objects. If the > standard ever gains materialized views, we'd adjust those views to > show them. In the meantime, they aren't there. > > It would make little sense in any case to adjust only this one view. > But if we were to revisit that policy, there are a lot of corner > cases that would have to be thought through --- things that almost > fit into the views, or that might appear in a very misleading way, > etc. > > regards, tom lane >
Re: WAL Insertion Lock Improvements (was: Re: Avoid LWLockWaitForVar() for currently held WAL insertion lock in WaitXLogInsertionsToFinish())
Hi, FWIW, I don't see an advantage in 0003. If it allows us to make something else simpler / faster, cool, but on its own it doesn't seem worthwhile. On 2022-12-02 16:31:58 -0800, Nathan Bossart wrote: > On Fri, Dec 02, 2022 at 04:32:38PM +0530, Bharath Rupireddy wrote: > > On Fri, Dec 2, 2022 at 6:10 AM Andres Freund wrote: > >> I'm not sure this is quite right - don't we need a memory barrier. But I > >> don't > >> see a reason to not just leave this code as-is. I think this should be > >> optimized entirely in lwlock.c > > > > Actually, we don't need that at all as LWLockWaitForVar() will return > > immediately if the lock is free. So, I removed it. > > I briefly looked at the latest patch set, and I'm curious how this change > avoids introducing memory ordering bugs. Perhaps I am missing something > obvious. I'm a bit confused too - the comment above talks about LWLockWaitForVar(), but the patches seem to optimize LWLockUpdateVar(). I think it'd be safe to optimize LWLockConflictsWithVar(), due to some pre-existing, quite crufty, code. LWLockConflictsWithVar() says: * Test first to see if it the slot is free right now. * * XXX: the caller uses a spinlock before this, so we don't need a memory * barrier here as far as the current usage is concerned. But that might * not be safe in general. which happens to be true in the single, indirect, caller: /* Read the current insert position */ SpinLockAcquire(&Insert->insertpos_lck); bytepos = Insert->CurrBytePos; SpinLockRelease(&Insert->insertpos_lck); reservedUpto = XLogBytePosToEndRecPtr(bytepos); I think at the very least we ought to have a comment in WaitXLogInsertionsToFinish() highlighting this. It's not at all clear to me that the proposed fast-path for LWLockUpdateVar() is safe. I think at the very least we could end up missing waiters that we should have woken up. I think it ought to be safe to do something like pg_atomic_exchange_u64().. if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS)) return; because the pg_atomic_exchange_u64() will provide the necessary memory barrier. Greetings, Andres Freund
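Putting Andres's fragments together, the safe fast path he outlines for LWLockUpdateVar() would be roughly the following. The lock-internal names (pg_atomic_exchange_u64, LW_FLAG_HAS_WAITERS) are real; the function shape and the slow-path helper are assumptions:

static void
LWLockUpdateVarFastPath(LWLock *lock, pg_atomic_uint64 *valptr, uint64 val)
{
	/*
	 * The atomic exchange acts as a full memory barrier, so the flag
	 * read below cannot be reordered ahead of publishing the new value.
	 */
	pg_atomic_exchange_u64(valptr, val);

	if (!(pg_atomic_read_u32(&lock->state) & LW_FLAG_HAS_WAITERS))
		return;					/* no waiters registered: nothing to wake */

	/* otherwise take the wait-list lock and wake LWLockWaitForVar()
	 * waiters, as the existing code does */
	LWLockUpdateVarSlowPath(lock, valptr, val);	/* hypothetical */
}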
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Fri, Dec 2, 2022 at 9:58 AM Jacob Champion wrote: > Thanks for the nudge -- running with OpenSSL 3.0.7 in CI did not fix > the issue. I suspect a problem with our error stack handling... It is a problem with the error queue, but *whose* problem is probably up for debate. The queue looks like this after SSL_connect() returns: error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:1669:STORE routines:ossl_store_get0_loader_int:unregistered scheme:crypto/store/store_register.c:237:scheme=file error:8002:system library:file_open:No such file or directory:providers/implementations/storemgmt/file_store.c:269:calling stat(/usr/local/etc/openssl@3/certs) error:0A86:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1883: Note that the error we care about is at the end, not the front. We are not the first using Homebrew to run into this, and best I can tell, it is a brew-specific bug. The certificate directory that's been configured isn't actually installed by the formula. (A colleague here was able to verify the same behavior on their local machine, so it's not a Cirrus problem.) The confusing "unrecognized scheme" message has thrown at least a few people off the scent. That refers to an OpenSSL STORE URI, not the URI describing the peer. (Why `file://` is considered "unregistered" is beyond me, considering the documentation says that file URI support is built into libcrypto.) From inspection, that error is put onto the queue before checking to see if the certificate directory exists, and then it's popped back off the queue if the directory is found(?!). Unfortunately, the directory isn't there for Homebrew, which means we get both errors, the first of which is not actually helpful. And then it pushes the pair of errors two more times, for reasons I haven't bothered looking into yet. Maybe this is considered an internal error caused by a packaging bug, in which case I expect the formula maintainers to ask why it worked for 1.1. Maybe it's a client error because we're not looking for the best error on the queue, in which case I ask how we're supposed to know which error is the most interesting. (I actually kind of know the answer to this -- OpenSSL's builtin clients appear to check the front of the queue first, to see if it's an SSL-related error, and then if it's not they grab the error at the end of the queue instead. To which I ask: *what?*) Maybe clients are expected to present the entirety of the queue. But then, why are three separate copies of the same errors spamming the queue? We can't present that. I'm considering filing an issue with OpenSSL, to see what they suggest a responsible client should do in this situation... > Separately from this, our brew cache in Cirrus is extremely out of > date. Is there something that's supposed to be running `brew update` > (or autoupdate) that is stuck or broken? (If this is eventually considered a bug in the formula, we'll need to update to get the fix regardless.) --Jacob
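For reference, the front-else-last heuristic described above can be expressed with the real OpenSSL error-queue API like so; this is a sketch of client-side triage, not libpq code:

#include <openssl/err.h>

/*
 * Prefer an SSL-library error at the front of the queue; otherwise fall
 * back to the most recently pushed error, mimicking what OpenSSL's own
 * builtin clients appear to do.
 */
static unsigned long
pick_interesting_error(void)
{
	unsigned long first = ERR_peek_error();
	unsigned long last = ERR_peek_last_error();

	if (ERR_GET_LIB(first) == ERR_LIB_SSL)
		return first;
	return last;
}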
Re: PGDOCS - Logical replication GUCs - added some xrefs
Hi, On Mon, Oct 24, 2022 at 12:45 AM Peter Smith wrote: > Hi hackers. > > There is a docs Logical Replication section "31.10 Configuration > Settings" [1] which describes some logical replication GUCs, and > details on how they interact with each other and how to take that into > account when setting their values. > > There is another docs Server Configuration section "20.6 Replication" > [2] which lists the replication-related GUC parameters, and what they > are for. > > Currently AFAIK those two pages are unconnected, but I felt it might > be helpful if some of the parameters in the list [2] had xref links to > the additional logical replication configuration information [1]. PSA > a patch to do that. > +1 on the patch. Some feedback on v5 below. > + > + For logical replication configuration settings refer > + also to . > + > + I feel the top paragraph needs to explain terminology for logical replication like it does for physical replication in addition to linking to the logical replication config page. I'm recommending this as we use terms like subscriber etc. in description of parameters without introducing them first. As an example, something like below might work. These settings control the behavior of the built-in streaming replication feature (see Section 27.2.5) and logical replication (link). For physical replication, servers will be either a primary or a standby server. Primaries can send data, while standbys are always receivers of replicated data. When cascading replication (see Section 27.2.7) is used, standby servers can also be senders, as well as receivers. Parameters are mainly for sending and standby servers, though some parameters have meaning only on the primary server. Settings may vary across the cluster without problems if that is required. For logical replication, servers will either be publishers (also called senders in the sections below) or subscribers. Publishers are , Subscribers are... > + > + See for more details > + about setting max_replication_slots for logical > + replication. > + The link doesn't add any new information regarding max_replication_slots other than "to reserve some for table sync" and has a good amount of unrelated info. I think it might be useful to just put a line here asking to reserve some for table sync instead of linking to the entire logical replication config section. > - Logical replication requires several configuration options to be set. > + Logical replication requires several configuration parameters to be set. May not be needed? The docs have references to both options and parameters but I don't feel strongly about it. Feel free to use what you prefer. I think we should add an additional line to the intro here saying that parameters are mostly relevant only one of the subscriber or publisher. Maybe a better written version of "While max_replication_slots means different things on the publisher and subscriber, all other parameters are relevant only on either the publisher or the subscriber." > + > + Notes I don't think we need this sub-section. If I understand correctly, these parameters are effective only on the subscriber side. So, any reason to not include them in that section? > + > + > +Logical replication workers are also affected by > +wal_receiver_timeout, > +wal_receiver_status_interval and > +wal_receiver_retry_interval. > + > + I like moving this; it makes more sense here. Should we remove it from config.sgml? 
It seems a bit out of place there as we generally talk only about individual parameters there, and this line is general logical replication subscriber advice which is more suited to logical-replication.sgml.

> +
> +Configuration parameter
> +max_worker_processes
> +may need to be adjusted to accommodate for replication workers, at least (
> +max_logical_replication_workers
> ++ 1). Some extensions and parallel queries also take
> +worker slots from max_worker_processes.
> +
> +

I think we should move this to the subscriber section as said above. It's useful to know this and people might skip over the notes.

> ~~
>
> Meanwhile, I also suspect that the main blurb top of [1] is not
> entirely correct... it says "These settings control the behaviour of
> the built-in streaming replication feature", although some of the GUCs
> mentioned later in this section are clearly for "logical replication".
>
> Thoughts?
>

I shared an idea above.

Regards,
Samay

>
> --
> [1] 31.10 Configuration Settings -
> https://www.postgresql.org/docs/current/logical-replication-config.html
> [2] 20.6 Replication -
> https://www.postgresql.org/docs/current/runtime-config-replication.html
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia
>
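To make the max_worker_processes arithmetic discussed above concrete, a minimal subscriber-side postgresql.conf sketch (the values are illustrative only, not recommendations):

    max_replication_slots = 10             # one per subscription, plus some
                                           # spare for table synchronization
    max_logical_replication_workers = 4
    max_worker_processes = 8               # at least
                                           # max_logical_replication_workers + 1;
                                           # extensions and parallel queries
                                           # also take slots from this pool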
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 11:36 AM Andrew Dunstan wrote: > > On 2022-12-05 Mo 11:20, Robert Haas wrote: > > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > >> Robert Haas writes: > >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback > >>> Oh, I like that, especially errfeedback. > >> efeedback? But TBH I do not think any of these are better than ereturn. > > I do. Having a macro name that is "return" plus one character is going > > to make people think that it returns. I predict that if you insist on > > using that name people are still going to be making mistakes based on > > that confusion 10 years from now. > > > > OK, I take both this point and Tom's about trying to keep it the same > length. So we need something that's 7 letters, doesn't say 'return' and > preferably begins with 'e'. I modestly suggest 'eseterr', or if we like > the 'feedback' idea 'efeedbk'. > > > Consulting https://www.thesaurus.com/browse/feedback again: ereply clocks in at 7 characters.
Re: Error-safe user functions
On Mon, Dec 5, 2022 at 1:00 PM Tom Lane wrote: > Andrew Dunstan writes: > > Wait a minute! Oh, no, sorry, as you were, 'errsave' is fine. > > Seems like everybody's okay with errsave. I'll make a v2 in a > little bit. I'd like to try updating array_in and/or record_in > just to verify that indirection cases work okay, before we consider > the design to be set. > +1 to errsave.
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 12:57 PM David G. Johnston < david.g.johns...@gmail.com> wrote: > On Mon, Dec 5, 2022 at 7:57 AM Vik Fearing > wrote: > >> The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It >> returns an implementation-dependent (i.e. non-deterministic) value from >> the rows in its group. >> >> PFA an implementation of this aggregate. >> >> > Can we please add "first_value" and "last_value" if we are going to add > "some_random_value" to our library of aggregates? > > Also, maybe we should have any_value do something like compute a 50/50 > chance that any new value seen replaces the existing chosen value, instead > of simply returning the first value all the time. Maybe even prohibit the > first value from being chosen so long as a second value appears. > > David J. > Adding to the pile of wanted aggregates: in the past I've lobbied for only_value() which is like first_value() but it raises an error on encountering a second value.
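For what it's worth, only_value() is easy enough to hand-roll today. A rough sketch -- the name and the exact semantics (erroring on a second distinct non-null value) are my own assumptions, not a concrete proposal:

    -- Error out when the group contains more than one distinct
    -- non-null value.
    CREATE FUNCTION only_value_trans(state anyelement, val anyelement)
    RETURNS anyelement
    LANGUAGE plpgsql IMMUTABLE STRICT AS
    $$
    BEGIN
        IF state IS DISTINCT FROM val THEN
            RAISE EXCEPTION 'only_value() found more than one value';
        END IF;
        RETURN state;
    END;
    $$;

    CREATE AGGREGATE only_value(anyelement) (
        SFUNC = only_value_trans,
        STYPE = anyelement
    );

With this, SELECT only_value(x) FROM t errors as soon as a group contains two different non-null values; nulls are skipped because the transition function is strict, so the first non-null input seeds the state.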
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi,

- I think it might be worth renaming IOCONTEXT_BUFFER_POOL to IOCONTEXT_{NORMAL, PLAIN, DEFAULT}. I'd like at some point to track WAL IO, temporary file IO, etc., and it doesn't seem useful to define a version of BUFFER_POOL for each of them. And it'd make it less confusing, because all the other existing contexts are also in the buffer pool (for now; can't wait for "bypass" or whatever to be tracked as well).

- Given that IOContextForStrategy() is defined in freelist.c, and that declaring it in pgstat.h requires including buf.h, I think it's probably better to move IOContextForStrategy()'s declaration to freelist.h (doesn't exist, but whatever the right one is).

- pgstat_backend_io_stats_assert_well_formed() doesn't seem to belong in pgstat.c. Why not pgstat_io_ops.c?

- Do pgstat_io_context_ops_assert_zero(), pgstat_io_op_assert_zero() have to be in pgstat.h?

I think the only non-trivial thing is the first point; the rest is stuff that I can also evolve during commit.

Greetings,

Andres Freund
Re: Allow placeholders in ALTER ROLE w/o superuser
On Mon, Dec 5, 2022 at 8:18 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > I couldn't find any discussion of the idea of adding "(s)" to the
> > variable name in order to mark the variable userset in the catalog, and
> > I have to admit I find it a bit strange. Are we really agreed that
> > that's the way to proceed?
>
> I hadn't been paying close attention to this thread, sorry.
>
> I agree that that seems like a very regrettable choice,
> especially if you anticipate having to bump catversion anyway.

I totally understand that this change requires a catversion bump. I've reflected this in the commit message.

> Better to add a bool column to the catalog.

What about adding a boolean array to pg_db_role_setting? So, pg_db_role_setting would have the following columns:
 * setdatabase oid
 * setrole oid
 * setconfig text[]
 * setuser bool[]

--
Regards,
Alexander Korotkov
Re: ANY_VALUE aggregate
On Mon, Dec 5, 2022 at 2:31 PM Corey Huinker wrote: > Adding to the pile of wanted aggregates: in the past I've lobbied for > only_value() which is like first_value() but it raises an error on > encountering a second value. Yeah, that's another that I have hand-rolled in the past. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Error-safe user functions
On 2022-12-05 Mo 14:22, Corey Huinker wrote: > > On Mon, Dec 5, 2022 at 11:36 AM Andrew Dunstan > wrote: > > > On 2022-12-05 Mo 11:20, Robert Haas wrote: > > On Mon, Dec 5, 2022 at 11:09 AM Tom Lane wrote: > >> Robert Haas writes: > >>> On Sat, Dec 3, 2022 at 10:57 PM Corey Huinker > wrote: > 2. ereturn_* => errfeedback / error_feedback / feedback > >>> Oh, I like that, especially errfeedback. > >> efeedback? But TBH I do not think any of these are better than > ereturn. > > I do. Having a macro name that is "return" plus one character is > going > > to make people think that it returns. I predict that if you > insist on > > using that name people are still going to be making mistakes > based on > > that confusion 10 years from now. > > > > OK, I take both this point and Tom's about trying to keep it the same > length. So we need something that's 7 letters, doesn't say > 'return' and > preferably begins with 'e'. I modestly suggest 'eseterr', or if we > like > the 'feedback' idea 'efeedbk'. > > > > Consulting https://www.thesaurus.com/browse/feedback again: > ereply clocks in at 7 characters. It does? cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: ANY_VALUE aggregate
On 12/5/22 15:57, Vik Fearing wrote: The SQL:2023 Standard defines a new aggregate named ANY_VALUE. It returns an implementation-dependent (i.e. non-deterministic) value from the rows in its group. PFA an implementation of this aggregate. Here is v2 of this patch. I had forgotten to update sql_features.txt. -- Vik Fearing From a9bb61aab9788ae25fdcd28f7dcfb54a263665cc Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Sat, 9 Apr 2022 00:07:38 +0200 Subject: [PATCH] Implement ANY_VALUE aggregate SQL:2023 defines an ANY_VALUE aggregate whose purpose is to emit an implementation-dependent (i.e. non-deterministic) value from the aggregated rows. --- doc/src/sgml/func.sgml | 14 ++ src/backend/catalog/sql_features.txt | 1 + src/backend/utils/adt/misc.c | 12 src/include/catalog/pg_aggregate.dat | 4 src/include/catalog/pg_proc.dat | 8 src/test/regress/expected/aggregates.out | 18 ++ src/test/regress/sql/aggregates.sql | 5 + 7 files changed, 62 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2052d3c844..1823ee71d7 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -19706,16 +19706,30 @@ SELECT NULLIF(value, '(none)') ... Description Partial Mode + + + + any_value + +any_value ( "any" ) +same as input type + + +Chooses a non-deterministic value from the non-null input values. + + Yes + + array_agg array_agg ( anynonarray ) anyarray diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index 8704a42b60..b7b6ad6334 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -515,16 +515,17 @@ T617 FIRST_VALUE and LAST_VALUE functions YES T618 NTH_VALUE function NO function exists, but some options missing T619 Nested window functions NO T620 WINDOW clause: GROUPS option YES T621 Enhanced numeric functions YES T622 Trigonometric functions YES T623 General logarithm functions YES T624 Common logarithm functions YES T625 LISTAGG NO +T626 ANY_VALUE YES T631 IN predicate with one list element YES T641 Multiple column assignment NO only some syntax variants supported T651 SQL-schema statements in SQL routines YES T652 SQL-dynamic statements in SQL routines NO T653 SQL-schema statements in external routines YES T654 SQL-dynamic statements in external routines NO T655 Cyclically dependent routines YES T811 Basic SQL/JSON constructor functions NO diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 9c13251231..94c92de06d 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -928,8 +928,20 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS) idxoid = RelationGetReplicaIndex(rel); table_close(rel, AccessShareLock); if (OidIsValid(idxoid)) PG_RETURN_OID(idxoid); else PG_RETURN_NULL(); } + +Datum +any_value_trans(PG_FUNCTION_ARGS) +{ + /* Return the first non-null argument */ + if (!PG_ARGISNULL(0)) + PG_RETURN_DATUM(PG_GETARG_DATUM(0)); + if (!PG_ARGISNULL(1)) + PG_RETURN_DATUM(PG_GETARG_DATUM(1)); + PG_RETURN_NULL(); +} + diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat index b9110a5298..37626d6f0c 100644 --- a/src/include/catalog/pg_aggregate.dat +++ b/src/include/catalog/pg_aggregate.dat @@ -620,9 +620,13 @@ aggtransfn => 'ordered_set_transition_multi', aggfinalfn => 'cume_dist_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, { aggfnoid => 'dense_rank(any)', aggkind => 'h', aggnumdirectargs => '1', aggtransfn => 'ordered_set_transition_multi', 
aggfinalfn => 'dense_rank_final', aggfinalextra => 't', aggfinalmodify => 'w', aggmfinalmodify => 'w', aggtranstype => 'internal' }, +# any_value +{ aggfnoid => 'any_value(anyelement)', aggtransfn => 'any_value_trans', + aggcombinefn => 'any_value_trans', aggtranstype => 'anyelement' }, + ] diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index f9301b2627..2ee4797559 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11849,9 +11849,17 @@ proname => 'brin_minmax_multi_summary_recv', provolatile => 's', prorettype => 'pg_brin_minmax_multi_summary', proargtypes => 'internal', prosrc => 'brin_minmax_multi_summary_recv' }, { oid => '4641', descr => 'I/O', proname => 'brin_minmax_multi_summary_send', provolatile => 's', prorettype => 'bytea', proargtypes => 'pg_brin_minmax_multi_summary', prosrc => 'brin_minmax_multi_summary_send' }, +{ oid => '8981', descr => 'arbitrary value fro
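For context, using the new aggregate looks something like this (the table and data here are made up):

    SELECT dept, any_value(name)
    FROM employees
    GROUP BY dept;
    -- returns one name per department; which one is
    -- implementation-dependent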
Re: predefined role(s) for VACUUM and ANALYZE
Hello,

While looking into the new feature, I found the following situation with the \dp command displaying privileges on the system tables:

GRANT VACUUM, ANALYZE ON TABLE pg_type TO alice;

SELECT relacl FROM pg_class WHERE oid = 'pg_type'::regclass;
                           relacl
-------------------------------------------------------------
 {=r/postgres,postgres=arwdDxtvz/postgres,alice=vz/postgres}
(1 row)

But the \dp command does not show the granted privileges:

\dp pg_type
                          Access privileges
 Schema | Name | Type | Access privileges | Column privileges | Policies
--------+------+------+-------------------+-------------------+----------
(0 rows)

The comment in src/bin/psql/describe.c explains the situation:

    /*
     * Unless a schema pattern is specified, we suppress system and temp
     * tables, since they normally aren't very interesting from a permissions
     * point of view. You can see 'em by explicit request though, eg with \z
     * pg_catalog.*
     */

So to see the privileges you have to explicitly specify the schema name:

\dp pg_catalog.pg_type
                                     Access privileges
   Schema   |  Name   | Type  |      Access privileges      | Column privileges | Policies
------------+---------+-------+-----------------------------+-------------------+----------
 pg_catalog | pg_type | table | =r/postgres                +|                   |
            |         |       | postgres=arwdDxtvz/postgres+|                   |
            |         |       | alice=vz/postgres           |                   |
(1 row)

But perhaps this behavior should be reviewed or at least documented?

- Pavel Luzanov
Re: [PoC] Reducing planning time when tables have many partitions
On Tue, 6 Dec 2022 at 04:45, Thom Brown wrote:
> Testing your patches with the same 1024 partitions, each with 64
> sub-partitions, I get a planning time of 205.020 ms, which is now a
> 1,377x speedup. This has essentially reduced the planning time from a
> catastrophe to a complete non-issue. Huge win!

Thanks for testing the v10 patches. I wouldn't have expected such additional gains from v10. I was mostly focused on trying to minimise any performance regression for simple queries that wouldn't benefit from indexing the EquivalenceMembers. Your query sounds like it does not fit into that category.

Perhaps it is down to the fact that v9-0002 or v9-0003 reverts a couple of the optimisations, and that is causing v9 to be slower than v10 for your query. It's hard to tell without more details of what you're running. Is this a schema and query you're able to share? Or perhaps mock up a script of something similar enough to allow us to see why v9 and v10 are so different?

Additionally, it would be interesting to see if patching with v10-0002 alone helps the performance of your query at all. I didn't imagine that change would give us anything easily measurable, but partition pruning makes extensive use of Bitmapsets, so perhaps you've found something. If you have, then it might be worth considering v10-0002 independently of the EquivalenceMember indexing work.

David
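In case it helps to reproduce something comparable, a mock-up in the shape described (1024 range partitions, each hash-partitioned 64 ways) might look like the following. This is purely a guess at the schema, since the real one hasn't been shared:

    CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);

    DO $$
    BEGIN
        FOR i IN 0..1023 LOOP
            EXECUTE format('CREATE TABLE t_%s PARTITION OF t FOR VALUES '
                           'FROM (%s) TO (%s) PARTITION BY HASH (b)',
                           i, i, i + 1);
            FOR j IN 0..63 LOOP
                EXECUTE format('CREATE TABLE t_%s_%s PARTITION OF t_%s '
                               'FOR VALUES WITH (MODULUS 64, REMAINDER %s)',
                               i, j, i, j);
            END LOOP;
        END LOOP;
    END $$;

    -- then time something like:
    EXPLAIN (SUMMARY) SELECT * FROM t WHERE a = 1 AND b = 1;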
Re: Collation version tracking for macOS
On Tue, Dec 6, 2022 at 6:45 AM Joe Conway wrote: > On 12/5/22 12:41, Jeff Davis wrote: > > On Mon, 2022-12-05 at 16:12 +1300, Thomas Munro wrote: > >> 1. I think we should seriously consider provider = ICU63. I still > >> think search-by-collversion is a little too magical, even though it > >> clearly can be made to work. Of the non-magical systems, I think > >> encoding the choice of library into the provider name would avoid the > >> need to add a second confusing "X_version" concept alongside our > >> existing "X_version" columns in catalogues and DDL syntax, while > >> still > >> making it super clear what is going on. > > > > As I understand it, this is #2 in your previous list? > > > > Can we put the naming of the provider into the hands of the user, e.g.: > > > >CREATE COLLATION PROVIDER icu63 TYPE icu > > AS '/path/to/libicui18n.so.63', '/path/to/libicuuc.so.63'; > > > > In this model, icu would be a "provider kind" and icu63 would be the > > specific provider, which is named by the user. > > > > That seems like the least magical approach, to me. We need an ICU > > library; the administrator gives us one that looks like ICU; and we're > > happy. > > +1 > > I like this. The provider kind defines which path we take in our code, > and the specific library unambiguously defines a specific collation > behavior (I think, ignoring bugs?) OK, I'm going to see what happens if I try to wrangle that stuff into a new catalogue table.
Re: Error-safe user functions
I wrote: > Seems like everybody's okay with errsave. I'll make a v2 in a > little bit. I'd like to try updating array_in and/or record_in > just to verify that indirection cases work okay, before we consider > the design to be set. v2 as promised, incorporating the discussed renamings as well as some follow-on ones (ErrorReturnContext -> ErrorSaveContext, notably). I also tried moving the struct into a new header file, miscnodes.h after Andrew's suggestion upthread. That seems at least marginally cleaner than putting it in nodes.h, although I'm not wedded to this choice. I was really glad that I took the trouble to update some less-trivial input functions, because I learned two things: * It's better if InputFunctionCallSafe will tolerate the case of not being passed an ErrorSaveContext. In the COPY hack it felt worthwhile to have completely separate code paths calling InputFunctionCallSafe or InputFunctionCall, but that's less appetizing elsewhere. * There's a crying need for a macro that wraps up errsave() with an immediate return. Hence, ereturn() is reborn from the ashes. I hope Robert won't object to that name if it *does* do a return. I feel pretty good about this version; it seems committable if there are not objections. Not sure if we should commit 0003 like this, though. regards, tom lane diff --git a/doc/src/sgml/ref/create_type.sgml b/doc/src/sgml/ref/create_type.sgml index 693423e524..53b8d15f97 100644 --- a/doc/src/sgml/ref/create_type.sgml +++ b/doc/src/sgml/ref/create_type.sgml @@ -900,6 +900,15 @@ CREATE TYPE name function is written in C. + + In PostgreSQL version 16 and later, it is + desirable for base types' input functions to return safe + errors using the new errsave() mechanism, rather + than throwing ereport() exceptions as in previous + versions. See src/backend/utils/fmgr/README for + more information. + + diff --git a/src/backend/nodes/Makefile b/src/backend/nodes/Makefile index 4368c30fdb..7c594be583 100644 --- a/src/backend/nodes/Makefile +++ b/src/backend/nodes/Makefile @@ -56,6 +56,7 @@ node_headers = \ nodes/bitmapset.h \ nodes/extensible.h \ nodes/lockoptions.h \ + nodes/miscnodes.h \ nodes/replnodes.h \ nodes/supportnodes.h \ nodes/value.h \ diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl index 7212bc486f..08992dfd47 100644 --- a/src/backend/nodes/gen_node_support.pl +++ b/src/backend/nodes/gen_node_support.pl @@ -68,6 +68,7 @@ my @all_input_files = qw( nodes/bitmapset.h nodes/extensible.h nodes/lockoptions.h + nodes/miscnodes.h nodes/replnodes.h nodes/supportnodes.h nodes/value.h @@ -89,6 +90,7 @@ my @nodetag_only_files = qw( executor/tuptable.h foreign/fdwapi.h nodes/lockoptions.h + nodes/miscnodes.h nodes/replnodes.h nodes/supportnodes.h ); diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c index 2585e24845..81727ecb28 100644 --- a/src/backend/utils/error/elog.c +++ b/src/backend/utils/error/elog.c @@ -71,6 +71,7 @@ #include "libpq/libpq.h" #include "libpq/pqformat.h" #include "mb/pg_wchar.h" +#include "nodes/miscnodes.h" #include "miscadmin.h" #include "pgstat.h" #include "postmaster/bgworker.h" @@ -686,6 +687,154 @@ errfinish(const char *filename, int lineno, const char *funcname) } +/* + * errsave_start --- begin a "safe" error-reporting cycle + * + * If "context" isn't an ErrorSaveContext node, this behaves as + * errstart(ERROR, domain), and the errsave() macro ends up acting + * exactly like ereport(ERROR, ...). 
+ * + * If "context" is an ErrorSaveContext node, but the node creator only wants + * notification of the fact of a safe error without any details, just set + * the error_occurred flag in the ErrorSaveContext node and return false, + * which will cause us to skip the remaining error processing steps. + * + * Otherwise, create and initialize error stack entry and return true. + * Subsequently, errmsg() and perhaps other routines will be called to further + * populate the stack entry. Finally, errsave_finish() will be called to + * tidy up. + */ +bool +errsave_start(void *context, const char *domain) +{ + ErrorSaveContext *escontext; + ErrorData *edata; + + /* + * Do we have a context for safe error reporting? If not, just punt to + * errstart(). + */ + if (context == NULL || !IsA(context, ErrorSaveContext)) + return errstart(ERROR, domain); + + /* Report that an error was detected */ + escontext = (ErrorSaveContext *) context; + escontext->error_occurred = true; + + /* Nothing else to do if caller wants no further details */ + if (!escontext->details_wanted) + return false; + + /* + * Okay, crank up a stack entry to store the info in. + */ + + recursion_depth++; + if (++errordata_stack_depth >= ERRORDATA_STACK_SIZE) + { + /* + * Wups, stack not big enough. We treat this as a PANIC condition + * because it suggests an infinite loop of errors dur
Re: Transaction timeout
On Sat, Dec 3, 2022 at 9:41 AM Andrey Borodin wrote: > Fixed. Added test for this. > Thanks! Tested (gitpod: https://gitpod.io/#https://gitlab.com/NikolayS/postgres/tree/transaction_timeout-v2 ), works as expected.
Re: Questions regarding distinct operation implementation
On Mon, 5 Dec 2022 at 02:34, Ankit Kumar Pandey wrote:
> Interesting problem, Hashtables created in normal aggregates (AGG_HASHED
> mode) may provide some reference as they have hashagg_spill_tuple but I
> am not sure of any prior implementation of hashtable with counter and
> spill.

I'm unsure if there's such guidance that can be gleaned from studying nodeAgg.c. IIRC, that works by deciding up-front how many partitions we're going to split the hash key space into and then writing out tuples to files based on "hashkey MOD number-of-partitions". At the end of that, you can just aggregate tuples one partition at a time. All groups are in the same file/partition.

The reason this does not seem useful to your case is that you need to be able to quickly look up a given Datum or set of Datums to check if they are unique or not. For that, you'd need to reload the hash table every time your lookup lands on a different partition of the hashkey space. I fail to see how that could ever be fast unless there happened to only be 1 partition. To make that worse, when a tuple goes out of the frame and the counter that's tracking how many times the Datum(s) appeared reaches 0, you need to write the entire file out again minus that tuple.

Let's say your window function is on a column which is distinct or *very* close to it and the given window is moving the window frame forward 1 tuple per input tuple. If each subsequent Datum hashes to a different partition, then you're going to need to load the file for that hash key space to check if that Datum has already been seen, then you're going to have to evict that tuple from the file as it moves out of frame, so that means reading and writing that entire file per input tuple consumed. That'll perform very poorly!

It's possible that you could maybe speed it up a bit with some lossy hash table that sits atop of this and can only tell you if the given key definitely does *not* exist. You'd then be able to just write that tuple out to the partition and you'd not have to read or write out the file again. It's going to slow down to a crawl when the lossy table contains too many false positives though.

> Major concern is, if we go through tuplesort route (without order
> by case), would we get handicapped in future if we want order by or more
> features?

Yeah, deciding that before you go down one of the paths is going to be important. I imagine the reason that you've not found another database that supports DISTINCT window functions in a window with an ORDER BY clause is that it's very hard to make it in a way where it performs well in all cases.

Maybe another way to go about it that will give you less lock-in if we decide to make ORDER BY work later would be to design some new tuple-store-like data structure that can be defined with a lookup key so you could ask it if a given key is stored and it would return the answer quickly without having to trawl through all stored tuples. It would also need to support the same positional lookups as tuplestore does today so that all the evicting-tuples-from-the-window-frame stuff works as it does today. If you made something like that, then the changes required in nodeWindowAgg.c would be significantly reduced. You'd also just have 1 work_mem limit to abide by instead of having to consider sharing that between a tuplestore and a hashtable/tuplesort.

Maybe as step 1, you could invent keyedtuplestore.c and consume tuplestore's functions but layer on the lossy hashtable idea that I mentioned above.
That'd have to be more than just a bloom filter as you need a payload of the count of tuples matching the given hashkey MOD nbuckets. If you then had a function like keyedtuplestore_key_definitely_does_not_exist() (can't think of a better name now) then you can just look up the lossy table, and if there are 0 tuples at that lossy bucket, then you can keyedtuplestore_puttupleslot() from nodeWindowAgg.c. keyedtuplestore_key_definitely_does_not_exist() would have to work much harder if there were > 0 tuples with the same lossy hashkey. You'd need to trawl through the tuples and check each one. Perhaps that could be tuned a bit so if you get too many collisions then the lossy table could be rehashed to a larger size. It's going to fall flat on its face, performance-wise, when the hash table can't be made larger due to work_mem constraints.

Anyway, that's a lot of only partially thought-through ideas above. If you're working on a patch like this, you should expect to have to rewrite it a dozen or 2 times as new ideas arrive. If you're good at using the fact that the new patch is better than the old one as motivation to continue, then you're onto an attitude that is PostgreSQL-community-proof :-) (thankfully) we're often not good at "let's just commit it now and make it better later".

David
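In case it's useful, the lossy-table idea in its most stripped-down form looks something like this. A plain C sketch, not PostgreSQL code; rehashing, work_mem accounting and the tuplestore side are all omitted:

    #include <stdbool.h>
    #include <stdint.h>

    #define NBUCKETS 8192           /* illustrative size only */

    /*
     * Per-bucket counters over "hashkey MOD nbuckets".  A zero bucket
     * proves the key was never stored; a non-zero bucket is only a
     * "maybe", so the caller must then check the stored tuples.
     */
    typedef struct LossyTable
    {
        uint32 counts[NBUCKETS];
    } LossyTable;

    static void
    lossy_add(LossyTable *t, uint32 hashkey)
    {
        t->counts[hashkey % NBUCKETS]++;
    }

    static void
    lossy_remove(LossyTable *t, uint32 hashkey)
    {
        t->counts[hashkey % NBUCKETS]--;    /* tuple left the frame */
    }

    static bool
    key_definitely_does_not_exist(const LossyTable *t, uint32 hashkey)
    {
        return t->counts[hashkey % NBUCKETS] == 0;
    }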
Re: Transaction timeout
The following review has been posted through the commitfest application:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

Tested, works as expected; documentation is not yet added
Re: initdb: Refactor PG_CMD_PUTS loops
On 02.12.22 15:07, Andrew Dunstan wrote: On 2022-12-01 Th 05:02, Peter Eisentraut wrote: Keeping the SQL commands that initdb runs in string arrays before feeding them to PG_CMD_PUTS() seems unnecessarily verbose and inflexible. In some cases, the array only has one member. In other cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a command, but that would require breaking up the loop or using workarounds like replace_token(). This patch unwinds all that; it's much simpler that way. Looks reasonable. (Most of this dates back to 2003/2004, the very early days of initdb.c.) committed
Re: Transaction timeout
Hi,

On 2022-12-03 09:41:04 -0800, Andrey Borodin wrote:
> @@ -2720,6 +2723,7 @@ finish_xact_command(void)
>
>  	if (xact_started)
>  	{
> +
>  		CommitTransactionCommand();
>
>  #ifdef MEMORY_CONTEXT_CHECKING

Spurious newline added.

> @@ -4460,6 +4473,10 @@ PostgresMain(const char *dbname, const char *username)
>
>  			enable_timeout_after(IDLE_SESSION_TIMEOUT,
>  								 IdleSessionTimeout);
>  		}
> +
> +
> +		if (get_timeout_active(TRANSACTION_TIMEOUT))
> +			disable_timeout(TRANSACTION_TIMEOUT, false);
>  	}
>
>  	/* Report any recently-changed GUC options */

Too many newlines added.

I'm a bit worried about adding ever more branches and function calls for the processing of single statements. We already spend a noticeable percentage of the cycles for a single statement in PostgresMain(); this adds additional overhead.

I'm somewhat inclined to think that we need some redesign here before we add more overhead.

> @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
>  	SetLatch(MyLatch);
>  }
>
> +static void
> +TransactionTimeoutHandler(void)
> +{
> +#ifdef HAVE_SETSID
> +	/* try to signal whole process group */
> +	kill(-MyProcPid, SIGINT);
> +#endif
> +	kill(MyProcPid, SIGINT);
> +}
> +

Why does this use signals instead of just setting the latch like IdleInTransactionSessionTimeoutHandler() etc?

> diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
> index 0081873a72..5229fe3555 100644
> --- a/src/bin/pg_dump/pg_backup_archiver.c
> +++ b/src/bin/pg_dump/pg_backup_archiver.c
> @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
>  	ahprintf(AH, "SET statement_timeout = 0;\n");
>  	ahprintf(AH, "SET lock_timeout = 0;\n");
>  	ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> +	ahprintf(AH, "SET transaction_timeout = 0;\n");

Hm - why is that the right thing to do?

> diff --git a/src/test/isolation/specs/timeouts.spec b/src/test/isolation/specs/timeouts.spec
> index c747b4ae28..a7f27811c7 100644
> --- a/src/test/isolation/specs/timeouts.spec
> +++ b/src/test/isolation/specs/timeouts.spec
> @@ -23,6 +23,9 @@ step sto	{ SET statement_timeout = '10ms'; }
>  step lto	{ SET lock_timeout = '10ms'; }
>  step lsto	{ SET lock_timeout = '10ms'; SET statement_timeout = '10s'; }
>  step slto	{ SET lock_timeout = '10s'; SET statement_timeout = '10ms'; }
> +step tto	{ SET transaction_timeout = '10ms'; }
> +step sleep0	{ SELECT pg_sleep(0.0001) }
> +step sleep10	{ SELECT pg_sleep(0.01) }
>  step locktbl	{ LOCK TABLE accounts; }
>  step update	{ DELETE FROM accounts WHERE accountid = 'checking'; }
>  teardown	{ ABORT; }
> @@ -47,3 +50,5 @@ permutation wrtbl lto update(*)
>  permutation wrtbl lsto update(*)
>  # statement timeout expires first, row-level lock
>  permutation wrtbl slto update(*)
> +# transaction timeout
> +permutation tto sleep0 sleep0 sleep10(*)
> \ No newline at end of file

I don't think this is quite sufficient. I think the test should verify that transaction timeout interacts correctly with statement timeout / idle in tx timeout.

Greetings,

Andres Freund
Re: [PATCH] Add native windows on arm64 support
On Mon, Dec 05, 2022 at 10:14:49AM -0800, Andres Freund wrote:
> On 2022-12-05 14:12:41 +0900, Michael Paquier wrote:
>> With meson gaining in maturity, perhaps that's not the most urgent
>> thing as we will likely remove src/tools/msvc/ soon but I'd rather do
>> that right anyway as much as I can to avoid an incorrect state in the
>> tree at any time in its history.
>
> I'd actually argue that we should just not add win32 support to
> src/tools/msvc/.

Are you arguing about ripping out the existing win32 support or doing nothing for arm64? I guess the latter, but these words also sound like the former to me.
--
Michael
Re: Transaction timeout
Thanks for looking into this Andres!

On Mon, Dec 5, 2022 at 3:07 PM Andres Freund wrote:
>
> I'm a bit worried about adding ever more branches and function calls for
> the processing of single statements. We already spend a noticeable
> percentage of the cycles for a single statement in PostgresMain(), this
> adds additional overhead.
>
> I'm somewhat inclined to think that we need some redesign here before we
> add more overhead.
>

We can cap statement_timeout/idle_session_timeout by the budget of transaction_timeout left. Either way, we can use the batch function enable_timeouts() instead of enable_timeout_after(). Does any of this make sense?

>
> > @@ -1360,6 +1363,16 @@ IdleInTransactionSessionTimeoutHandler(void)
> > 	SetLatch(MyLatch);
> > }
> >
> > +static void
> > +TransactionTimeoutHandler(void)
> > +{
> > +#ifdef HAVE_SETSID
> > +	/* try to signal whole process group */
> > +	kill(-MyProcPid, SIGINT);
> > +#endif
> > +	kill(MyProcPid, SIGINT);
> > +}
> > +
>
> Why does this use signals instead of just setting the latch like
> IdleInTransactionSessionTimeoutHandler() etc?

I just copied statement_timeout behaviour. As I understand it, this implementation is preferred if the timeout can catch the backend running at full steam.

> > diff --git a/src/bin/pg_dump/pg_backup_archiver.c
> > b/src/bin/pg_dump/pg_backup_archiver.c
> > index 0081873a72..5229fe3555 100644
> > --- a/src/bin/pg_dump/pg_backup_archiver.c
> > +++ b/src/bin/pg_dump/pg_backup_archiver.c
> > @@ -3089,6 +3089,7 @@ _doSetFixedOutputState(ArchiveHandle *AH)
> > 	ahprintf(AH, "SET statement_timeout = 0;\n");
> > 	ahprintf(AH, "SET lock_timeout = 0;\n");
> > 	ahprintf(AH, "SET idle_in_transaction_session_timeout = 0;\n");
> > +	ahprintf(AH, "SET transaction_timeout = 0;\n");
>
> Hm - why is that the right thing to do?

Because transaction_timeout has the effects of statement_timeout.

Thank you!

Best regards, Andrey Borodin.
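To illustrate the batching idea, here is roughly what it could look like with the existing enable_timeouts() API from utils/timeout.h. A sketch only: TRANSACTION_TIMEOUT is the new timeout ID this patch would register, and error paths are omitted:

    #include "utils/timeout.h"

    /* Arm both timeouts with one enable_timeouts() call instead of two
     * enable_timeout_after() calls. */
    static void
    arm_statement_timeouts(int statement_timeout_ms, int transaction_timeout_ms)
    {
        EnableTimeoutParams timeouts[2];
        int         ntimeouts = 0;

        if (statement_timeout_ms > 0)
        {
            timeouts[ntimeouts].id = STATEMENT_TIMEOUT;
            timeouts[ntimeouts].type = TMPARAM_AFTER;
            timeouts[ntimeouts].delay_ms = statement_timeout_ms;
            ntimeouts++;
        }
        if (transaction_timeout_ms > 0)
        {
            timeouts[ntimeouts].id = TRANSACTION_TIMEOUT;   /* from this patch */
            timeouts[ntimeouts].type = TMPARAM_AFTER;
            timeouts[ntimeouts].delay_ms = transaction_timeout_ms;
            ntimeouts++;
        }
        if (ntimeouts > 0)
            enable_timeouts(timeouts, ntimeouts);
    }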
Re: Error-safe user functions
Hi, On 2022-12-05 16:40:06 -0500, Tom Lane wrote: > +/* > + * errsave_start --- begin a "safe" error-reporting cycle > + * > + * If "context" isn't an ErrorSaveContext node, this behaves as > + * errstart(ERROR, domain), and the errsave() macro ends up acting > + * exactly like ereport(ERROR, ...). > + * > + * If "context" is an ErrorSaveContext node, but the node creator only wants > + * notification of the fact of a safe error without any details, just set > + * the error_occurred flag in the ErrorSaveContext node and return false, > + * which will cause us to skip the remaining error processing steps. > + * > + * Otherwise, create and initialize error stack entry and return true. > + * Subsequently, errmsg() and perhaps other routines will be called to > further > + * populate the stack entry. Finally, errsave_finish() will be called to > + * tidy up. > + */ > +bool > +errsave_start(void *context, const char *domain) Why is context a void *? > +{ > + ErrorSaveContext *escontext; > + ErrorData *edata; > + > + /* > + * Do we have a context for safe error reporting? If not, just punt to > + * errstart(). > + */ > + if (context == NULL || !IsA(context, ErrorSaveContext)) > + return errstart(ERROR, domain); I don't think we should "accept" !IsA(context, ErrorSaveContext) - that seems likely to hide things like use-after-free. > + if (++errordata_stack_depth >= ERRORDATA_STACK_SIZE) > + { > + /* > + * Wups, stack not big enough. We treat this as a PANIC > condition > + * because it suggests an infinite loop of errors during error > + * recovery. > + */ > + errordata_stack_depth = -1; /* make room on stack */ > + ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE > exceeded"))); > + } This is the fourth copy of this code... > +/* > + * errsave_finish --- end a "safe" error-reporting cycle > + * > + * If errsave_start() decided this was a regular error, behave as > + * errfinish(). Otherwise, package up the error details and save > + * them in the ErrorSaveContext node. > + */ > +void > +errsave_finish(void *context, const char *filename, int lineno, > +const char *funcname) > +{ > + ErrorSaveContext *escontext = (ErrorSaveContext *) context; > + ErrorData *edata = &errordata[errordata_stack_depth]; > + > + /* verify stack depth before accessing *edata */ > + CHECK_STACK_DEPTH(); > + > + /* > + * If errsave_start punted to errstart, then elevel will be ERROR or > + * perhaps even PANIC. Punt likewise to errfinish. > + */ > + if (edata->elevel >= ERROR) > + errfinish(filename, lineno, funcname); I'd put a pg_unreachable() or such after the errfinish() call. > + /* > + * Else, we should package up the stack entry contents and deliver them > to > + * the caller. > + */ > + recursion_depth++; > + > + /* Save the last few bits of error state into the stack entry */ > + if (filename) > + { > + const char *slash; > + > + /* keep only base name, useful especially for vpath builds */ > + slash = strrchr(filename, '/'); > + if (slash) > + filename = slash + 1; > + /* Some Windows compilers use backslashes in __FILE__ strings */ > + slash = strrchr(filename, '\\'); > + if (slash) > + filename = slash + 1; > + } > + > + edata->filename = filename; > + edata->lineno = lineno; > + edata->funcname = funcname; > + edata->elevel = ERROR; /* hide the LOG value used above */ > + > + /* > + * We skip calling backtrace and context functions, which are more > likely > + * to cause trouble than provide useful context; they might act on the > + * assumption that a transaction abort is about to occur. 
> + */ This seems like a fair bit of duplicated code. > + * This is the same as InputFunctionCall, but the caller may also pass a > + * previously-initialized ErrorSaveContext node. (We declare that as > + * "void *" to avoid including miscnodes.h in fmgr.h.) It seems way cleaner to forward declare ErrorSaveContext instead of using void *. > If escontext points > + * to an ErrorSaveContext, any "safe" errors detected by the input function > + * will be reported by filling the escontext struct. The caller must > + * check escontext->error_occurred before assuming that the function result > + * is meaningful. I wonder if we shouldn't instead make InputFunctionCallSafe() return a boolean and return the Datum via a pointer. As callers are otherwise going to need to do SAFE_ERROR_OCCURRED(escontext) themselves, I think it should also lead to more concise (and slightly more efficient) code. > +Datum > +InputFunctionCallSafe(FmgrInfo *flinfo, char *str, > +
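For readers following along, call sites in datatype input functions would presumably end up looking something like this rough sketch. The macro spellings were still being settled in this thread, and the "widget" type and parse_widget() helper are made up:

    Datum
    widget_in(PG_FUNCTION_ARGS)
    {
        char       *str = PG_GETARG_CSTRING(0);
        int32       result;

        /* parse_widget() and the "widget" type are hypothetical */
        if (!parse_widget(str, &result))
            ereturn(fcinfo->context, (Datum) 0,
                    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("invalid input syntax for type %s: \"%s\"",
                            "widget", str)));

        PG_RETURN_INT32(result);
    }

On a soft error the function returns a dummy Datum and the caller inspects the ErrorSaveContext; with no such context supplied, the error is thrown as a normal ERROR.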
Re: [PATCH] Add native windows on arm64 support
Hi, On 2022-12-06 08:31:16 +0900, Michael Paquier wrote: > On Mon, Dec 05, 2022 at 10:14:49AM -0800, Andres Freund wrote: > > On 2022-12-05 14:12:41 +0900, Michael Paquier wrote: > >> With meson gaining in maturity, perhaps that's not the most urgent > >> thing as we will likely remove src/tools/msvc/ soon but I'd rather do > >> that right anyway as much as I can to avoid an incorrect state in the > >> tree at any time in its history. > > > > I'd actually argue that we should just not add win32 support to > > src/tools/msvc/. > > Are you arguing about ripping out the existing win32 or do nothing for > arm64? I guess the later, but these words also sound like the former > to me. Yes, that's a typo. I meant that we shouldn't add arm64 support to src/tools/msvc/. I do think we should rip out src/tools/msvc/ soon-ish, but we need buildfarm support first. Greetings, Andres Freund
Re: Perform streaming logical transactions by background workers and parallel apply
Here are my review comments for patch v55-0002

==
.../replication/logical/applyparallelworker.c

1. pa_can_start

@@ -276,9 +278,9 @@ pa_can_start(TransactionId xid) /* * Don't start a new parallel worker if user has set skiplsn as it's * possible that user want to skip the streaming transaction. For - * streaming transaction, we need to spill the transaction to disk so that - * we can get the last LSN of the transaction to judge whether to skip - * before starting to apply the change. + * streaming transaction, we need to serialize the transaction to a file + * so that we can get the last LSN of the transaction to judge whether to + * skip before starting to apply the change. */ if (!XLogRecPtrIsInvalid(MySubscription->skiplsn)) return false;

I think the wording change may belong in patch 0001 because it has nothing to do with partial serializing.

~~~

2. pa_free_worker

+ /* + * Stop the worker if there are enough workers in the pool. + * + * XXX The worker is also stopped if the leader apply worker needed to + * serialize part of the transaction data due to a send timeout. This is + * because the message could be partially written to the queue due to send + * timeout and there is no way to clean the queue other than resending the + * message until it succeeds. To avoid complexity, we directly stop the + * worker in this case. + */ + if (winfo->serialize_changes || + napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

Don't need to say "due to send timeout" 2 times in 2 sentences.

SUGGESTION
XXX The worker is also stopped if the leader apply worker needed to serialize part of the transaction data due to a send timeout. This is because the message could be partially written to the queue but there is no way to clean the queue other than resending the message until it succeeds. Directly stopping the worker avoids needing this complexity.

~~~

3. pa_spooled_messages

Previously I suggested this function name should be changed but that was rejected (see [1] #6a)

> 6a.
> IMO a better name for this function would be
> pa_apply_spooled_messages();

Not sure about this.

~

FYI, the reason for the previous suggestion is that there is no verb in the current function name, so the reader is left thinking pa_spooled_messages "what"? It means the caller has to have extra comments like:

/* Check if changes have been serialized to a file. */
pa_spooled_messages();

OTOH, if the function was called something better -- e.g. pa_check_for_spooled_messages() or similar -- then it would be self-explanatory.

~

4.

/* + * Replay the spooled messages in the parallel apply worker if the leader apply + * worker has finished serializing changes to the file. + */ +static void +pa_spooled_messages(void)

I'm not 100% sure of the logic, so IMO maybe the comment should say a bit more about how this works: Specifically, let's say there was some timeout and the LA needed to write the spool file, then let's say the PA timed out and found itself inside this function. Now, let's say the LA is still busy writing the file -- so what happens next? Does this function simply return, then the main PA loop waits again, then it times out again, then PA finds itself back inside this function again... and that keeps happening over and over until eventually the spool file is found FS_READY? Some explanatory comments might help.

~

5.

+ /* + * Check if changes have been serialized to a file. if so, read and apply + * them.
+ */ + SpinLockAcquire(&MyParallelShared->mutex); + fileset_state = MyParallelShared->fileset_state; + SpinLockRelease(&MyParallelShared->mutex); "if so" -> "If so" ~~~ 6. pa_send_data + * + * If the attempt to send data via shared memory times out, then we will switch + * to "PARTIAL_SERIALIZE mode" for the current transaction to prevent possible + * deadlocks with another parallel apply worker (refer to the comments atop + * applyparallelworker.c for details). This means that the current data and any + * subsequent data for this transaction will be serialized to a file. */ void pa_send_data(ParallelApplyWorkerInfo *winfo, Size nbytes, const void *data) SUGGESTION (minor comment rearranging) If the attempt to send data via shared memory times out, then we will switch to "PARTIAL_SERIALIZE mode" for the current transaction -- this means that the current data and any subsequent data for this transaction will be serialized to a file. This is done to prevent possible deadlocks with another parallel apply worker (refer to the comments atop applyparallelworker.c for details). ~ 7. + /* + * Take the stream lock to make sure that the parallel apply worker + * will wait for the leader to release the stream lock until the + * end of the transaction. + */ + pa_lock_stream(winfo->shared->xid, AccessExclusiveLock); The comment doesn't sound right. "until the end" -> "at the end" (??) ~~~ 8. pa_stream_abort @@ -1374,6 +1470,7 @@ pa_stream_abort(LogicalRepStreamAbortData *abort_da