Re: psql: Add leakproof field to \dAo+ meta-command results

2024-07-29 Thread Erik Wienhold
On 2024-07-01 15:08 +0200, Yugo NAGATA wrote:
> I would like to propose adding a new field to psql's \dAo+ meta-command
> to show whether the underlying function of an operator is leak-proof.

+1 for making that info easily accessible.

> This idea is inspired by [1], which claims some indexes use non-LEAKPROOF
> functions under the associated operators; as a result, they cannot be selected
> for queries with security_barrier views or row-level security policies.
> The original proposal was to add a query over system catalogs for looking up
> non-leakproof operators to the documentation, but I thought it more useful
> to improve the \dAo results rather than putting such a query in the docs.
> 
> The attached patch adds the field to \dAo+ and also a description that
> explains the relation between indexes and security quals, referencing the
> \dAo+ meta-command.
> 
> [1] 
> https://www.postgresql.org/message-id/raw/5af3bf0c-5e0c-4128-81dc-084c5258b1af%40code406.com

\dAo+ output looks good.

But this patch fails regression tests in src/test/regress/sql/psql.sql
(\dAo+ btree float_ops) because of the new leak-proof column.  I think
this could even be changed to "\dAo+ btree array_ops|float_ops" to also
cover operators that are not leak-proof.

+
+For example, an index scan can not be selected for queries with

I checked the docs and "cannot" is more commonly used than "can not".

+security_barrier views or row-level security policies if an
+operator used in the WHERE clause is associated with the
+operator family of the index, but its underlying function is not marked
+LEAKPROOF. The psql program's
+\dAo+ meta-command is useful for listing the operators
+with associated operator families and whether it is leak-proof.
+

I think the last sentence can be improved.  How about: "Use psql's \dAo+
command to list operator families and tell which of their operators are
marked as leak-proof."?  Should something similar be added to [1] which
also talks about leak-proof operators?

The rest is just formatting nitpicks:

+ ", ofs.opfname AS \"%s\"\n,"

The trailing comma should come before the newline.

+ "  CASE\n"
+ "   WHEN p.proleakproof THEN '%s'\n"
+ "   ELSE '%s'\n"
+ " END AS \"%s\"\n",

WHEN/ELSE/END should be indented with one additional space to be
consistent with the other CASE expressions in this query.
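
Taken together, the fixed fragment would then look something like:

  ", ofs.opfname AS \"%s\",\n"
  "  CASE\n"
  "    WHEN p.proleakproof THEN '%s'\n"
  "    ELSE '%s'\n"
  "  END AS \"%s\"\n",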

[1] https://www.postgresql.org/docs/devel/planner-stats-security.html

-- 
Erik




Re: Remove last traces of HPPA support

2024-07-29 Thread Heikki Linnakangas

On 30/07/2024 00:50, Thomas Munro wrote:

On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:

Thomas Munro  writes:

Here are some experimental patches to try out some ideas mentioned
upthread, that are approximately unlocked by that cleanup.


FWIW, I'm good with getting rid of --disable-spinlocks and
--disable-atomics.  That's a fair amount of code and needing to
support it causes problems, as you say.  I am very much less
excited about ripping out our spinlock and/or atomics code in favor
of <stdatomic.h>; I just don't see the gain there, and I do see risk
in ceding control of the semantics and performance of those
primitives.


OK, <stdatomic.h> part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].


Looks good to me.


I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_memory_barrier(), no?


Agreed, arch-x86.h is quite clear on that.
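
For reference, the mappings in question are roughly (paraphrasing
arch-x86.h from memory, so double-check the exact macro names):

    /*
     * x86 doesn't reorder loads with other loads, or stores with other
     * stores, so read/write barriers only need to stop the *compiler*
     * from reordering; no fence instruction is required.
     */
    #define pg_read_barrier_impl()   pg_compiler_barrier_impl()
    #define pg_write_barrier_impl()  pg_compiler_barrier_impl()

Only a full pg_memory_barrier() needs to generate a real instruction there.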


Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too? 


Aha, yes I think that's it. Apparently, __x86_64__ is not defined on 
MSVC. To prove that, I added garbage to the "#ifdef __x86_64__" guarded 
block in atomics.h. The compilation passes on MSVC, but not on other 
platforms: https://cirrus-ci.com/build/6310061188841472.


That means that we're not getting the x86-64 instructions in 
src/port/pg_crc32c_sse42.c on MSVC either.


I think we should do:

#ifdef _M_AMD64
#define __x86_64__
#endif

somewhere, perhaps in src/include/port/win32.h.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Restart pg_usleep when interrupted

2024-07-29 Thread Sami Imseih



> On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot  
> wrote:
> 
> Hi,
> 
> On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
>> I am attaching v3 of the patch which addresses the comments made
>> earlier by Bertrand about the comment in the patch [1].
> 
> Thanks!
> 
> Looking at it:
> 
> 1 ===
> 
> +   struct instr_time start_time;
> 
> I think we can get rid of the "struct" keyword here.
> 
> 2 ===
> 
> +   struct instr_time current_time;
> +   struct instr_time elapsed_time;
> 
> Same as above.

Will fix those 2.

> 
> 3 ===
> 
> I gave more thoughts and I think it can be simplified a bit to reduce the
> number of operations in the while loop.
> 
> What about relying on a "absolute" time that way:
> 
>    instr_time absolute;
>    absolute.ticks = start_time.ticks + msec * 100;
> 
> and then in the while loop:
> 
>    while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
>    {
>        instr_time current_time;
>        INSTR_TIME_SET_CURRENT(current_time);
> 
>        if (current_time.ticks > absolute.ticks)
>        {
>            break;

While I agree this code is cleaner, my hesitation is that we don't
have any other place in which we access .ticks directly, and the
common practice is to use the instr_time.h APIs.
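
Something that stays within those APIs would look roughly like this (just
a sketch of the shape, not the patch itself):

    void
    pg_usleep(long microsec)
    {
        struct timespec delay;
        struct timespec remain;
        instr_time  start_time;

        delay.tv_sec = microsec / 1000000L;
        delay.tv_nsec = (microsec % 1000000L) * 1000;

        INSTR_TIME_SET_CURRENT(start_time);

        while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        {
            instr_time  elapsed;

            INSTR_TIME_SET_CURRENT(elapsed);
            INSTR_TIME_SUBTRACT(elapsed, start_time);

            /* stop once the originally requested time has elapsed */
            if (INSTR_TIME_GET_NANOSEC(elapsed) >= (int64) microsec * 1000)
                break;

            /* otherwise continue with what nanosleep() says remains */
            delay = remain;
        }
    }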


What do you think?


Regards,

Sami 






Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 1:51 PM Daniel Gustafsson  wrote:
> Together with a colleague we found the Azure provider uses "verification_url"
> rather than xxx_uri.

Yeah, I think that's originally a Google-ism. (As far as I can tell
they helped author the spec for this and then didn't follow it. :/ ) I
didn't recall Azure having used it back when I was testing against it,
though, so that's good to know.

> Another discrepancy is that it uses a string for the
> interval (ie: "interval":"5").

Oh, that's a new one. I don't remember needing to hack around that
either; maybe iddawc handled it silently?

> One can of course argue that Azure is wrong and
> should feel bad, but I fear that virtually all (major) providers will have
> differences like this, so we will have to deal with it in an extensible 
> fashion
> (compile time, not runtime configurable).

Such is life... verification_url we will just have to deal with by
default, I think, since Google does/did it too. Not sure about
interval -- but do we want to make our distribution maintainers deal
with a compile-time setting for libpq, just to support various OAuth
flavors? To me it seems like we should just hold our noses and support
known (large) departures in the core.

> I was toying with making the json_field name member an array, to allow
> variations.  That won't help with the field type differences though, so another
> train of thought was to have some form of REQUIRED_XOR where fields can be tied
> together.  What do you think about something along these lines?

If I designed it right, just adding alternative spellings directly to
the fields list should work. (The "required" check is by struct
member, not name, so both spellings can point to the same
destination.) The alternative typing on the other hand might require
something like a new sentinel "type" that will accept both... I hadn't
expected that.
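
To illustrate with the device authz fields (same reconstruction as the
snippet you quoted; treat it as a sketch rather than the final hunk):

  {"verification_uri", JSON_TOKEN_STRING, {&authz->verification_uri}, REQUIRED},
  {"verification_url", JSON_TOKEN_STRING, {&authz->verification_uri}, REQUIRED},

Both names target the same struct member, and since the REQUIRED bookkeeping
is keyed on the member rather than the JSON name, whichever spelling the
provider sends satisfies it.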

> Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
> even if we don't use them to ensure that the JSON is wellformed?  If the JSON
> we get is malformed in any way it seems like the safe/conservative option to
> error out.

Good, I was hoping to have a conversation about that. I am fine with
either option in principle. In practice I expect to add code to use
`expires_in` (so that we can pass it to custom OAuth hook
implementations) and `scope` (to check if the server has changed it on
us).

That leaves the provider... Forcing the provider itself to implement
unused stuff in order to interoperate seems like it could backfire on
us, especially since IETF standardized an alternate .well-known URI
[1] that changes some of these REQUIRED things into OPTIONAL. (One way
for us to interpret this: those fields may be required for OpenID, but
your OAuth provider might not be an OpenID provider, and our code
doesn't require OpenID.) I think we should probably tread lightly in
that particular case. Thoughts on that?

Thanks!
--Jacob

[1] https://www.rfc-editor.org/rfc/rfc8414.html




Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Masahiko Sawada
Hi,

On Mon, Jul 29, 2024 at 12:12 AM Anthonin Bonnefoy
 wrote:
>
> On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  
> wrote:
> > An alternative idea would
> > be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> > write its messages there. This is somewhat similar to what we do in
> > the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> > much but I think it could be better than writing logs in the single
> > format.
>

I have one comment on 0001 patch:

 /*
  * Calculate the difference in the Page Hit/Miss/Dirty that
  * happened as part of the analyze by subtracting out the
  * pre-analyze values which we saved above.
  */
-AnalyzePageHit = VacuumPageHit - AnalyzePageHit;
-AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss;
-AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty;
+memset(&bufferusage, 0, sizeof(BufferUsage));
+BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
+
+total_blks_hit = bufferusage.shared_blks_hit +
+    bufferusage.local_blks_hit;
+total_blks_read = bufferusage.shared_blks_read +
+    bufferusage.local_blks_read;
+total_blks_dirtied = bufferusage.shared_blks_dirtied +
+    bufferusage.local_blks_dirtied;

The comment should also be updated or removed.

And here are some comments on 0002 patch:

-   TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
+   delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

I think that this change is to make vacuum code consistent with
analyze code, particularly the following part:

/*
 * We do not expect an analyze to take > 25 days and it simplifies
 * things a bit to use TimestampDifferenceMilliseconds.
 */
delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

However, as the above comment says, delay_in_ms can only represent
durations of up to about 25 days. I guess that is not a problem for
analyze, but it could be for vacuum, since vacuum can sometimes run for a
very long time; I've seen vacuums run for almost a month. So I think we
should keep this part.
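
(For context: TimestampDifferenceMilliseconds() clamps its result to
INT_MAX milliseconds, which is where the ~25 days comes from:

    2147483647 ms / (1000 * 60 * 60 * 24) ~= 24.86 days

so a longer-running vacuum would simply report a capped duration.)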

---
 /* measure elapsed time iff autovacuum logging requires it */
-if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+if (instrument)

The comment should also be updated.

---
Could you split the 0002 patch into two patches? One would have the
ANALYZE command (with the VERBOSE option) write the buffer usage, and the
second one would add WAL usage to both the ANALYZE command and
autoanalyze. I think adding WAL usage to ANALYZE could be
controversial, as it should not be a WAL-intensive operation, so I'd like
to keep them separate.


> I've tested this approach, it definitely looks better. I've added a
> logbuf StringInfo to AcquireSampleRowsFunc which will receive the
> logs. elevel was removed as it is not used anymore. Since everything
> is in the same log line, I've removed the relation name in the acquire
> sample functions.
>
> For partitioned tables, I've also added which child table is currently
> being sampled. The output will look like:
>
> INFO:  analyze of table "postgres.public.test_partition"
> Sampling rows from child "public.test_partition_1"
> pages: 5 of 5 scanned
> tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
> estimated total tuples
> Sampling rows from child "public.test_partition_2"
> pages: 5 of 5 scanned
> tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
> estimated total tuples
> avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
> ...
>
> For a file_fdw, the output will be:
>
> INFO:  analyze of table "postgres.public.pglog"
> tuples: 60043 tuples; 3 tuples in sample
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> ...

Thank you for updating the patch. With your patch, I got the following
logs for when executing ANALYZE VERBOSE on a partitioned table:

postgres(1:3971560)=# analyze (verbose) p;
INFO:  analyzing "public.p" inheritance tree
INFO:  analyze of table "postgres.public.p"
Sampling rows from child "public.c1"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
Sampling rows from child "public.c2"
pages: 1 of 14750 scanned
tuples: 226 live tuples, 0 are dead; 1 tuples in sample,
500 estimated total tuples
Sampling rows from child "public.c3"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
avg read rate: 335.184 MB/s, avg write rate: 0.031 MB/s
buffer usage: 8249 hits, 21795 reads, 2 dirtied
WAL usage: 6 records, 1 

Re: Incremental View Maintenance, take 2

2024-07-29 Thread Kirill Reshke
On Sat, 27 Jul 2024 at 13:26, Kirill Reshke  wrote:
>
> Hi!
> Cloudberry DB (a Greenplum fork) uses the IMMV feature for its AQUMV (auto
> query use matview) feature, so I got interested in how it is implemented.
>
> On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA  wrote:
> >
> > I updated the patch to bump up the version numbers in psql and pg_dump codes
> > from 17 to 18.
>
> Few suggestions:
>
> 1) The `Add-relisivm-column-to-pg_class-system-catalog` commit message
> should be fixed; there is "isimmv" in the last line.
> 2) I don't get why `Add-Incremental-View-Maintenance-support.patch`
> goes after 0005 & 0004. Shouldn't we first implement the feature
> server-side, and only then the client side (psql & pg_dump)?
> 3) Can we provide regression tests for each function separately? Test
> for the main feature in the main patch, test for DISTINCT support in
> v34-0007-Add-DISTINCT-support-for-IVM.patch, etc.? This way the patchset
> will be easier to review, and can be committed separately.
> 4) v34-0006-Add-Incremental-View-Maintenance-support.patch no longer
> applies due to 4b74ebf726d444ba820830cad986a1f92f724649. After
> resolving issues manually, it does not compile, because
> 4b74ebf726d444ba820830cad986a1f92f724649 also removes
> save_userid/save_sec_context fields from ExecCreateTableAs.
>
> >   if (RelationIsIVM(matviewRel) && stmt->skipData)
> Now this function accepts skipData param.
>
> 5) For DISTINCT support patch uses hidden __ivm* columns. Is this
> design discussed anywhere? I wonder if this is a necessity (only
> solution) or if there are alternatives.
> 6) What are the caveats of supporting some simple cases for aggregate
> functions, like in this example?
> ```
> regress=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_2 AS SELECT
> sum(j) + sum(i) from mv_base_a;
> ERROR:  expression containing an aggregate in it is not supported on
> incrementally maintainable materialized view
> ```
> I can see some difficulties with division, e.g. CREATE IMMV AS SELECT
> 1/sum(i) from mv_base_a; (the sum(i) == 0 case), but addition &
> multiplication should be OK, shouldn't they?
>
>
> Overall, the patchset looks mature; however, it is far from being
> committable due to lack of testing/feedback/discussion. There is only
> one way to fix this... Test and discuss it!
>
>
> [1] https://github.com/cloudberrydb/cloudberrydb

Hi! Small update: I tried to run the regression tests and all
IMMV-related tests failed on my VM. Maybe I'm doing something wrong; I
will try to investigate.

Another suggestion: support for the \d and \d+ commands in psql. With the
v34 patchset applied, psql does not show anything IMMV-related in \d output.

```
reshke=# \d m1
   Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 i  | integer |   |  |
Distributed by: (i)


reshke=# \d+ m1
 Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default | Storage |
Compression | Stats target | Description
+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain   |
|  |
View definition:
 SELECT t1.i
   FROM t1;
Distributed by: (i)
Access method: heap

```

Output should be 'Incrementally materialized view "public.m1"' IMO.
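
In describe.c that would presumably be something like this (hypothetical
sketch; I'm guessing at the tableinfo member the patchset exposes for
relisivm):

    if (tableinfo.relkind == RELKIND_MATVIEW && tableinfo.isivm)
        printfPQExpBuffer(&title,
                          _("Incrementally materialized view \"%s.%s\""),
                          schemaname, relationname);
    else if (tableinfo.relkind == RELKIND_MATVIEW)
        printfPQExpBuffer(&title, _("Materialized view \"%s.%s\""),
                          schemaname, relationname);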




Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Tom Lane
Dean Rasheed  writes:
> On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
>> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
>> the benefits of simpler code.

> Looking at that other thread that you found [1], I think it's entirely
> possible that there are people who care about 32-bit systems, which
> means that we might well get complaints, if we make it slower for
> them. Unfortunately, I don't have any way to test that (I doubt that
> running a 32-bit executable on my x86-64 system is a realistic test).

I think we've already done things that might impact 32-bit systems
negatively (5e1f3b9eb for instance), and not heard a lot of pushback.
I would argue that anyone still running PG on 32-bit must have pretty
minimal performance requirements, so that they're unlikely to care if
numeric_mul gets slightly faster or slower.  Obviously a *big*
performance drop might get pushback.

regards, tom lane




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 5:02 AM Peter Eisentraut  wrote:
> We should take the check for exit() calls from libpq and expand it to
> cover the other libraries as well.  Maybe there are other problems like
> this?

Seems reasonable, yeah.

> But under what circumstances does "the linker doesn't strip out" happen?
>   If this happens accidentally, then we should have seen some buildfarm
> failures or something?

On my machine, for example, I see differences with optimization
levels. Say you inadvertently call pfree() in a _shlib build, as I did
multiple times upthread. By itself, that shouldn't actually be a
problem (it eventually redirects to free()), so it should be legal to
call pfree(), and with -O2 the build succeeds. But with -Og, the
exit() check trips, and when I disassemble I see that pg_malloc() et
al. have infected the shared object. After all, we did tell the linker
to put that object file in, and we don't ask it to garbage-collect
sections.

> Also, one could look further and notice that restricted_token.c and
> sprompt.c both a) are not needed by libpq and b) can trigger exit()
> calls.  Then it's not clear why those are not affected.

I think it's easier for the linker to omit whole object files rather
than partial ones. If libpq doesn't use any of those APIs there's not
really a reason to trip over it.

(Maybe the _shlib variants should just contain the minimum objects
required to compile.)

> I'm reminded of thread [0].  I think there is quite a bit of confusion
> about the pqexpbuffer vs. stringinfo APIs, and they are probably used
> incorrectly quite a bit.  There are now also programs that use both of
> them!  This patch now introduces another layer on top of them.  I fear,
> at the end, nobody is going to understand any of this anymore.

"anymore"? :)

In all seriousness -- I agree that this isn't sustainable. At the
moment the worst pain (the new layer) is isolated to jsonapi.c, which
seems like an okay place to try something new, since there aren't that
many clients. But to be honest I'm not excited about deciding the Best
Way Forward based on a sample size of JSON.

> Also,
> changing all the programs to link in libpq for pqexpbuffer seems like
> the opposite direction from what was suggested in [0].

(I don't really want to keep that new libpq dependency. We'd just have
to decide where PQExpBuffer is going to go if we're not okay with it.)

> I think we need to do some deeper thinking here about how we want the
> memory management on the client side to work.  Maybe we could just use
> one API but have some flags or callbacks to control the out-of-memory
> behavior.

Any src/common code that needs to handle both in-band and out-of-band
failure modes will still have to decide whether it's going to 1)
duplicate code paths or 2) just act as if in-band failures can always
happen. I think that's probably essential complexity; an ideal API
might make it nicer to deal with but it can't abstract it away.
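
To illustrate the two failure styles as they exist today (handle_oom() is
a stand-in for whatever the caller would do):

    /* In-band: OOM is recorded in the buffer; the caller must check. */
    PQExpBuffer buf = createPQExpBuffer();

    appendPQExpBufferStr(buf, "hello");
    if (PQExpBufferBroken(buf))
        handle_oom();

    /* Out-of-band: the append can't fail from the caller's perspective;
     * on OOM the underlying allocator throws ERROR (backend) or exits
     * (frontend). */
    StringInfoData si;

    initStringInfo(&si);
    appendStringInfoString(&si, "hello");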

Thanks!
--Jacob




Re: ecdh support causes unnecessary roundtrips

2024-07-29 Thread Daniel Gustafsson
> On 17 Jun 2024, at 19:56, Andres Freund  wrote:
> On 2024-06-17 19:51:45 +0200, Daniel Gustafsson wrote:

>> Changing the default of the ecdh GUC would perhaps be doable?
> 
> I was wondering whether we could change the default so that it accepts both
> x25519 and secp256r1. Unfortunately that seems to require changing what we
> use to set the parameter...

Right.  The patch in https://commitfest.postgresql.org/48/5025/ does allow for
accepting both but that's a different discussion.

Changing, and backpatching, the default to at least keep new installations from
extra roundtrips doesn't seem that far off in terms of scope from what
860fe27ee1e2 backpatched.  Maybe it can be an option.

>> Amending the documentation is the one thing we certainly can do but 99.9% of
>> affected users won't know they are affected so won't look for that section.
> 
> Yea. It's also possible that some other bindings changed their default to
> match ours...

There is that possibility, though I think we would've heard something about
that by now if that had happened.

--
Daniel Gustafsson





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed
On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>
> I like these simplifications, how `var2ndigits` is used instead of 
> `res_ndigits`:
> -   for (int i = res_ndigits - 3; i >= 1; i--)
> +   for (int i = var2ndigits - 1; i >= 1; i--)
>
> But I wonder why does `case 1:`  not follow the same pattern?
> for (int i = res_ndigits - 2; i >= 0; i--)
>

Ah yes, that should be made the same. (I think I did do that at one
point, but then accidentally reverted it during a code refactoring.)

> * v3-0002
>
> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
> the benefits of simpler code.
>
> You brought up the question if 32-bit systems should be regarded
> as legacy previously in this thread.
>
> Unfortunately, we didn't get any feedback, so I'm starting a separate
> thread, with subject "Is fast 32-bit code still important?", hoping to get
> more input to help us make judgement calls.
>

Looking at that other thread that you found [1], I think it's entirely
possible that there are people who care about 32-bit systems, which
means that we might well get complaints, if we make it slower for
them. Unfortunately, I don't have any way to test that (I doubt that
running a 32-bit executable on my x86-64 system is a realistic test).

Regards,
Dean

[1] https://postgr.es/m/0a71b43129fb447988f152941e1db...@nidsa.net




Re: Remove last traces of HPPA support

2024-07-29 Thread Thomas Munro
On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Here are some experimental patches to try out some ideas mentioned
> > upthread, that are approximately unlocked by that cleanup.
>
> FWIW, I'm good with getting rid of --disable-spinlocks and
> --disable-atomics.  That's a fair amount of code and needing to
> support it causes problems, as you say.  I am very much less
> excited about ripping out our spinlock and/or atomics code in favor
of <stdatomic.h>; I just don't see the gain there, and I do see risk
> in ceding control of the semantics and performance of those
> primitives.

OK, <stdatomic.h> part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].

I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_memory_barrier(), no?
Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too?  For ARM, from a quick look, the only way to
reach real acquire/release barriers seems to be to use the C11
interface (which would also be fine on x86 where it should degrade to
a no-op compiler barrier or signal fence as the standard calls it),
but IIRC the Windows/ARM basics haven't gone in yet anyway.

[1] 
https://www.postgresql.org/message-id/flat/721bf39a-ed8a-44b0-8b8e-be3bd81db748%40technowledgy.de#66ba381b05e8ee08b11503b846acc4a1
From 33533817949052f7af423aaee0ef6e737031effb Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 2 Jul 2024 14:47:59 +1200
Subject: [PATCH v2 1/4] Remove --disable-spinlocks.

A later change will require atomic support, so it wouldn't make sense
for a new system not to be able to implement true spinlocks.

Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
---
 configure   |  40 --
 configure.ac|  13 --
 doc/src/sgml/installation.sgml  |  37 +
 meson.build |   6 -
 src/backend/port/atomics.c  |  26 
 src/backend/port/posix_sema.c   |   3 +-
 src/backend/port/sysv_sema.c|   3 +-
 src/backend/postmaster/launch_backend.c |   8 --
 src/backend/storage/ipc/ipci.c  |  10 --
 src/backend/storage/lmgr/Makefile   |   1 -
 src/backend/storage/lmgr/meson.build|   1 -
 src/backend/storage/lmgr/s_lock.c   |   2 +-
 src/backend/storage/lmgr/spin.c | 180 
 src/include/pg_config.h.in  |   3 -
 src/include/pg_config_manual.h  |  15 --
 src/include/port/atomics.h  |   4 +-
 src/include/port/atomics/fallback.h |   4 +-
 src/include/storage/s_lock.h|  39 +
 src/include/storage/spin.h  |  18 +--
 src/test/regress/regress.c  |  86 ---
 20 files changed, 13 insertions(+), 486 deletions(-)
 delete mode 100644 src/backend/storage/lmgr/spin.c

diff --git a/configure b/configure
index ea5514fab1..f8deaa8d78 100755
--- a/configure
+++ b/configure
@@ -836,7 +836,6 @@ enable_integer_datetimes
 enable_nls
 with_pgport
 enable_rpath
-enable_spinlocks
 enable_atomics
 enable_debug
 enable_profiling
@@ -1529,7 +1528,6 @@ Optional Features:
   enable Native Language Support
   --disable-rpath do not embed shared library search path in
   executables
-  --disable-spinlocks do not use spinlocks
   --disable-atomics   do not use atomic operations
   --enable-debug  build with debugging symbols (-g)
   --enable-profiling  build with profiling enabled
@@ -3266,33 +3264,6 @@ fi
 
 
 
-#
-# Spinlocks
-#
-
-
-# Check whether --enable-spinlocks was given.
-if test "${enable_spinlocks+set}" = set; then :
-  enableval=$enable_spinlocks;
-  case $enableval in
-yes)
-  :
-  ;;
-no)
-  :
-  ;;
-*)
-  as_fn_error $? "no argument expected for --enable-spinlocks option" "$LINENO" 5
-  ;;
-  esac
-
-else
-  enable_spinlocks=yes
-
-fi
-
-
-
 #
 # Atomic operations
 #
@@ -12185,17 +12156,6 @@ fi
 
 fi
 
-if test "$enable_spinlocks" = yes; then
-
-$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
-
-else
-  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
-*** Not using spinlocks will cause poor performance." >&5
-$as_echo "$as_me: WARNING:
-*** Not using spinlocks will cause poor performance." >&2;}
-fi
-
 if test "$enable_atomics" = yes; then
 
 $as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 

Re: xid_wraparound tests intermittent failure.

2024-07-29 Thread Masahiko Sawada
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan  wrote:
>
>
> On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
> > On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan  wrote:
> >>
> >> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
> >>
> >> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada  
> >> wrote:
> >>
> >> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan  
> >> wrote:
> >>
> >> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
> >>
> >> See 
> >> 
> >>
> >>
> >> The failure logs are from a run where both tests 1 and 2 failed.
> >>
> >> Thank you for sharing the logs.
> >>
> >> I think that the problem seems to match what Alexander Lakhin
> >> mentioned[1]. Probably we can fix such a race condition somehow but
> >> I'm not sure it's worth it as setting autovacuum = off and
> >> autovacuum_max_workers = 1 (or a low number) is an extremely rare
> >> case. I think it would be better to stabilize these tests. One idea is
> >> to turn the autovacuum GUC parameter on while setting
> >> autovacuum_enabled = off for each table. That way, we can ensure that
> >> autovacuum workers are launched. And I think it seems to align real
> >> use cases.
> >>
> >> Regards,
> >>
> >> [1] 
> >> https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
> >>
> >>
> >> OK, do you want to propose a patch?
> >>
> >> Yes, I'll prepare and share it soon.
> >>
> >> I've attached the patch. Could you please test if the patch fixes the
> >> instability you observed?
> >>
> >> Since we turn off autovacuum on all three tests and we wait for
> >> autovacuum to complete processing databases, these tests potentially
> >> have a similar (but lower) risk. So I modified these tests to turn it
> >> on so we can ensure the autovacuum runs periodically.
> >>
> >>
> >> I assume you actually meant to remove the "autovacuum = off" in 
> >> 003_wraparound.pl. With that change in your patch I retried my test, but 
> >> on iteration 100 out of 100 it failed on test 002_limits.pl.
> >>
> > I think we need to remove the "autovacuum = off" also in 002_limits.pl
> > as it waits for autovacuum to process both template0 and template1
> > databases. Just to be clear, the failure happened even without
> > "autovacuum = off"?
> >
>
> The attached patch, a slight modification of yours, removes "autovacuum
> = off" for all three tests, and given that a set of 200 runs was clean
> for me.

Oh, I missed that I had left "autovacuum = off" in the 002 test for some
reason. Thank you for attaching the patch; it looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira
On Wed, Jul 17, 2024, at 11:37 PM, Amit Kapila wrote:
> I am thinking of transactions between restart_lsn and "consistent
> point lsn" (aka the point after which all commits will be replicated).
> Your conclusion seems correct to me that such transactions won't be
> replicated by streaming replication and would be skipped by logical
> replication. Now, if we can avoid that anyway, we can consider that.

Upon reflection, what I proposed [1] seems more complex and possibly more
error-prone than other available solutions. The recovery step is slow
if the server is idle (which is the case for the test). An idle server
usually doesn't write another WAL record after creating the replication
slots. Since pg_createsubscriber uses the LSN returned by the last
replication slot as recovery_target_lsn, this LSN is ahead of the
current WAL position, and recovery waits until something writes a
WAL record that reaches the target and ends the recovery.

Hayato already mentioned one of the solution in a previous email [2].
AFAICS we can use any solution that creates a WAL record. I tested the
following options: 

\timing
select * from pg_create_logical_replication_slot('pg_createsubscriber', 
'pgoutput', true);
select pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');
select pg_log_standby_snapshot();
select pg_create_restore_point('pg_createsubscriber');

that results in the following output:

      slot_name      |    lsn
---------------------+-----------
 pg_createsubscriber | 0/942DD28
(1 row)

Time: 200.571 ms
 pg_logical_emit_message
-------------------------
 0/942DD78
(1 row)

Time: 0.938 ms
 pg_log_standby_snapshot
-------------------------
 0/942DDB0
(1 row)

Time: 0.741 ms
 pg_create_restore_point
-------------------------
 0/942DE18
(1 row)

Time: 0.870 ms

and generates the following WAL records:

rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DCF0, prev 0/0942DCB8, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: LogicalMessage len (rec/tot): 75/75, tx: 0, lsn: 0/0942DD28, prev 0/0942DCF0, desc: MESSAGE non-transactional, prefix "pg_createsubscriber"; payload (5 bytes): 64 75 6D 6D 79
rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DD78, prev 0/0942DD28, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: XLOG           len (rec/tot): 98/98, tx: 0, lsn: 0/0942DDB0, prev 0/0942DD78, desc: RESTORE_POINT pg_createsubscriber

The options are:

(a) temporary replication slot: requires an additional replication slot.
Small payload. Extremely slow in comparison with the other options.
(b) logical message: can be consumed by logical replication when/if that
is supported some day. Big payload. Fast.
(c) snapshot of running txn: small payload. Fast.
(d) named restore point: biggest payload. Fast.

I don't have a strong preference, but if I had to pick one I would
choose option (c) or option (d). Option (a) is out of the question.

Opinions?

[1] 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Daniel Gustafsson
Thanks for working on this patchset. I'm looking over 0004 and 0005 but came
across one thing I wanted to bring up sooner rather than waiting for the full
review. In parse_device_authz we have this:

  {"user_code", JSON_TOKEN_STRING, {>user_code}, REQUIRED},
  {"verification_uri", JSON_TOKEN_STRING, {>verification_uri}, REQUIRED},

  /*
   * The following fields are technically REQUIRED, but we don't use
   * them anywhere yet:
   *
   * - expires_in
   */

  {"interval", JSON_TOKEN_NUMBER, {>interval_str}, OPTIONAL},

Together with a colleague we found the Azure provider uses "verification_url"
rather than xxx_uri.  Another discrepancy is that it uses a string for the
interval (ie: "interval":"5").  One can of course argue that Azure is wrong and
should feel bad, but I fear that virtually all (major) providers will have
differences like this, so we will have to deal with it in an extensible fashion
(compile time, not runtime configurable).

I was toying with making the json_field name member an array, to allow
variations.  That won't help with the field type differences though, so another
train of thought was to have some form of REQUIRED_XOR where fields can be tied
together.  What do you think about something along these lines?

Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
even if we don't use them to ensure that the JSON is wellformed?  If the JSON
we get is malformed in any way it seems like the safe/conservative option to
error out.

--
Daniel Gustafsson





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches 
>> have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
>> bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Yes, I agree, I just thought it was noteworthy, but not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> --+--+---+---+--
>15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
>16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
>17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
>
>23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
>24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
>25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
>
>31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
>32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
>33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
> 16383 |16383 | 21.629467 |  73.58552 | +240.21%
>
> clang 15.0.7:
>
> 16383 |16383 | 11.562384 |  73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straightforward, and the PRODSUM macros really
improved readability a lot.
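
(For anyone reading along without the patch applied: the macros compute
the anti-diagonal partial-product sums of schoolbook multiplication,
something of this shape -- my paraphrase, not the patch's exact
definitions:

    #define PRODSUM1(v1, i1, v2, i2) \
        ((uint64) (v1)[i1] * (v2)[i2])
    #define PRODSUM2(v1, i1, v2, i2) \
        ((uint64) (v1)[i1] * (v2)[(i2) + 1] + \
         (uint64) (v1)[(i1) + 1] * (v2)[i2])

which is what makes the main loops read so cleanly.)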

I like these simplifications, how `var2ndigits` is used instead of 
`res_ndigits`:
-   for (int i = res_ndigits - 3; i >= 1; i--)
+   for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder why does `case 1:`  not follow the same pattern?
for (int i = res_ndigits - 2; i >= 0; i--)

* v3-0002

I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
the benefits of simpler code.

You brought up the question if 32-bit systems should be regarded
as legacy previously in this thread.

Unfortunately, we didn't get any feedback, so I'm starting a separate
thread, with subject "Is fast 32-bit code still important?", hoping to get
more input to help us make judgement calls.

/Joel




Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Tom Lane
Robert Haas  writes:
> On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
>> ... However, I added a new open item about how the
>> 040_pg_createsubscriber.pl test is slow and still unstable.

> But that said, I see no commits in the commit history which purport to
> improve performance, so I guess the performance is probably still not
> what you want, though I am not clear on the details.

My concern is described at [1]:

>> I have a different but possibly-related complaint: why is
>> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
>> runs for a bit over 19 seconds, which seems completely out of line
>> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
>> other test scripts in this directory take much less).  It looks
>> like most of the blame falls on this step:
>> 
>> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
>> 
>> AFAICS the amount of data being replicated is completely trivial,
>> so that it doesn't make any sense for this to take so long --- and
>> if it does, that suggests that this tool will be impossibly slow
>> for production use.  But I suspect there is a logic flaw causing
>> this.  Speculating wildly, perhaps that is related to the failure
>> Alexander spotted?

The followup discussion in that thread made it sound like there's
some fairly fundamental deficiency in how wait_for_end_recovery()
detects end-of-recovery.  I'm not too conversant with the details
though, and it's possible that pg_createsubscriber is just falling
foul of a pre-existing infelicity.

If the problem can be correctly described as "pg_createsubscriber
takes 10 seconds or so to detect end-of-stream", then it's probably
only an annoyance for testing and not something that would be fatal
in the real world.  I'm not quite sure if that's accurate, though.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed




Re: Refactoring postmaster's code to cleanup after child exit

2024-07-29 Thread Heikki Linnakangas

On 06/07/2024 22:01, Heikki Linnakangas wrote:
Reading through postmaster code, I spotted some refactoring 
opportunities to make it slightly more readable.


Currently, when a child process exits, the postmaster first scans 
through BackgroundWorkerList to see if it was a bgworker process. If not 
found, it scans through the BackendList to see if it was a backend 
process (which it really should be then). That feels a bit silly, 
because every running background worker process also has an entry in 
BackendList. There's a lot of duplication between 
CleanupBackgroundWorker and CleanupBackend.


Before commit 8a02b3d732, we used to create Backend entries only for 
background worker processes that connected to a database, not for other 
background worker processes. I think that's why we have the code 
structure we have. But now that we have a Backend entry for all bgworker 
processes, it's more natural to have a single function to deal with both 
regular backends and bgworkers.


So I came up with the attached patches. This doesn't make any meaningful 
user-visible changes, except for some incidental changes in log messages 
(see commit message for details).


New patch version attached. Fixed conflicts with recent commits, no 
other changes.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 84ca49efab16cc2699f8446684a5ebe63dad1c38 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:16 +0300
Subject: [PATCH v2 1/4] Fix outdated comment; all running bgworkers are in
 BackendList

Before commit 8a02b3d732, only bgworkers that connected to a database
had an entry in the Backendlist. Commit 8a02b3d732 changed that, but
forgot to update this comment.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/include/postmaster/bgworker_internals.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/postmaster/bgworker_internals.h b/src/include/postmaster/bgworker_internals.h
index 9106a0ef3f..61ba54117a 100644
--- a/src/include/postmaster/bgworker_internals.h
+++ b/src/include/postmaster/bgworker_internals.h
@@ -26,14 +26,14 @@
 /*
  * List of background workers, private to postmaster.
  *
- * A worker that requests a database connection during registration will have
- * rw_backend set, and will be present in BackendList.  Note: do not rely on
- * rw_backend being non-NULL for shmem-connected workers!
+ * All workers that are currently running will have rw_backend set, and will
+ * be present in BackendList.
  */
 typedef struct RegisteredBgWorker
 {
 	BackgroundWorker rw_worker; /* its registry entry */
-	struct bkend *rw_backend;	/* its BackendList entry, or NULL */
+	struct bkend *rw_backend;	/* its BackendList entry, or NULL if not
+								 * running */
 	pid_t		rw_pid;			/* 0 if not running */
 	int			rw_child_slot;
 	TimestampTz rw_crashed_at;	/* if not 0, time it last crashed */
-- 
2.39.2

From e541a6d2c8481cd9f9c75fb2328e8e6031cddac6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:56 +0300
Subject: [PATCH v2 2/4] Minor refactoring of assign_backendlist_entry()

Make assign_backendlist_entry() responsible just for allocating the
Backend struct. Linking it to the RegisteredBgWorker is the caller's
responsibility now. Seems more clear that way.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/backend/postmaster/postmaster.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 02442a4b85..a3e9e8fdc0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -416,7 +416,7 @@ static void TerminateChildren(int signal);
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
-static bool assign_backendlist_entry(RegisteredBgWorker *rw);
+static Backend *assign_backendlist_entry(void);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(BackendType type);
@@ -4028,6 +4028,7 @@ MaxLivePostmasterChildren(void)
 static bool
 do_start_bgworker(RegisteredBgWorker *rw)
 {
+	Backend    *bn;
 	pid_t		worker_pid;
 
 	Assert(rw->rw_pid == 0);
@@ -4042,11 +4043,14 @@ do_start_bgworker(RegisteredBgWorker *rw)
 	 * tried again right away, most likely we'd find ourselves hitting the
 	 * same resource-exhaustion condition.
 	 */
-	if (!assign_backendlist_entry(rw))
+	bn = assign_backendlist_entry();
+	if (bn == NULL)
 	{
 		rw->rw_crashed_at = GetCurrentTimestamp();
 		return false;
 	}
+	rw->rw_backend = bn;
+	rw->rw_child_slot = bn->child_slot;
 
 	ereport(DEBUG1,
 			(errmsg_internal("starting background worker process \"%s\"",
@@ -4119,12 +4123,10 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
  * 

Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Andres Freund
On 2024-07-26 13:55:29 +0200, Daniel Gustafsson wrote:
> Thanks for review, I've applied this backpatched all the way.

Thanks for working on this!




Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed
On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>
> Thanks to v3-0002, they are all still significantly faster when both patches 
> have been applied,
> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
> bit slower?
>

There's no obvious reason why 0001 would make those cases slower, but
the fact that, together with 0002, it's a significant net win, and the
gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Something I did notice in my tests was that if ndigits was a small
multiple of 8, the old code was disproportionately faster, which can
be explained by the fact that the computation fits exactly into a
whole number of XMM register operations, with no remaining digits to
process. For example, these results from above:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
--+--+---+---+--
   15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
   16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
   17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%

   23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
   24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
   25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%

   31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
   32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
   33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%

(Note how 16x16 was much faster than 15x15, for example.)

The patched code seems to do a better job at levelling out and coping
with arbitrary-sized inputs, not just those that fit exactly into a
whole number of loops using SSE2 operations.

Something else I noticed was that the relative gains for large numbers
of digits were much higher with clang than with gcc:

gcc 13.3.0:

16383 |16383 | 21.629467 |  73.58552 | +240.21%

clang 15.0.7:

16383 |16383 | 11.562384 |  73.00517 | +531.40%

That seems to be because clang doesn't do a good job of generating
efficient SSE2 code in the old case of 16-bit x 16-bit
multiplications. Looking on godbolt.org, it generates
overly-complicated code using PMULUDQ, which actually does 32-bit x
32-bit multiplications. Gcc, on the other hand, generates a much more
compact loop, using PMULHW and PMULLW, which is much faster. With the
patch, they both generate the same SSE2 code, so the results are
pretty consistent.

Regards,
Dean




Re: tiny step toward threading: reduce dependence on setlocale()

2024-07-29 Thread Peter Eisentraut

On 27.07.24 21:03, Andreas Karlsson wrote:

On 7/26/24 10:35 PM, Jeff Davis wrote:

database_ctype_is_c refers to the LC_CTYPE environment of the database
-- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
database's default collation.

Confusing, I know, but it matters for a few things that still depend on
the LC_CTYPE, such as tsearch and maybe a few extensions. See
f413941f41.


Ah, right! That was a thinko on my part.

The set of patches looks good to me now. There is further refactoring 
that can be done in this area (and should be done, given all the calls e.g. to 
isalpha()) but I think this set of patches improves code readability 
while moving us away from setlocale().


And even if we take a tiny performance hit here, which your tests did 
not measure, I would say it is worth it both due to code clarity and due 
to not relying on thread-unsafe state.


I do not see these patches in the commitfest app but if they were I 
would have marked them as ready for committer.


Here: https://commitfest.postgresql.org/48/5023/

I have also re-reviewed the patches and I agree they are good to go.





Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 21:10, Robert Haas wrote:

On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:

Prior to PG16, postmaster children would manually detach from shared memory
if it was not needed. However, this behavior was removed in fork mode in
commit aafc05d.


Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.


Detaching shared memory when it is no longer needed is beneficial, as
postmaster children (like syslogger) don't wish to take any risk of
accidentally corrupting shared memory. Additionally, any panic in these
processes will not reset shared memory.


+1.


The attached patch addresses this issue by detaching shared memory after
fork_process().


I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.
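
If it was inadvertent, I'd guess the fix is roughly this shape in the fork
path (field names are my guess; launch_backend.c's child-type table knows
whether a given child kind needs shared memory):

    pid = fork_process();
    if (pid == 0)               /* in the child */
    {
        /* ... detach from postmaster resources ... */

        /*
         * Children that never touch shared memory (e.g. syslogger) should
         * detach from it, so a stray write or a PANIC in them can't
         * corrupt or reset it.
         */
        if (!child_process_kinds[child_type].shmem_attach)
        {
            dsm_detach_all();
            PGSharedMemoryDetach();
        }
    }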


Good catch, it was not intentional. The patch looks good to me, so 
committed. Thanks Rui!


--
Heikki Linnakangas
Neon (https://neon.tech)





040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Robert Haas
On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> In hopes of moving things along as we approach the v18 branch,
> I went ahead and pushed Kuroda-san's patches (with a bit of
> further editorialization).  AFAICS that allows closing out
> the concerns raised by Noah, so I've marked that open item
> done.  However, I added a new open item about how the
> 040_pg_createsubscriber.pl test is slow and still unstable.

This open item is now the only open item. It seems to have been open
for a month with no response from Peter which is, from my point of
view, far from ideal. However, another thing that is not ideal is that
we've been using the same thread to discuss every issue related to
this patch for 2.5 years. The thread spans hundreds of messages and it
is by no means obvious to what extent the messages posted after this
one addressed the underlying concern. Perhaps it would have been an
idea to start new threads when we started discussing post-commit
issues, instead of just tagging onto the same one.

But that said, I see no commits in the commit history which purport to
improve performance, so I guess the performance is probably still not
what you want, though I am not clear on the details. And as far as
stability is concerned, I peered through the dense haze of
027_stream_regress-related buildfarm failures for long enough to
discover that the stability issues with 040_pg_createsubscriber aren't
fixed either, because we have these recent buildfarm reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo=2024-07-26%2016%3A02%3A40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder=2024-07-26%2009%3A20%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake=2024-07-25%2002%3A39%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-07-22%2002%3A31%3A32

So, Peter, as the committer responsible for pg_createsubscriber, what
are we going to do about this?

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

Partially replying here to an email on -committers [1].

On 2024-07-29 13:57:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > However, I still don't think it's a problem to assert that the lock is held
> > in the unlock "routine". As mentioned before, the spinlock implementation
> > itself has never followed the "just straight line code" rule that users of
> > spinlocks are supposed to follow.
>
> Yeah, that's fair.

Cool.


On 2024-07-29 13:56:05 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > b) Instead define the spinlock to have 1 as the unlocked state and 0 as the
> >locked state. That makes it a bit harder to understand that 
> > initialization
> >is missing, compared to a dedicated state, as the first use of the 
> > spinlock
> >just blocks.
>
> This option works for me.

> > To make 1) b) easier to understand it might be worth changing the slock_t
> > typedef to be something like
>
> > typedef struct slock_t
> > {
> > char is_free;
> > } slock_t;
>
> +1

Cool. I've attached a prototype.


I just realized there's a nice little advantage to the "inverted"
representation - it detects missing initialization even in optimized builds.


> How much of this would we change across platforms, and how much
> would be x86-only?  I think there are enough people developing on
> ARM (e.g. Mac) now to make it worth covering that, but maybe we
> don't care so much about anything else.

Not sure. Right now I've only hacked up x86-64 (not even touching i386), but
it shouldn't be hard to change at least some additional platforms.

My current prototype requires S_UNLOCK, S_LOCK_FREE, S_INIT_LOCK to be
implemented for x86-64 instead of using the "generic" implementation. That'd
be mildly annoying duplication if we did so for a few more platforms.


It'd be more palatable to just change all platforms if we made more of them
use __sync_lock_test_and_set (or some other intrinsic(s))...

Greetings,

Andres Freund

[1] https://postgr.es/m/2812376.1722275765%40sss.pgh.pa.us
From 06bd992f4be2f46586bfc7bab3135b3a2ca84e72 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 09:48:26 -0700
Subject: [PATCH v2 1/2] Detect unlocking an unlocked spinlock

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/spin.h | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index c0679c59992..7ec32cd816a 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -61,7 +61,22 @@
 
 #define SpinLockAcquire(lock) S_LOCK(lock)
 
-#define SpinLockRelease(lock) S_UNLOCK(lock)
+static inline void
+SpinLockRelease(volatile slock_t *lock)
+{
+	/*
+	 * Check that this backend actually held the spinlock. Implemented as a
+	 * static inline to avoid multiple-evaluation hazards.
+	 *
+	 * Can't assert when using the semaphore fallback implementation - it
+	 * doesn't implement S_LOCK_FREE() - but the fallback triggers failures in
+	 * the case of double-release on its own.
+	 */
+#ifdef HAVE_SPINLOCKS
+	Assert(!S_LOCK_FREE(lock));
+#endif
+	S_UNLOCK(lock);
+}
 
 #define SpinLockFree(lock)	S_LOCK_FREE(lock)
 
-- 
2.45.2.746.g06e570c0df.dirty

From 574d02d3d4d7f38b764ca6a2e4d2617a417ab8a8 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 10:55:11 -0700
Subject: [PATCH v2 2/2] Detect many uses of uninitialized spinlocks on x86-64

On most platforms we use 0 for an unlocked spinlock and 1 for a locked one. As
we often place spinlocks in zero-initialized memory, missing calls to
SpinLockInit() aren't noticed.

To address that, swap the meaning of the spinlock state and use 1 for unlocked
and 0 for locked. That way using an uninitialized spinlock blocks the first
time a lock is acquired. A dedicated state for initialized locks would allow
for a nicer assertion, but ends up with slightly less dense code (a three byte
cmp with an immediate, instead of a 2 byte test).

To make the swapped meaning of the slock_t state easier to understand when
debugging, change the slock_t to be a struct on x86-64, with an "is_free"
member.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/s_lock.h | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 02c68513a53..18291b284b0 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,7 +205,10 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET
 
-typedef unsigned char slock_t;
+typedef struct slock_t
+{
+	char	is_free;
+} slock_t;
 
 #define TAS(lock) tas(lock)
 
@@ -217,21 +220,27 @@ typedef unsigned char slock_t;
  * and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
  * 

Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

On 2024-07-29 21:00:35 +0300, Heikki Linnakangas wrote:
> On 29/07/2024 20:48, Andres Freund wrote:
> > On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> > > Heikki Linnakangas  writes:
> > > > Yeah I'm not worried about that at all. Also, the assert is made when
> > > > you have already released the spinlock; you are already out of the
> > > > critical section.
> > > 
> > > Not in the patch Andres posted.
> > 
> > Which seems fairly fundamental - once outside of the critical section, we
> > can't actually assert that the lock isn't acquired, somebody else *validly*
> > might have acquired it by then.
> 
> You could do:
> 
> bool was_free = S_LOCK_FREE(lock);
> 
> S_UNLOCK(lock);
> Assert(!was_free);

I don't really see the point - we're about to crash with an assertion failure,
why would we want to do that outside of the critical section? If anything that
will make it harder to debug the issue in a core dump, because other backends
might "destroy evidence" due to being able to acquire the spinlock.


> Depending on the underlying implementation, you could also use
> compare-and-exchange.

That'd scale a lot worse, at least on x86-64, as it requires the unlock to be
an atomic op, whereas today it's a simple store (+ compiler barrier).
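To make the cost difference concrete, a rough sketch of the two release
styles, using the is_free representation from the patch upthread
(illustrative only, not the actual s_lock.h code):

/* today: a compiler barrier plus a plain byte store */
static inline void
release_store(volatile slock_t *lock)
{
    __asm__ __volatile__("" ::: "memory");
    lock->is_free = 1;
}

/* CAS variant: catches double-release, but is a full atomic RMW */
static inline void
release_cas(slock_t *lock)
{
    char        expected = 0;   /* caller must hold the lock */

    if (!__atomic_compare_exchange_n(&lock->is_free, &expected, 1,
                                     false, __ATOMIC_RELEASE,
                                     __ATOMIC_RELAXED))
        abort();                /* was already free: double release */
}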

I've experimented with replacing all spinlocks with lwlocks, and the fact that
you need an atomic op for an rwlock release is one of the two major reasons
they have a higher overhead (the remainder is boring stuff like the overhead
of external function calls and ownership management).

Greetings,

Andres Freund




Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:
> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.

Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.

> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.

+1.

> The attached patch addresses this issue by detaching shared memory after
> fork_process().

I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.
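For readers not following the patch, the shape of the proposed change is
roughly this (a sketch only; PGSharedMemoryDetach() and fork_process() are
the existing routines, the predicate is hypothetical):

    pid = fork_process();
    if (pid == 0)               /* in the child */
    {
        /*
         * A child such as syslogger never touches shared memory, so unmap
         * it immediately: a stray write or a PANIC here then cannot
         * corrupt shared memory or force it to be reset.
         */
        if (!child_needs_shmem) /* hypothetical flag */
            PGSharedMemoryDetach();
    }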

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Interrupts vs signals

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 1:24 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I wonder how this works right now. Is there something that limits the
> > number of authentication requests that can be in flight concurrently,
> > or is it completely uncapped (except by machine resources)?
>
> The former.  IIRC, the postmaster won't spawn more than 2X max_connections
> subprocesses (don't recall the exact limit, but it's around there).

Hmm. Not to sidetrack this thread too much, but multiplying by two
doesn't really sound like the right idea to me. The basic idea
articulated in the comment for canAcceptConnections() makes sense:
some backends might fail authentication, or might be about to exit, so
it makes sense to allow for some slop. But 2X is a lot of slop even on
a machine with the default max_connections=100, and people with
connection management problems are likely to be running with
max_connections=500 or max_connections=900 or even (insanely)
max_connections=2000. Continuing with a connection attempt because we
think that hundreds or thousands of connections that are ahead of us
in the queue might clear out of the way before we need a PGPROC is not
a good bet.

I wonder if we ought to restrict this to a small, flat value, like say
50, or by a new GUC that defaults to such a value if a constant seems
problematic. Maybe it doesn't really matter. I'm not sure how much
work we'd save by booting out the doomed connection attempt earlier.
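As a strawman, the "small, flat value" version could look like this
(entirely hypothetical; the current code allows roughly twice
max_connections worth of children):

#define CONNECTION_SLOP 50      /* arbitrary constant, per the above */

static int
MaxLivePostmasterChildren(void)
{
    /* instead of scaling the slop with max_connections */
    return MaxConnections + Min(MaxConnections, CONNECTION_SLOP);
}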

The unlimited number of dead-end backends doesn't sound too great
either. I don't have another idea, but presumably resisting DDOS
attacks and/or preserving resources for things that still have a
chance of working ought to take priority over printing a nicer error
message from a connection that's doomed to fail anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Detect double-release of spinlock

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 20:48, Andres Freund wrote:

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:

Heikki Linnakangas  writes:

Yeah I'm not worried about that at all. Also, the assert is made when
you have already released the spinlock; you are already out of the
critical section.


Not in the patch Andres posted.


Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.


You could do:

bool was_free = S_LOCK_FREE(lock);

S_UNLOCK(lock);
Assert(!was_free);

Depending on the underlying implementation, you could also use 
compare-and-exchange. That makes the assertion-enabled instructions a 
little different than without assertions though.



However, I still don't think it's a problem to assert that the lock is held
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.


Agreed.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 16:42, Joel Jacobson wrote:
> New results with less noise below.
>
> Pardon the exceeding of 80 chars line width,
> but felt important to include commit hash and relative delta.
>
>
> ndigits    |    rate    |  change   |   accum   | commit  |  summary
> -----------+------------+-----------+-----------+---------+----------

I've reviewed the benchmark results, and it looks like v3-0001 made some cases 
a bit slower:

 (32,32)   |  1.786e+06 | -13.27 %  | -11.26 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (32,64)   |  1.119e+06 | -16.72 %  | -20.45 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (32,128)  |  7.242e+05 | -13.55 %  | -9.24 %   | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (64,64)   |  5.515e+05 | -22.34 %  | -24.47 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (64,128)  |  3.204e+05 | -14.83 %  | -12.44 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (128,128) |  1.750e+05 | -16.01 %  | -15.24 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co

Thanks to v3-0002, they are all still significantly faster once both patches
have been applied, but I wonder whether it is expected that v3-0001
temporarily made them a bit slower?

Same cases with v3-0002 applied:

 (32,32)   |  3.408e+06 | +90.80 %  | +69.32 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (32,64)   |  2.356e+06 | +110.63 % | +67.56 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (32,128)  |  1.393e+06 | +92.39 %  | +74.61 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (64,64)   |  1.432e+06 | +159.69 % | +96.14 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (128,128) |  5.567e+05 | +218.07 % | +169.60 % | v3-0002 | Optimise numeric multiplication using base-NBASE^2

/Joel




Re: Detect double-release of spinlock

2024-07-29 Thread Tom Lane
Andres Freund  writes:
> However, I still don't think it's a problem to assert that the lock is held
> in the unlock "routine". As mentioned before, the spinlock implementation
> itself has never followed the "just straight line code" rule that users of
> spinlocks are supposed to follow.

Yeah, that's fair.

regards, tom lane




Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > Yeah I'm not worried about that at all. Also, the assert is made when 
> > you have already released the spinlock; you are already out of the 
> > critical section.
> 
> Not in the patch Andres posted.

Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.

However, I still don't think it's a problem to assert that the lock is held
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.

Greetings,

Andres Freund




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.


+1.  Seems to have been my oversight in 4c3c8c048d.


Removed.

(Aleksander, you forgot to CC the mailing list, but thanks for your 
review too.)


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Detect double-release of spinlock

2024-07-29 Thread Tom Lane
Heikki Linnakangas  writes:
> Yeah I'm not worried about that at all. Also, the assert is made when 
> you have already released the spinlock; you are already out of the 
> critical section.

Not in the patch Andres posted.

regards, tom lane




Re: Interrupts vs signals

2024-07-29 Thread Tom Lane
Robert Haas  writes:
> I wonder how this works right now. Is there something that limits the
> number of authentication requests that can be in flight concurrently,
> or is it completely uncapped (except by machine resources)?

The former.  IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

regards, tom lane




Re: Interrupts vs signals

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 19:56, Robert Haas wrote:

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.


I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?



My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.


Yes, there's a limit, roughly 2x max_connections; see
MaxLivePostmasterChildren().


There's another issue with that that I was about to post in another 
thread, but here goes: the MaxLivePostmasterChildren() limit is shared 
by all regular backends, bgworkers and autovacuum workers. If you open a 
lot of TCP connections to postmaster and don't send anything to the 
server, you exhaust those slots, and the server won't be able to start 
any autovacuum workers or background workers either. That's not great. I 
started to work on approach b), with separate pools of slots for 
different kinds of child processes, which fixes that. Stay tuned for a 
patch.


In addition to that, you can have an unlimited number of "dead-end" 
backends, which are doomed to just respond with a "sorry, too many 
clients" error. The only limit on those is the amount of resources 
needed for all the processes and a little memory to track them.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Detect double-release of spinlock

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 19:51, Andres Freund wrote:

On 2024-07-29 09:40:26 -0700, Andres Freund wrote:

On 2024-07-29 12:33:13 -0400, Tom Lane wrote:

Andres Freund  writes:

On 2024-07-29 11:31:56 -0400, Tom Lane wrote:

There was some recent discussion about getting rid of
--disable-spinlocks on the grounds that nobody would use
hardware that lacked native spinlocks.  But now I wonder
if there is a testing/debugging reason to keep it.



Seems it'd be a lot more straightforward to just add an assertion to the
x86-64 spinlock implementation verifying that the spinlock isn't already free?


FWIW, I quickly hacked that up, and it indeed quickly fails with 0393f542d72^
and passes with 0393f542d72.


+1. Thanks!


Other context from this discussion:
I dunno, is that the only extra check that the --disable-spinlocks
implementation is providing?


I think it also provides the (valuable!) check that spinlocks were actually
initialized. But that again seems like something we'd be better off adding
more general infrastructure for - nobody runs --disable-spinlocks locally, we
shouldn't need to run this on the buildfarm to find problems like this.


Note that the "check" for double-release with the fallback 
implementation wasn't an explicit check either. It just incremented the 
underlying semaphore, which caused very weird failures later in 
completely unrelated code. An explicit assert would be much nicer.


+1 for removing --disable-spinlocks, but let's add this assertion first.


I'm kind of allergic to putting Asserts into spinlocked code segments,
mostly on the grounds that it violates the straight-line-code precept.
I suppose it's not really that bad for tests that you don't expect
to fail, but still ...


I don't think the spinlock implementation itself is really affected by that
rule - after all, the --disable-spinlocks implementation actually consists
of several layers of external function calls (including syscalls in some
cases!).


Yeah I'm not worried about that at all. Also, the assert is made when 
you have already released the spinlock; you are already out of the 
critical section.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Interrupts vs signals

2024-07-29 Thread Robert Haas
On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:
> a) assign every child process a PGPROC entry, and make postmaster
> responsible for assigning them like you suggest. We'll need more PGPROC
> entries, because currently a process doesn't reserve one until
> authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Dagfinn Ilmari Mannsåker
Nathan Bossart  writes:

> Committed.

Thanks!

- ilmari




Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Nathan Bossart
Committed.

-- 
nathan




Re: Incremental backup from a streaming replication standby fails

2024-07-29 Thread Robert Haas
On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov  wrote:
> Great!  Should we mark the corresponding v17 open item as closed?

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

2024-07-29 Thread Robert Haas
On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov  wrote:
> 0002 could also use pg_indent and pgperltidy.  And I have couple other
> notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

> >In the process of fixing these bugs, I realized that the logic to wait
> >for WAL summarization to catch up was spread out in a way that made
> >it difficult to reuse properly, so this code refactors things to make
> >it easier.
>
> It would be nice to split refactoring out of material logic changed.
> This way it would be easier to review and would look cleaner in the
> git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

> >To make this fix work, also teach the WAL summarizer that after a
> >promotion has occurred, no more WAL can appear on the previous
> >timeline: previously, the WAL summarizer wouldn't switch to the new
> >timeline until we actually started writing WAL there, but that meant
> >that when the startup process was waiting for the WAL summarizer, it
> >was waiting for an action that the summarizer wasn't yet prepared to
> >take.
>
> I think this is a pretty long sentence, and I'm not sure I can
> understand it correctly.  Does startup process wait for the WAL
> summarizer without this patch?  If not, it's not clear for me that
> word "previously" doesn't distribute to this part of sentence.
> Breaking this into multiple sentences could improve the clarity for
> me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 5:57 AM Daniel Gustafsson  wrote:
> I'm sure there are more interactions with OpenSSL, and TLS in general, which
> warrants better comments but the attached takes a stab at the two examples in
> question here to get started (to avoid perfect get in the way of progress).

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Andreas Karlsson

On 7/29/24 5:00 PM, Alexander Lakhin wrote:

I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?


I was actually wondering about those yesterday and they should probably 
be removed (or fixed if anyone can see a use for them). As they are 
right now they do not seem very useful, especially with the current 
selection of locales: de_DE.ISO8859-1, gr_GR.ISO8859-7 and koi8-r.


Andreas




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Alexander Lakhin

Hello Tom and Heikki,

29.07.2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.


I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?

(As far as I can tell, KOI8-R has fallen out of mainstream usage in Russia
twenty years ago...)

Best regards,
Alexander




Re: Virtual generated columns

2024-07-29 Thread Peter Eisentraut

On 22.07.24 12:53, jian he wrote:

another bug?
drop table gtest12v;
CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
insert into gtest12v (a,b) values (11,  22147483647);
table gtest12v;

insert ok, but select error:
ERROR:  integer out of range

should insert fail?


I think this is the correct behavior.

There has been a previous discussion: 
https://www.postgresql.org/message-id/2e3d5147-16f8-af0f-00ab-4c72cafc896f%402ndquadrant.com



CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
CREATE SEQUENCE sequence_testx OWNED BY gtest12v.c;

seems to work. But I am not sure if there are any corner cases that
make it not work.
just want to raise this issue.


I don't think this matters.  You can make a sequence owned by any 
column, even if that column doesn't have a default that invokes the 
sequence.  So nonsensical setups are possible, but they are harmless.






Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 02:23, Joel Jacobson wrote:
> Then, I've used sched_setaffinity() from  to ensure the 
> benchmark run on CPU core id 31.

I fixed a bug in my measure function: I had forgotten to reset affinity after each
benchmark, so the PostgreSQL backend continued to use the core even after
numeric_mul had finished.

New results with less noise below.

Pardon the exceeding of 80 chars line width,
but felt important to include commit hash and relative delta.


ndigits|    rate    |  change   |   accum   | commit  |  summary
-------+------------+-----------+-----------+---------+----------
 (1,1) |  1.639e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,1) |  2.248e+07 | +37.16 %  | +37.16 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,1) |  2.333e+07 | +3.77 %   | +42.32 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,1) |  2.291e+07 | -1.81 %   | +39.75 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,1) |  2.276e+07 | -0.64 %   | +38.86 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,1) |  2.256e+07 | -0.86 %   | +37.66 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,1) |  2.182e+07 | -3.32 %   | +33.09 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,2) |  1.640e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,2) |  2.202e+07 | +34.28 %  | +34.28 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,2) |  2.214e+07 | +0.58 %   | +35.06 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,2) |  2.173e+07 | -1.85 %   | +32.55 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,2) |  2.260e+07 | +3.98 %   | +37.83 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,2) |  2.233e+07 | -1.19 %   | +36.19 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,2) |  2.144e+07 | -3.97 %   | +30.79 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,3) |  1.511e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,3) |  2.179e+07 | +44.20 %  | +44.20 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,3) |  2.134e+07 | -2.05 %   | +41.24 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,3) |  2.198e+07 | +2.99 %   | +45.47 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,3) |  2.190e+07 | -0.39 %   | +44.91 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,3) |  2.164e+07 | -1.16 %   | +43.23 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,3) |  2.104e+07 | -2.79 %   | +39.24 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,4) |  1.494e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,4) |  2.132e+07 | +42.71 %  | +42.71 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,4) |  2.151e+07 | +0.91 %   | +44.00 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,4) |  2.190e+07 | +1.82 %   | +46.62 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,4) |  2.172e+07 | -0.82 %   | +45.41 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,4) |  2.112e+07 | -2.75 %   | +41.41 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,4) |  2.077e+07 | -1.67 %   | +39.05 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,8) |  1.444e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,8) |  2.063e+07 | +42.85 %  | +42.85 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,8) |  1.996e+07 | -3.25 %   | +38.21 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,8) |  2.039e+07 | +2.12 %   | +41.14 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,8) |  2.020e+07 | -0.89 %   | +39.87 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,8) |  1.934e+07 | -4.28 %   | +33.89 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,8) |  1.948e+07 | +0.73 %   | +34.87 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,16)|  9.614e+06 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,16)|  1.215e+07 | +26.37 %  | +26.37 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,16)|  1.223e+07 

Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Tom Lane
Heikki Linnakangas  writes:
> These programs in src/backend/utils/mb/ are unused, and have been unused 
> and unusable since 2003:
> iso.c
> win1251.c
> win866.c
> Attached patch removes them. See commit message for a little more 
> detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.

regards, tom lane




Re: Make query cancellation keys longer

2024-07-29 Thread Heikki Linnakangas

On 24/07/2024 19:12, Heikki Linnakangas wrote:

On 04/07/2024 15:20, Jelte Fennema-Nio wrote:

On Thu, 4 Jul 2024 at 12:32, Heikki Linnakangas  wrote:

We currently don't do any locking on the ProcSignal array. For query
cancellations, that's good because a query cancel packet is processed
without having a PGPROC entry, so we cannot take LWLocks. We could use
spinlocks though. In this patch, I used memory barriers to ensure that
we load/store the pid and the cancellation key in a sensible order, so
that you cannot e.g. send a cancellation signal to a backend that's just
starting up and hasn't advertised its cancellation key in ProcSignal
yet. But I think this might be simpler and less error-prone by just
adding a spinlock to each ProcSignal slot. That would also fix the
existing race condition where we might set the pss_signalFlags flag for
a slot, when the process concurrently terminates and the slot is reused
for a different process. Because of that, we currently have this caveat:
"... all the signals are such that no harm is done if they're mistakenly
fired". With a spinlock, we could eliminate that race.


I think a spinlock would make this whole concurrency stuff a
lot easier to reason about.

+   slot->pss_cancel_key_valid = false;
+   slot->pss_cancel_key = 0;

If no spinlock is added, I think these accesses should still be made
atomic writes. Otherwise data-races on those fields are still
possible, resulting in undefined behaviour. The memory barriers you
added don't prevent that afaict. With atomic operations there are
still race conditions, but no data-races.

Actually it seems like that same argument applies to the already
existing reading/writing of pss_pid: it's written/read using
non-atomic operations so data-races can occur and thus undefined
behaviour too.


Ok, here's a version with spinlocks.

I went back and forth on what exactly is protected by the spinlock. I
kept the "volatile sig_atomic_t" type for pss_signalFlags, so that it
can still be safely read without holding the spinlock in
CheckProcSignal, but all the functions that set the flags now hold the
spinlock. That removes the race condition that you might set the flag
for the wrong slot, if the target backend exits and the slot is recycled.
The race condition was harmless and there were comments to note it, but
it doesn't occur anymore with the spinlock.
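The resulting pattern when setting a flag is roughly the following
(illustrative; field names follow the patch, but the exact code differs):

    SpinLockAcquire(&slot->pss_mutex);
    if (slot->pss_pid == pid)
    {
        /* the slot still belongs to the backend we mean to signal */
        slot->pss_signalFlags[reason] = true;
    }
    SpinLockRelease(&slot->pss_mutex);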

(Note: Thomas's "Interrupts vs signals" patch will remove ps_signalFlags
altogether. I'm looking forward to that.)


-	volatile pid_t pss_pid;
+	pid_t		pss_pid;

Why remove the volatile modifier?


Because I introduced a memory barrier to ensure the reads/writes of
pss_pid become visible to other processes in right order. That makes the
'volatile' unnecessary IIUC. With the spinlock, the point is moot however.


I:
- fixed a few comments,
- fixed a straightforward failure with EXEC_BACKEND (ProcSignal needs to 
be passed down in BackendParameters now),
- put back the snippet to signal the whole process group if supported, 
which you pointed out earlier


and committed this refactoring patch.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Conflict detection and logging in logical replication

2024-07-29 Thread Zhijie Hou (Fujitsu)
On Monday, July 29, 2024 6:59 PM Dilip Kumar  wrote:
> 
> On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> 
> I was going through v7-0001, and I have some initial comments.

Thanks for the comments !

> 
> @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
>   ExprContext *econtext;
>   Datum values[INDEX_MAX_KEYS];
>   bool isnull[INDEX_MAX_KEYS];
> - ItemPointerData invalidItemPtr;
>   bool checkedIndex = false;
> 
>   ItemPointerSetInvalid(conflictTid);
> - ItemPointerSetInvalid(&invalidItemPtr);
> 
>   /*
>   * Get information from the result relation info structure.
> @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
> 
>   satisfiesConstraint =
>   check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> - indexInfo, &invalidItemPtr,
> + indexInfo, &slot->tts_tid,
>   values, isnull, estate, false,
>   CEOUC_WAIT, true,
>   conflictTid);
> 
> What is the purpose of this change?  I mean
> 'check_exclusion_or_unique_constraint' says that 'tupleid'
> should be invalidItemPtr if the tuple is not yet inserted and
> ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> So what is this change?

Because the function ExecCheckIndexConstraints() is now invoked after inserting
a tuple (in the patch), we need to ignore the newly inserted tuple when
checking for conflicts in check_exclusion_or_unique_constraint().

> Would this change ExecInsert's behavior as well?

Thanks for pointing it out, I will check and reply.

> 
> 
> 
> 
> +ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
> +ConflictType type, List *recheckIndexes,
> +TupleTableSlot *slot)
> +{
> + /* Re-check all the unique indexes for potential conflicts */
> +foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
> + {
> + TupleTableSlot *conflictslot;
> +
> + if (list_member_oid(recheckIndexes, uniqueidx) &&
> + FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
> + &conflictslot)) { RepOriginId origin; TimestampTz committs;
> + TransactionId xmin;
> +
> + GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
> +ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
> +uniqueidx,  xmin, origin, committs, conflictslot);  }  } }
>  and
> 
> + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
> +
>   if (resultRelInfo->ri_NumIndices > 0)
>   recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
> -slot, estate, false, false,
> -NULL, NIL, false);
> +slot, estate, false,
> +conflictindexes ? true : false,
> +&conflict,
> +conflictindexes, false);
> +
> + /*
> + * Rechecks the conflict indexes to fetch the conflicting local tuple
> + * and reports the conflict. We perform this check here, instead of
> + * performing an additional index scan before the actual insertion and
> + * reporting the conflict if any conflicting tuples are found. This is
> + * to avoid the overhead of executing the extra scan for each INSERT
> + * operation, even when no conflict arises, which could introduce
> + * significant overhead to replication, particularly in cases where
> + * conflicts are rare.
> + */
> + if (conflict)
> + ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
> +recheckIndexes, slot);
> 
> 
>  This logic is confusing, first, you are calling
> ExecInsertIndexTuples() with no duplicate error for the indexes
> present in 'ri_onConflictArbiterIndexes' which means
>  the indexes returned by the function must be a subset of
> 'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
> you are again processing all the
>  indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
> is a subset of the indexes that is returned by
> ExecInsertIndexTuples().

I think that's not always true. The indexes returned by the function *may not*
be a subset of 'ri_onConflictArbiterIndexes'. Based on the comments atop
ExecInsertIndexTuples, it returns a list of index OIDs for any unique or
exclusion constraints that are deferred, and in addition to that, it will
include the indexes in 'arbiterIndexes' if noDupErr == true.

> 
>  Why are we doing that, I think we can directly use the
> 'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
> those indexes are guaranteed to be a subset of
>  ri_onConflictArbiterIndexes. No?

Based on the above, we need to filter the deferred indexes or exclusion constraints
in the 'ri_onConflictArbiterIndexes'.

Best Regards,
Hou zj



Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Peter Eisentraut
I have some comments about the first three patches, which deal with 
memory management.


v24-0001-Revert-ECPG-s-use-of-pnstrdup.patch

This looks right.

I suppose another approach would be to put a full replacement for 
strndup() into src/port/.  But since there is currently only one user, 
and most other users should be using pnstrdup(), the presented approach 
seems ok.


We should take the check for exit() calls from libpq and expand it to 
cover the other libraries as well.  Maybe there are other problems like 
this?



v24-0002-Remove-fe_memutils-from-libpgcommon_shlib.patch

I don't quite understand how this problem can arise.  The description says

"""
libpq appears to have no need for this, and the exit() references cause
our libpq-refs-stamp test to fail if the linker doesn't strip out the
unused code.
"""

But under what circumstances does "the linker doesn't strip out" happen? 
 If this happens accidentally, then we should have seen some buildfarm 
failures or something?


Also, one could look further and notice that restricted_token.c and 
sprompt.c both a) are not needed by libpq and b) can trigger exit() 
calls.  Then it's not clear why those are not affected.



v24-0003-common-jsonapi-support-libpq-as-a-client.patch

I'm reminded of thread [0].  I think there is quite a bit of confusion 
about the pqexpbuffer vs. stringinfo APIs, and they are probably used 
incorrectly quite a bit.  There are now also programs that use both of 
them!  This patch now introduces another layer on top of them.  I fear, 
in the end, nobody is going to understand any of this anymore.  Also, 
changing all the programs to link in libpq for pqexpbuffer seems like 
the opposite direction from what was suggested in [0].


I think we need to do some deeper thinking here about how we want the 
memory management on the client side to work.  Maybe we could just use 
one API but have some flags or callbacks to control the out-of-memory 
behavior.
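As a strawman for that last idea, a single buffer type whose out-of-memory
behavior is selected at initialization time instead of being baked into two
parallel APIs (all names here are invented for illustration):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    OOM_EXIT,           /* frontend stringinfo-style: report and exit */
    OOM_MARK_BROKEN     /* pqexpbuffer-style: mark the buffer broken */
} OOMBehavior;

typedef struct StrBuf
{
    char       *data;
    size_t      len;
    size_t      maxlen;
    OOMBehavior oom;    /* chosen once, at buffer creation */
    bool        broken; /* set on allocation failure, if requested */
} StrBuf;

static bool
strbuf_reserve(StrBuf *buf, size_t need)
{
    if (buf->len + need > buf->maxlen)
    {
        size_t      newmax = (buf->maxlen + need) * 2;
        char       *newdata = realloc(buf->data, newmax);

        if (newdata == NULL)
        {
            if (buf->oom == OOM_EXIT)
            {
                fprintf(stderr, "out of memory\n");
                exit(1);
            }
            buf->broken = true;
            return false;
        }
        buf->data = newdata;
        buf->maxlen = newmax;
    }
    return true;
}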


[0]: 
https://www.postgresql.org/message-id/flat/16d0beac-a141-e5d3-60e9-323da75f49bf%40eisentraut.org






Re: Conflict detection and logging in logical replication

2024-07-29 Thread shveta malik
On Mon, Jul 29, 2024 at 9:31 AM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
> >
> > On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> > >
> >
> > > One more thing we need to consider is whether we should LOG or ERROR
> > > for update/delete_differ conflicts. If we LOG as the patch is doing
> > > then we are intentionally overwriting the row when the user may not
> > > expect it. OTOH, without a patch anyway we are overwriting, so there
> > > is an argument that logging by default is what the user will expect.
> > > What do you think?
> >
> > I was under the impression that in this patch we do not intend to
> > change behaviour of HEAD and thus only LOG the conflict wherever
> > possible.
> >
>
> Earlier, I thought it was good to keep LOGGING the conflict where
> there is no chance of wrong data update but for cases where we can do
> something wrong, it is better to ERROR out. For example, for an
> update_differ case where the apply worker can overwrite the data from
> a different origin, it is better to ERROR out. I thought this case was
> comparable to an existing ERROR case like a unique constraint
> violation. But I see your point as well that one might expect the
> existing behavior where we are silently overwriting the different
> origin data. The one idea to address this concern is to suggest users
> set the detect_conflict subscription option as off but I guess that
> would make this feature unusable for users who don't want to ERROR out
> for different origin update cases.
>
> > And in the next patch of resolver, based on the user's input
> > of error/skip/or resolve, we take the action. I still think it is
> > better to stick to the said behaviour. Only if we commit the resolver
> > patch in the same version where we commit the detection patch, then we
> > can take the risk of changing this default behaviour to 'always
> > error'. Otherwise users will be left with conflicts arising but no
> > automatic way to resolve those. But for users who really want their
> > application to error out, we can provide an additional GUC in this
> > patch itself which changes the behaviour to 'always ERROR on
> > conflict'.
> >
>
> I don't see a need of GUC here, even if we want we can have a
> subscription option such conflict_log_level. But users may want to
> either LOG or ERROR based on conflict type. For example, there won't
> be any data inconsistency in two node replication for delete_missing
> case as one is trying to delete already deleted data, so LOGGING such
> a case should be sufficient whereas update_differ could lead to
> different data on two nodes, so the user may want to ERROR out in such
> a case.
>
> We can keep the current behavior as default for the purpose of
> conflict detection but can have a separate patch to decide whether to
> LOG/ERROR based on conflict_type.

+1 on the idea of giving an option to the user to choose either ERROR
or LOG for each conflict type separately.
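For instance, the apply worker could map each conflict type to a level
along these lines (a sketch; only CT_INSERT_EXISTS appears in the posted
patch, the other enum names are made up to match the conflict names above):

static int
conflict_elevel(ConflictType type)
{
    switch (type)
    {
        case CT_UPDATE_DIFFER:  /* can silently diverge the two nodes */
            return ERROR;
        case CT_DELETE_MISSING: /* deleting already-deleted data */
            return LOG;
        case CT_INSERT_EXISTS:
            return ERROR;
        default:
            return LOG;
    }
}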

thanks
Shveta




Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Aleksander Alekseev
Hi Rui,

> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.
>
> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.
>
> The attached patch addresses this issue by detaching shared memory after
> fork_process().

Thanks for the patch. How do you estimate its performance impact?

Note the comments for postmaster_child_launch(). This function is
exposed to third-party code and guarantees that shared memory is
attached. I doubt that there is much third-party code in existence to
break, but you should update the comment.

-- 
Best regards,
Aleksander Alekseev




Re: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-29 Thread Amit Kapila
On Thu, Jul 25, 2024 at 5:56 PM Kirill Reshke  wrote:
>
> > +/*
> > + * If the tuple to be modified could not be found, a log message is 
> > emitted.
> > + */
> > +static void
> > +report_tuple_not_found(CmdType cmd, Relation targetrel, bool is_partition)
> > +{
> > + Assert(cmd == CMD_UPDATE || cmd == CMD_DELETE);
> > +
> > + /* XXX should this be promoted to ereport(LOG) perhaps? */
> > + elog(DEBUG1,
> > + "logical replication did not find row to be %s in replication target 
> > relation%s \"%s\"",
> > + cmd == CMD_UPDATE ? "updated" : "deleted",
> > + is_partition ? "'s partition" : "",
> > + RelationGetRelationName(targetrel));
> > +}
>
> Encapsulating report logic inside function is ok,
>

This could even be a separate patch, as it is not directly related to other
parts of the 0002 patch. BTW, the problem statement for 0002 is not
explicitly stated, i.e. which part of the code we want to optimize by
removing duplication. Also, as proposed, the name
apply_handle_tuple_routing() for the function doesn't seem suitable as
it no longer remains similar to other apply_handle_* functions where
we perform the required operation like insert or update the tuple. How
about naming it as apply_tuple_routing()?

-- 
With Regards,
Amit Kapila.




Re: Fix smgrtruncate code comment.

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 13:50, Kirill Reshke wrote:

Today I was busy rebasing my Greenplum-related Extendable SMGR
patches on Cloudberry, and I faced some conflicts.
While resolving them I noticed a code & comment inconsistency in smgr.c's
smgrtruncate function, which traces back to
c5315f4f44843c20ada876fdb0d0828795dfbdf5. In that commit, the
smgr_fsm_nblocks & smgr_vm_nblocks fields were removed, but the
comments were not fixed accordingly.

So I suggest fixing this, PFA.


Applied, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Conflict detection and logging in logical replication

2024-07-29 Thread Dilip Kumar
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>

I was going through v7-0001, and I have some initial comments.

@@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,
  ExprContext *econtext;
  Datum values[INDEX_MAX_KEYS];
  bool isnull[INDEX_MAX_KEYS];
- ItemPointerData invalidItemPtr;
  bool checkedIndex = false;

  ItemPointerSetInvalid(conflictTid);
- ItemPointerSetInvalid(&invalidItemPtr);

  /*
  * Get information from the result relation info structure.
@@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,

  satisfiesConstraint =
  check_exclusion_or_unique_constraint(heapRelation, indexRelation,
- indexInfo, &invalidItemPtr,
+ indexInfo, &slot->tts_tid,
  values, isnull, estate, false,
  CEOUC_WAIT, true,
  conflictTid);

What is the purpose of this change?  I mean
'check_exclusion_or_unique_constraint' says that 'tupleid'
should be invalidItemPtr if the tuple is not yet inserted and
ExecCheckIndexConstraints is called by ExecInsert
before inserting the tuple.  So what is this change? Would this change
ExecInsert's behavior as well?




+ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
+ConflictType type, List *recheckIndexes,
+TupleTableSlot *slot)
+{
+ /* Re-check all the unique indexes for potential conflicts */
+ foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ TupleTableSlot *conflictslot;
+
+ if (list_member_oid(recheckIndexes, uniqueidx) &&
+ FindConflictTuple(resultRelInfo, estate, uniqueidx, slot, &conflictslot))
+ {
+ RepOriginId origin;
+ TimestampTz committs;
+ TransactionId xmin;
+
+ GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
+ ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc, uniqueidx,
+ xmin, origin, committs, conflictslot);
+ }
+ }
+}
 and

+ conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
+
  if (resultRelInfo->ri_NumIndices > 0)
  recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-slot, estate, false, false,
-NULL, NIL, false);
+slot, estate, false,
+conflictindexes ? true : false,
+&conflict,
+conflictindexes, false);
+
+ /*
+ * Rechecks the conflict indexes to fetch the conflicting local tuple
+ * and reports the conflict. We perform this check here, instead of
+ * performing an additional index scan before the actual insertion and
+ * reporting the conflict if any conflicting tuples are found. This is
+ * to avoid the overhead of executing the extra scan for each INSERT
+ * operation, even when no conflict arises, which could introduce
+ * significant overhead to replication, particularly in cases where
+ * conflicts are rare.
+ */
+ if (conflict)
+ ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
+recheckIndexes, slot);


 This logic is confusing, first, you are calling
ExecInsertIndexTuples() with no duplicate error for the indexes
present in 'ri_onConflictArbiterIndexes' which means
 the indexes returned by the function must be a subset of
'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
you are again processing all the
 indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
is a subset of the indexes that is returned by
ExecInsertIndexTuples().

 Why are we doing that, I think we can directly use the
'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
those indexes are guaranteed to be a subset of
 ri_onConflictArbiterIndexes. No?



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Add crc32(text) & crc32(bytea)

2024-07-29 Thread Aleksander Alekseev
Hi Nathan,

> I don't think adding crc32c() would sufficiently increase the scope.  We'd
> use the existing implementations for both crc32() and crc32c().  And
> besides, this could be useful for adding tests for that code.
>
> +crc32 ( text )
>
> Do we need a version of the function that takes a text input?  It's easy
> enough to cast to a bytea.
>
> +text
>
> My first reaction is that we should just have this return bytea like the
> SHA ones do, if for no other reason than commit 10cfce3 seems intended to
> move us away from returning text for these kinds of functions.  Upthread,
> you mentioned the possibility of returning a bigint, too.  I think I'd
> still prefer bytea in case we want to add, say, crc64() or crc16() in the
> future.  That would allow us to keep all of these functions consistent
> instead of returning different types for each.  However, I understand that
> returning the numeric types might be more convenient.  I'm curious what
> others think about this.
>
> +Computes the CRC32 hash of
> +the binary string, with the result written in hexadecimal.
>
> I'm not sure we should call the check values "hashes."  Wikipedia does
> include them in the "List of hash functions" page [0], but it seems to
> deliberately avoid calling them hashes in the CRC page [1].  I'd suggest
> calling them "CRC32 values" instead.

Thanks for the code review. Here is the updated patch.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Add-crc32-bytea-crc32c-bytea.patch
Description: Binary data


Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Andrew Dunstan



On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
HEAD for fairywren and drongo - fairywren has just gone green, and I 
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l

17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.




Well, that last fact explains the discrepancy I originally complained 
about. How the heck did that happen? It looks like we just ignored its 
use in Makefile.global.in :-(



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Logical Replication of sequences

2024-07-29 Thread vignesh C
On Thu, 25 Jul 2024 at 15:41, shveta malik  wrote:
>
> On Thu, Jul 25, 2024 at 12:08 PM shveta malik  wrote:
> >
> > On Thu, Jul 25, 2024 at 9:06 AM vignesh C  wrote:
> > >
> > > The attached v20240725 version patch has the changes for the same.
> >
> > Thank You for addressing the comments. Please review below issues:
> >
> > 1) Sub ahead of pub due to wrong initial sync of last_value for
> > non-incremented sequences. Steps at [1]
> > 2) Sequence's min value is not honored on sub during replication. Steps at 
> > [2]
>
> One more issue:
> 3)  Sequence datatype's range is not honored on sub during
> replication, while it is honored for tables.
>
>
> Behaviour for tables:
> -
> Pub: create table tab1( i integer);
> Sub: create table tab1( i smallint);
>
> Pub: insert into tab1 values(generate_series(1, 32768));
>
> Error on sub:
> 2024-07-25 10:38:06.446 IST [178680] ERROR:  value "32768" is out of
> range for type smallint
>
> -
> Behaviour for sequences:
> -
>
> Pub:
> CREATE SEQUENCE myseq_i as integer INCREMENT 10000 START 1;
>
> Sub:
> CREATE SEQUENCE myseq_i as smallint INCREMENT 10000 START 1;
>
> Pub:
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i'); -->brings value to 40001
>
> Sub:
> ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
> SELECT * from pg_sequences;  --> last_value has reached 40001, while the
> range only extends to 32767.

This issue is addressed in the v20240730_2 version patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3%2BXzHAbgyn8gmbBLK5goyv_uyGgHEsTQxRZ8bVk6nAEg%40mail.gmail.com

Regards,
Vignesh




Re: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-29 Thread Amit Kapila
On Thu, Jul 25, 2024 at 4:00 PM Zhijie Hou (Fujitsu)
 wrote:
>
> When reviewing the code in logical/worker.c, I noticed that when applying a
> cross-partition update action, it scans the old partition twice.
> I am attaching the patch 0001 to remove this duplicate table scan.
>
> The test shows that it brings noticeable improvement:
>
> Steps
> -
> Pub:
> create table tab (a int not null, b int);
> alter table tab replica identity full;
> insert into tab select 1,generate_series(1, 100, 1);
>
> Sub:
> create table tab (a int not null, b int) partition by range (b);
> create table tab_1 partition of tab for values from (minvalue) to (500);
> create table tab_2 partition of tab for values from (500) to (maxvalue);
> alter table tab replica identity full;
>
>
> Test query:
> update tab set b = 600 where b > 00; -- UPDATE 100
>
> Results (The time spent by apply worker to apply the all the UPDATEs):
> Before  14s
> After   7s
> -
>

The idea sounds good to me. BTW, we don't need the following comment
in the 0001 patch:
We
+ * don't call apply_handle_delete_internal() here to avoid
+ * repeating some work already done above to find the
+ * local tuple in the partition.

It is implied by the change and we already follow the same for the update.

-- 
With Regards,
Amit Kapila.




Re: Add ALL_CANDIDATES option to EXPLAIN

2024-07-29 Thread Ashutosh Bapat
On Fri, Jul 26, 2024 at 10:47 PM Robert Haas  wrote:
>
> On Fri, Jul 26, 2024 at 12:59 PM Anthonin Bonnefoy
>  wrote:
> > I have a prototype for an ALL_CANDIDATES option for EXPLAIN. The goal
> > of this option is to print all plan candidates instead of only the
> > cheapest plan. It will output the plans from the most expensive at the
> > top to the cheapest. Here's an example:
>
> It's difficult for me to understand how this can work. Either it's not
> really going to print out all candidates, or it's going to print out
> gigabytes of output for moderately complex queries.
>
> I've thought about trying to figure out some way of identifying and
> printing out plans that are "interestingly different" from the chosen
> plan, with the costs they would have had, but I haven't been able to
> come up with a good algorithm. Printing out absolutely everything
> doesn't seem viable, because planning would be slow and use amazing
> amounts of memory and the output would be so large as to be useless.

If we print the path forest as a forest as against individual path
trees, we will be able to cut down on the size but it will still be
huge. Irrespective of that even with slightly non-trivial queries it's
going to be difficult to analyze these paths. The way I think of it is
dumping this information in the form of tables. Roughly something like
a table containing RelOptInfo id and RelOptInfo itself and another
containing all the paths identified by id and RelOptInfo id. The path
linkages are stored as path ids. That's a minimum. We will need more
tables to store query, and other metadata. If we do so we can use SQL
to carry out investigations.
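
To make this concrete, here is a minimal sketch of what such a dump schema
and an investigation query could look like. Every name below is invented
purely for illustration; nothing like this exists in the patch or in core:

-- Hypothetical schema for a dumped path forest (all names invented).
CREATE TABLE planner_reloptinfo (
    reloptinfo_id int PRIMARY KEY,
    relids        text,   -- textual form of the RelOptInfo's relid set
    info          jsonb   -- remaining RelOptInfo metadata
);

CREATE TABLE planner_path (
    path_id       int PRIMARY KEY,
    reloptinfo_id int REFERENCES planner_reloptinfo,
    pathtype      text,
    startup_cost  float8,
    total_cost    float8,
    subpath_ids   int[]   -- path linkages stored as path ids
);

-- Example investigation: all paths within 10% of the cheapest path
-- for each RelOptInfo.
SELECT p.reloptinfo_id, p.path_id, p.pathtype, p.total_cost
FROM planner_path p
WHERE p.total_cost <= 1.1 * (SELECT min(total_cost)
                             FROM planner_path
                             WHERE reloptinfo_id = p.reloptinfo_id)
ORDER BY p.reloptinfo_id, p.total_cost;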

-- 
Best Wishes,
Ashutosh Bapat




Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Daniel Gustafsson
> On 26 Jul 2024, at 20:29, Robert Haas  wrote:

> One of my chronic complaints about comments is
> that they should say why we're doing things, not what we're doing.

Agreed.

> I feel like any
> place where we are doing X because of some property of a non-PG code
> base with which a particular reader might not be familiar, we should
> have a comment explaining why we're doing it. And especially if it's
> security-relevant.

I'm sure there are more interactions with OpenSSL, and TLS in general, that
warrant better comments, but the attached takes a stab at the two examples in
question here to get started (to avoid letting perfect get in the way of progress).

--
Daniel Gustafsson



openssl_comments.diff
Description: Binary data


Re: Conflict detection and logging in logical replication

2024-07-29 Thread Amit Kapila
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, July 26, 2024 7:34 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 25, 2024 at 4:12 PM Amit Kapila 
> > wrote:
> > > >
> > A few more comments:
>
> Thanks for the comments.
>
> > 1.
> > For duplicate key, the patch reports conflict as following:
> > ERROR:  conflict insert_exists detected on relation "public.t1"
> > 2024-07-26 11:06:34.570 IST [27800] DETAIL:  Key (c1)=(1) already exists in
> > unique index "t1_pkey", which was modified by origin 1 in transaction 770 at
> > 2024-07-26 09:16:47.79805+05:30.
> > 2024-07-26 11:06:34.570 IST [27800] CONTEXT:  processing remote data for
> > replication origin "pg_16387" during message type "INSERT" for replication
> > target relation "public.t1" in transaction 742, finished at 0/151A108
> >
> > In detail, it is better to display the origin name instead of the origin 
> > id. This will
> > be similar to what we do in CONTEXT information.
>
>
> Agreed. Before modifying this, I'd like to confirm the message style in the
> cases where origin id may not have a corresponding origin name (e.g., if the
> data was modified locally (id = 0), or if the origin that modified the data 
> has
> been dropped). I thought of two styles:
>
> 1)
> - for local change: "xxx was modified by a different origin \"(local)\" in 
> transaction 123 at 2024.."
> - for dropped origin: "xxx was modified by a different origin \"(unknown)\" 
> in transaction 123 at 2024.."
>
> One issue for this style is that user may create an origin with the same name
> here (e.g. "(local)" and "(unknown)").
>
> 2)
> - for local change: "xxx was modified locally in transaction 123 at 2024.."
>

This sounds good.

> - for dropped origin: "xxx was modified by an unknown different origin 1234 
> in transaction 123 at 2024.."
>

For this one, how about: "xxx was modified by a non-existent origin in
transaction 123 at 2024.."?

Also, please add comments in the code explaining when each of these two cases can occur.

-- 
With Regards,
Amit Kapila.




Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Alexander Lakhin

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has
just gone green, and I expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Best regards,
Alexander




Re: Visibility bug with prepared transaction with subtransactions on standby

2024-07-29 Thread a.kozhemyakin

Hi,
I'm trying to run REL_13_STABLE recovery tests for Windows and I get the
error


waiting for server to shut down.. failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Stale postmaster.pid file for node "paris": PID 1868 no longer exists
Bail out! pg_ctl stop failed

I noticed that on the buildfarm recovery tests are disabled for Windows; was
this done intentionally?


'invocation_args' => [
'--config',
'./bowerbird.conf',
'--skip-steps',
'recovery-check',
'--verbose',
'REL_13_STABLE'
],

REL_13_STABLE (071e19a36) - test passed
REL_13_STABLE (e9c8747ee) - test failed

28.06.2024 1:35, Heikki Linnakangas wrote:

On 20/06/2024 17:10, Heikki Linnakangas wrote:

On 20/06/2024 16:41, Heikki Linnakangas wrote:

Attached is a patch to fix this, with a test case.


The patch did not compile, thanks to a last-minute change in a field
name. Here's a fixed version.


All I heard is crickets, so committed and backported to all supported 
versions.







Re: Pgoutput not capturing the generated columns

2024-07-29 Thread Peter Smith
Thanks for the patch updates.

Here are my review comments for v21-0001.

I think this patch is mostly OK now except there are still some
comments about the TAP test.

==
Commit Message

0.
Using Create Subscription:
CREATE SUBSCRIPTION sub2_gen_to_gen CONNECTION '$publisher_connstr' PUBLICATION
pub1 WITH (include_generated_columns = true, copy_data = false)"

If you are going to give an example, I think a gen-to-nogen example
would be a better choice. That's because the gen-to-gen (as you have
here) is not going to replicate anything due to the subscriber-side
column being generated.

==
src/test/subscription/t/011_generated.pl

1.
+$node_subscriber2->safe_psql('postgres',
+ "CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a
* 22) STORED, c int)"
+);

The subscriber2 node was intended only for all the tables where we
need include_generated_columns to be true. Mostly that is the
combination tests. (tab_gen_to_nogen, tab_nogen_to_gen, etc) OTOH,
table 'tab1' already existed. I don't think we need to bother
subscribing to tab1 from subscriber2 because every combination is
already covered by the combination tests. Let's leave this one alone.


~~~

2.
Huh? Where is the "tab_nogen_to_gen" combination test that I sent to
you off-list?

~~~

3.
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab_order (c int GENERATED ALWAYS AS (a * 22) STORED,
a int, b int)"
+);

Maybe you can test 'tab_order' on both subscription nodes but I think
it is overkill. IMO it is enough to test it on subscription2.

~~~

4.
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab_alter (a int, b int, c int GENERATED ALWAYS AS (a
* 22) STORED)"
+);

Ditto above. Maybe you can test 'tab_alter' on both subscription nodes,
but I think it is overkill. IMO it is enough to test it on
subscription2.

~~~

5.
Don't forget to add initial data for the missing nogen_to_gen table/test.

~~~

6.
 $node_publisher->safe_psql('postgres',
- "CREATE PUBLICATION pub1 FOR ALL TABLES");
+ "CREATE PUBLICATION pub1 FOR TABLE tab1, tab_gen_to_gen,
tab_gen_to_nogen, tab_gen_to_missing, tab_missing_to_gen, tab_order");
+
 $node_subscriber->safe_psql('postgres',
  "CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
 );

It is not a bad idea to reduce the number of publications as you have
done, but IMO jamming all the tables into 1 publication is too much
because it makes it less understandable instead of simpler.

How about this:
- leave the 'pub1' just for 'tab1'.
- have a 'pub_combo' for publication all the gen_to_nogen,
nogen_to_gen etc combination tests.
- and a 'pub_misc' for any other misc tables like tab_order.
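
In SQL terms the split could look something like this (publication names as
suggested above; the exact table list for pub_combo depends on the final set
of combination tests):

CREATE PUBLICATION pub1 FOR TABLE tab1;
CREATE PUBLICATION pub_combo FOR TABLE tab_gen_to_gen, tab_gen_to_nogen,
    tab_nogen_to_gen, tab_gen_to_missing, tab_missing_to_gen;
CREATE PUBLICATION pub_misc FOR TABLE tab_order;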

~~~

7.
+#
 # Wait for initial sync of all subscriptions
+#

I think you should write a note here that you have deliberately set
copy_data = false because COPY and include_generated_columns are not
allowed at the same time for patch 0001. And that is why all expected
results on subscriber2 will be empty. Also, say this limitation will
be changed in patch 0002.

~~~

(I didn't yet check 011_generated.pl file results beyond this point...
I'll wait for v22-0001 to review further)

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Anthonin Bonnefoy
On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  wrote:
> An alternative idea would
> be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> write its messages there. This is somewhat similar to what we do in
> the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> much but I think it could be better than writing logs in the single
> format.

I've tested this approach, it definitely looks better. I've added a
logbuf StringInfo to AcquireSampleRowsFunc which will receive the
logs. elevel was removed as it is not used anymore. Since everything
is in the same log line, I've removed the relation name in the acquire
sample functions.

For partitioned tables, I've also included the child partition currently
being sampled. The output will look like:

INFO:  analyze of table "postgres.public.test_partition"
Sampling rows from child "public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
estimated total tuples
Sampling rows from child "public.test_partition_2"
pages: 5 of 5 scanned
tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
estimated total tuples
avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
...

For a file_fdw, the output will be:

INFO:  analyze of table "postgres.public.pglog"
tuples: 60043 tuples; 3 tuples in sample
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...
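
For reference, output of this shape can be produced with something like the
following (a rough sketch with the patch applied; the row counts will differ
from the example above):

CREATE TABLE test_partition (a int, b int) PARTITION BY RANGE (b);
CREATE TABLE test_partition_1 PARTITION OF test_partition
    FOR VALUES FROM (MINVALUE) TO (500);
CREATE TABLE test_partition_2 PARTITION OF test_partition
    FOR VALUES FROM (500) TO (MAXVALUE);
INSERT INTO test_partition SELECT 1, g FROM generate_series(1, 1999) g;

-- Prints one "Sampling rows from child ..." block per partition:
ANALYZE (VERBOSE) test_partition;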

Regards,
Anthonin


v4-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


v4-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v4-0003-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 11:46, Peter Smith  wrote:
>
> Hi Vignesh,
>
> There are still pending changes from my previous review of the
> 0720-0003 patch [1], but here are some new review comments for your
> latest patch v20240525-0003.
> 2b.
> Is it better to name these returned by-ref ptrs like 'ret_log_cnt',
> and 'ret_lsn' to emphasise they are output variables? YMMV.

I felt this is OK, as we have mentioned it in the function header too.

> ==
> src/test/subscription/t/034_sequences.pl
>
> 4.
> Q. Should we be suspicious that log_cnt changes from '32' to '31', or
> is there a valid explanation? It smells like some calculation is
> off-by-one, but without debugging I can't tell if it is right or
> wrong.

It works like this: after every 33rd nextval we will get log_cnt as 0. So
for the 198th call (33 * 6) log_cnt will be 0, then for the 199th call log_cnt
will be 32 and for the 200th it will be 31. This pattern repeats, so this is ok.
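
This can be observed directly (a minimal sketch; log_cnt is read from the
sequence relation itself, and 32 is the number of values a nextval call
WAL-logs ahead):

CREATE SEQUENCE seq_demo;
SELECT nextval('seq_demo');    -- 1st call, logs 32 values ahead
SELECT log_cnt FROM seq_demo;  -- 32
SELECT nextval('seq_demo');    -- 2nd call consumes one pre-logged value
SELECT log_cnt FROM seq_demo;  -- 31
-- The 33rd call brings log_cnt to 0, and the 34th logs ahead again,
-- restarting the 32, 31, ... countdown.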

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh




Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 08:04, Peter Smith  wrote:
>
> Here are some review comments for latest patch v20240725-0002
>
> ==
> doc/src/sgml/ref/create_publication.sgml
>
> nitpick - tweak to the description of the example.
>
> ==
> src/backend/parser/gram.y
>
> preprocess_pub_all_objtype_list:
> nitpick - typo "allbjects_list"
> nitpick - reword function header
> nitpick - /alltables/all_tables/
> nitpick - /allsequences/all_sequences/
> nitpick - I think code is safe as-is because makeNode internally does
> palloc0, but OTOH adding Assert would be nicer just to remove any
> doubts.
>
> ==
> src/bin/psql/describe.c
>
> 1.
> + /* Print any publications */
> + if (pset.sversion >= 18)
> + {
> + int tuples = 0;
>
> No need to assign value 0 here, because this will be unconditionally
> assigned before use anyway.
>
> 
>
> 2. describePublications
>
>   has_pubviaroot = (pset.sversion >= 13);
> + has_pubsequence = (pset.sversion >= 18000);
>
> That's a bug! Should be 18, not 18000.
>
> ==
>
> And, please see the attached diffs patch, which implements the
> nitpicks mentioned above.

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh




Re: Add mention of execution time memory for enable_partitionwise_* GUCs

2024-07-28 Thread David Rowley
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
 wrote:
> I am fine if we want to mention that the executor may consume a large
> amount of memory when these GUCs are turned ON. Users may decide to
> turn those OFF if they can not afford to spend that much memory during
> execution. But I don't like tying execution time consumption with
> default OFF.

Would the attached address your concern about the reasons for defaulting to off?

David


partitionwise_docs.patch
Description: Binary data


Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread Bertrand Drouvot
Hi John,

On Thu, Jul 18, 2024 at 02:22:08PM -0700, John H wrote:
> Hi Bertrand,
> 
> > 1 ===
> > ...
> > That's worth additional comments in the code.
> 
> There's this comment already about caching the value already, not sure
> if you prefer something more?
> 
> /* Cache values to reduce contention on lock */

Yeah, at the same place as the static lsn[] declaration, something like:

static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE]; /* cached LSNs */

but that may just be a matter of taste.

> > 3 ===
> > ...
> > NUM_SYNC_REP_WAIT_MODE is small but as the goal is the keep the lock time as
> > short as possible I wonder if it wouldn't be better to use memcpy() here 
> > instead
> > of this for loop.
> >
> 
> It results in a "Wdiscarded-qualifiers" warning, which is safe given we take
> the lock, but adds noise?
> What do you think?
> 
> "slot.c:2756:46: warning: passing argument 2 of ‘memcpy’ discards
> ‘volatile’ qualifier from pointer target type
> [-Wdiscarded-qualifiers]"

Right, we may want to cast it then, but given that the for loop is "small",
I think it's also fine to keep the for loop.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Flush pgstats file during checkpoints

2024-07-28 Thread Bertrand Drouvot
Hi,

On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:
> On Mon, Jul 22, 2024 at 07:01:41AM +, Bertrand Drouvot wrote:
> > 3 ===
> > 
> > +   /*
> > +* Read the redo LSN stored in the file.
> > +*/
> > +   if (!read_chunk_s(fpin, &file_redo) ||
> > +   file_redo != redo)
> > +   goto error;
> > 
> > I wonder if it would make sense to have dedicated error messages for
> > "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
> > ease to diagnose as to why the stat file is discarded.
> 
> Yep.  This has been itching me quite a bit, and that's a bit more than
> just the format ID or the redo LSN: it relates to all the read_chunk()
> callers.  I've taken a shot at this with patch 0001, implemented on
> top of the rest.

Thanks! 0001 attached is 
v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attach the right one.

> Attaching a new v4 series, with all these comments addressed.

Thanks!

Looking at 0002:

1 ===

   if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+   elog(WARNING, "could not read data of stats kind %d for entry 
of type %c",
+   kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?

2 ===

   if (!read_chunk_s(fpin, &name))
+  {
+   elog(WARNING, "could not read name of stats kind %d for entry 
of type %c",
+kind, t);
goto error;
+  }
   if (!pgstat_is_kind_valid(kind))
+  {
+   elog(WARNING, "invalid stats kind %d for entry of type %c",
+kind, t);
goto error;
+  }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?

  if (!read_chunk_s(fpin, &kind))
+ {
+  elog(WARNING, "could not read stats kind for entry of type %c", t);
   goto error;
+ }

Looking at 0003: LGTM

Looking at 0004: LGTM

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Conflict detection and logging in logical replication

2024-07-28 Thread Amit Kapila
On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
>
> On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> >
>
> > One more thing we need to consider is whether we should LOG or ERROR
> > for update/delete_differ conflicts. If we LOG as the patch is doing
> > then we are intentionally overwriting the row when the user may not
> > expect it. OTOH, without a patch anyway we are overwriting, so there
> > is an argument that logging by default is what the user will expect.
> > What do you think?
>
> I was under the impression that in this patch we do not intend to
> change behaviour of HEAD and thus only LOG the conflict wherever
> possible.
>

Earlier, I thought it was good to keep LOGGING the conflict where
there is no chance of wrong data update but for cases where we can do
something wrong, it is better to ERROR out. For example, for an
update_differ case where the apply worker can overwrite the data from
a different origin, it is better to ERROR out. I thought this case was
comparable to an existing ERROR case like a unique constraint
violation. But I see your point as well that one might expect the
existing behavior where we are silently overwriting the different
origin data. The one idea to address this concern is to suggest users
set the detect_conflict subscription option as off but I guess that
would make this feature unusable for users who don't want to ERROR out
for different origin update cases.

> And in the next patch of resolver, based on the user's input
> of error/skip/or resolve, we take the action. I still think it is
> better to stick to the said behaviour. Only if we commit the resolver
> patch in the same version where we commit the detection patch, then we
> can take the risk of changing this default behaviour to 'always
> error'. Otherwise users will be left with conflicts arising but no
> automatic way to resolve those. But for users who really want their
> application to error out, we can provide an additional GUC in this
> patch itself which changes the behaviour to 'always ERROR on
> conflict'.
>

I don't see a need for a GUC here; even if we want one, we can have a
subscription option such as conflict_log_level. But users may want to
either LOG or ERROR based on conflict type. For example, there won't
be any data inconsistency in two node replication for delete_missing
case as one is trying to delete already deleted data, so LOGGING such
a case should be sufficient whereas update_differ could lead to
different data on two nodes, so the user may want to ERROR out in such
a case.

We can keep the current behavior as default for the purpose of
conflict detection but can have a separate patch to decide whether to
LOG/ERROR based on conflict_type.
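
Hypothetically, a per-conflict-type setting could look something like the
following. This is pure illustration: neither the option name nor the value
syntax exists anywhere yet.

CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1
    WITH (detect_conflict = on,
          conflict_log_level = 'update_differ=error, delete_missing=log');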

-- 
With Regards,
Amit Kapila.




RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> > IIUC, the patch which adds user_name attribute to get_connection() can be
> discussed
> > in later stage, is it right?
> 
> No, let's work on the patch at this stage :)

OK, here is a rebased patch.

- Changed the name of new API from `GetUserMappingFromOid` to 
`GetUserMappingByOid`
  to keep the name consistent with others.
- Comments and docs were updated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch
Description:  0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch


RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for pushing and analyzing the failure!

> The regression.diffs shows that pgfdw_conn_check returned 0 even though
> pgfdw_conn_checkable()
> returned true. This can happen if the "revents" from poll() indicates 
> something
> other than
> POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or
> POLLNVAL. Therefore,
> IMO pgfdw_conn_check() should be updated as follows. I will test this change.
> 
> -   return (input_fd.revents & POLLRDHUP) ? 1 : 0;
> +   return (input_fd.revents &
> +   (POLLRDHUP | POLLHUP | POLLERR |
> POLLNVAL)) ? 1 : 0;

I think you are right.
According to the man page, the input socket is invalid or disconnected if
revents has such bits. So such cases should also be regarded as *failure*.
After the fix, the animal said OK. Great work!

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Ashutosh Bapat
Thanks a lot Richard.

On Mon, Jul 29, 2024 at 8:56 AM Richard Guo  wrote:
>
> On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> > I've worked a bit more on the comments and commit message, and I plan
> > to push the attached soon, barring any objections or comments.
>
> Pushed.
>
> Thanks
> Richard



-- 
Best Wishes,
Ashutosh Bapat




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> I've worked a bit more on the comments and commit message, and I plan
> to push the attached soon, barring any objections or comments.

Pushed.

Thanks
Richard




Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread shveta malik
On Fri, Jul 26, 2024 at 5:11 PM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 3:28 PM shveta malik  wrote:
> >
> > On Tue, Jul 23, 2024 at 10:35 AM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Jul 9, 2024 at 12:39 AM John H  wrote:
> > > >
> > > > > Out of curiosity, did you compare with 
> > > > > standby_slot_names_from_syncrep set to off
> > > > > and standby_slot_names not empty?
> > > >
> > > > I didn't think 'standby_slot_names' would impact TPS as much since
> > > > it's not grabbing the SyncRepLock but here's a quick test.
> > > > Writer with 5 synchronous replicas, 10 pg_recvlogical clients and
> > > > pgbench all running from the same server.
> > > >
> > > > Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5
> > > >
> > > > Result with: standby_slot_names =
> > > > 'replica_1,replica_2,replica_3,replica_4,replica_5'
> > > >
> > > > latency average = 5.600 ms
> > > > latency stddev = 2.854 ms
> > > > initial connection time = 5.503 ms
> > > > tps = 714.148263 (without initial connection time)
> > > >
> > > > Result with: standby_slot_names_from_syncrep = 'true',
> > > > synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > >
> > > > latency average = 5.740 ms
> > > > latency stddev = 2.543 ms
> > > > initial connection time = 4.093 ms
> > > > tps = 696.776249 (without initial connection time)
> > > >
> > > > Result with nothing set:
> > > >
> > > > latency average = 5.090 ms
> > > > latency stddev = 3.467 ms
> > > > initial connection time = 4.989 ms
> > > > tps = 785.665963 (without initial connection time)
> > > >
> > > > Again I think it's possible to improve the synchronous numbers if we
> > > > cache but I'll try that out in a bit.
> > > >
> > >
> > > Okay, so the tests done till now conclude that we won't get the
> > > benefit by using 'standby_slot_names_from_syncrep'. Now, if we
> > > increase the number of standby's in both lists and still keep ANY 3 in
> > > synchronous_standby_names then the results may vary. We should try to
> > > find out if there is a performance benefit with the use of
> > > synchronous_standby_names in the normal configurations like the one
> > > you used in the above tests to prove the value of this patch.
> > >
> >
> > I didn't fully understand the parameters mentioned above, specifically
> > what 'latency stddev' and 'latency average' represent.. But shouldn't
> > we see the benefit/value of this patch by having a setup where a
> > particular standby is slow in sending the response back to primary
> > (could be due to network lag or other reasons) and then measuring the
> > latency in receiving changes on failover-enabled logical subscribers?
> > We can perform this test with both of the below settings and say make
> > D and E slow in sending responses:
> > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
> >
>
> Yes, I also expect the patch should perform better in such a scenario
> but it is better to test it. Also, irrespective of that, we should
> investigate why the reported case is slower for
> synchronous_standby_names and see if we can improve it.

+1

> BTW, for 2), I think you wanted to say synchronized_standby_slots,
> not standby_slot_names. We have recently changed the GUC name.

yes, sorry, synchronized_standby_slots it is.

thanks
Shveta




Re: Simplify create_merge_append_path a bit for clarity

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 1:28 PM Paul A Jungwirth
 wrote:
> Is there a reason you don't want to remove the required_outer
> parameter altogether? I guess because it is such a common pattern to
> pass it?

I think it's best to keep this parameter unchanged to maintain
consistency with other functions that create path nodes in pathnode.c.

> Do you think it is worth keeping this assertion?:
>
> -
> -/* All child paths must have same parameterization */
> -Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>
> I understand any failure would trigger one of the prior asserts
> instead, but it does communicate an extra requirement, and there is no
> cost.

I don't think it's a good idea to keep this Assert: with this change
it becomes redundant.

> But I'd be fine with committing this patch as-is.

I've pushed this patch.  Thanks for review.

Thanks
Richard




Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-07-28 Thread Anton A. Melnikov



On 19.06.2024 21:06, Peter Geoghegan wrote:

On Wed, Jun 19, 2024 at 1:39 PM Alvaro Herrera  wrote:

FWIW I don't think HEAP_XMAX_INVALID is purely a hint.
HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on
its own; but as far as I recall, the INVALID flags must persist once
set.


Seems we disagree on some pretty fundamental things in this area, then.


To resolve this situation, it seems necessary to agree on what exactly a
"hint bit" means and how to use it.

For example, in this way:

1) Definition. A "hint bit", if set, represents the presence of some property
of an object (e.g. a tuple).
The value of a hint bit can be derived again at any time, so it is acceptable
for a hint bit to be lost during some operations.

2) Purpose. (It has already been defined by Yura Sokolov in one of the previous letters.)
Some work (e.g. CPU + memory usage) must be done to check the property of some object.
Checking the hint bit, if it is set, saves this work.
So the purpose of the hint bit is optimization.

3) Use. By default, code that needs to check some property of an object
must first check the corresponding hint bit. If the hint is set, conclude that
the property is present. If the hint is not set, do the work to check the
property of the object and set the hint bit if the property is present.
Non-standard behavior is also allowed, where the hint bit is ignored and the
property check is performed unconditionally for some reason. In this case the
code must contain a comment explaining that reason.

And maybe, for clarity, explicitly say that a bit is a hint right in its
definition?
For instance, use HEAP_XMIN_COMMITTED_HINT instead of HEAP_XMIN_COMMITTED.


Remarks and concerns are gratefully welcome.
 


With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alexander Korotkov
On Sun, Jul 28, 2024 at 12:59 PM Alena Rybakina
 wrote:
> On 27.07.2024 13:56, Alexander Korotkov wrote:
> > On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
> >   wrote:
> >> To be honest, I have found a big problem in this patch - we try to perform 
> >> the transformation every time we examine a column:
> >>
> >> for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ...
> >>
> >> }
> >>
> >> I have fixed it and moved the transformation before going through the loop.
> > What makes you think there is a problem?
>
> To be honest, I was bothered by the fact that we need to go through
> expressions several times that obviously will not fit under other
> conditions.
> Just yesterday I thought that it would be worthwhile to create a list of
> candidates - expressions that did not fit because the column did not
> match the index (!match_index_to_operand(nconst_expr, indexcol, index)).

I admit that this area probably could use some optimization,
especially for the case of many clauses and many indexes.  But in the
scope of this patch, I think this is enough to not make things worse
in this area.

> Another problem, related to the first one, is that the boolexpr could
> contain expressions referring to different operands, for example, both x
> and y. that is, we may have the problem that the optimal "ANY"
> expression may not be used, because the expression with x may come
> earlier and the loop may end earlier.
>
> alena@postgres=# create table b (x int, y int);
> CREATE TABLE
> alena@postgres=# insert into b select id, id from
> generate_series(1,1000) as id;
> INSERT 0 1000
> alena@postgres=# create index x_idx on b(x);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
> alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or
> x=6 or x = 7 or x=8 or x=9;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x =
> 8) OR (x = 9))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9 or y=1;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x =
> 9) OR (y = 1))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9;
> QUERY PLAN
> 
>   Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
> Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
> (2 rows)
>
> Furthermore expressions can be stored in a different order.
> For example, first comes "AND" expr, and then group of "OR" expr, which
> we can convert to "ANY" expr, but we won't do this due to the fact that
> we will exit the loop early, according to this condition:
>
> if (!IsA(sub_rinfo->clause, OpExpr))
> return NULL;
>
> or it may occur due to other conditions.
>
> alena@postgres=# create index x_y_idx on b(x,y);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
>
> alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4
> or x=5 or x=6 or x = 7 or x=8 or x=9;
>   QUERY PLAN
> -
>   Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
> Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR
> (x = 7) OR (x = 8) OR (x = 9))
> (2 rows)
>
> Because of these reasons, I tried to save this and that transformation
> together for each column and try to analyze for each expr separately
> which method would be optimal.

Yes, with v27 of the patch, optimization wouldn't work in these cases.
However, you are using quite a small table.  If you use a larger
table or disable sequential scans, there would be bitmap plans to
handle these queries.  So, v27 doesn't make the situation worse. It
just doesn't optimize all that it could potentially optimize and
that's OK.

I've written a separate 0002 patch to address this.  Now, before
generation of paths for bitmap OR, similar OR entries are grouped
together.  When considering a group of similar entries, they are
considered both together and one-by-one.  Ideally we could consider
more sophisticated grouping, but that seems fine for now.  You can
check how this patch handles the cases of above.

Also, 0002 addresses the issue of duplicated bitmap scan conditions in
different forms. During generate_bitmap_or_paths() we need to exclude the
considered condition for other clauses.  It couldn't simply be filtered out
in the later stage, because it could reach the index in
another form.

> > Do you have a test 

Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Joel Jacobson
On Sun, Jul 28, 2024, at 21:18, Dean Rasheed wrote:
> Attachments:
> * v3-0002-Optimise-numeric-multiplication-using-base-NBASE-.patch
> * v3-0001-Extend-mul_var_short-to-5-and-6-digit-inputs.patch

Very nice.

I've done some initial benchmarks on my Intel Core i9-14900K machine.

To reduce noise, I've isolated a single CPU core, specifically CPU core id 31,
so that the kernel does not schedule any other work on it:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-116-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv 
ro quiet splash isolcpus=31 intel_pstate=disable vt.handoff=7

Then, I've used sched_setaffinity() from <sched.h> to ensure the benchmark runs
on CPU core id 31.

I've also fixed the CPU frequency to 3.20 GHz:

$ sudo cpufreq-info -c 31
...
  current CPU frequency is 3.20 GHz (asserted by call to hardware).

I've benchmarked each (var1ndigits, var2ndigits) 10 times per commit, in random 
order.

I've benchmarked all commits after "SQL/JSON: Various improvements to SQL/JSON 
query function docs"
which is the parent commit to "Optimise numeric multiplication for short 
inputs.",
including the two patches.

I've benchmarked each commit affecting numeric.c, and each such commit's parent 
commit, for comparison.

The accum_change column shows the cumulative percentage change since the
baseline commit (SQL/JSON: Various improvements).

There is at least single-digit percentage noise in the measurements,
which is apparent since the rate fluctuates even between commits
for cases we know can't be affected by the change.
Still, even with this noise level, the improvements are very visible and 
consistent.

 ndigits |    rate    | accum_change | summary
---------+------------+--------------+--------------------------------
 (1,1) |  1.702e+07 |  | SQL/JSON: Various improvements
 (1,1) |  2.201e+07 | +29.32 % | Optimise numeric multiplicatio
 (1,1) |  2.268e+07 | +33.30 % | Use diff's --strip-trailing-cr
 (1,1) |  2.228e+07 | +30.92 % | Improve the numeric width_buck
 (1,1) |  2.195e+07 | +29.01 % | Add missing pointer dereferenc
 (1,1) |  2.241e+07 | +31.68 % | Extend mul_var_short() to 5 an
 (1,1) |  2.130e+07 | +25.17 % | Optimise numeric multiplicatio
 (1,2) |  1.585e+07 |  | SQL/JSON: Various improvements
 (1,2) |  2.227e+07 | +40.49 % | Optimise numeric multiplicatio
 (1,2) |  2.140e+07 | +35.00 % | Use diff's --strip-trailing-cr
 (1,2) |  2.227e+07 | +40.51 % | Improve the numeric width_buck
 (1,2) |  2.183e+07 | +37.75 % | Add missing pointer dereferenc
 (1,2) |  2.241e+07 | +41.41 % | Extend mul_var_short() to 5 an
 (1,2) |  2.223e+07 | +40.26 % | Optimise numeric multiplicatio
 (1,3) |  1.554e+07 |  | SQL/JSON: Various improvements
 (1,3) |  2.155e+07 | +38.70 % | Optimise numeric multiplicatio
 (1,3) |  2.140e+07 | +37.74 % | Use diff's --strip-trailing-cr
 (1,3) |  2.139e+07 | +37.66 % | Improve the numeric width_buck
 (1,3) |  2.234e+07 | +43.76 % | Add missing pointer dereferenc
 (1,3) |  2.142e+07 | +37.83 % | Extend mul_var_short() to 5 an
 (1,3) |  2.066e+07 | +32.97 % | Optimise numeric multiplicatio
 (1,4) |  1.450e+07 |  | SQL/JSON: Various improvements
 (1,4) |  2.113e+07 | +45.70 % | Optimise numeric multiplicatio
 (1,4) |  2.121e+07 | +46.30 % | Use diff's --strip-trailing-cr
 (1,4) |  2.115e+07 | +45.85 % | Improve the numeric width_buck
 (1,4) |  2.166e+07 | +49.37 % | Add missing pointer dereferenc
 (1,4) |  2.053e+07 | +41.56 % | Extend mul_var_short() to 5 an
 (1,4) |  2.085e+07 | +43.82 % | Optimise numeric multiplicatio
 (1,8) |  1.440e+07 |  | SQL/JSON: Various improvements
 (1,8) |  1.963e+07 | +36.38 % | Optimise numeric multiplicatio
 (1,8) |  2.018e+07 | +40.19 % | Use diff's --strip-trailing-cr
 (1,8) |  2.045e+07 | +42.05 % | Improve the numeric width_buck
 (1,8) |  1.998e+07 | +38.79 % | Add missing pointer dereferenc
 (1,8) |  1.953e+07 | +35.68 % | Extend mul_var_short() to 5 an
 (1,8) |  1.992e+07 | +38.36 % | Optimise numeric multiplicatio
 (1,16)|  9.444e+06 |  | SQL/JSON: Various improvements
 (1,16)|  1.235e+07 | +30.75 % | Optimise numeric multiplicatio
 (1,16)|  1.232e+07 | +30.47 % | Use diff's --strip-trailing-cr
 (1,16)|  1.239e+07 | +31.18 % | Improve the numeric width_buck
 (1,16)|  1.222e+07 | +29.35 % | Add missing pointer dereferenc
 (1,16)|  1.220e+07 | +29.14 % | Extend mul_var_short() to 5 an
 (1,16)|  1.271e+07 | +34.54 % | Optimise numeric multiplicatio
 (1,32)|  5.790e+06 |  | 

Re: pg_upgrade failing for 200+ million Large Objects

2024-07-28 Thread Tom Lane
I wrote:
> Alexander Korotkov  writes:
>> J4F, I have an idea to count number of ';' sings and use it for
>> transaction size counter, since it is as upper bound estimate of
>> number of SQL commands :-)

> Hmm ... that's not a completely silly idea.  Let's keep it in
> the back pocket in case we can't easily reduce the number of
> SQL commands in some cases.

After poking at this for awhile, we can fix Justin's example
case by avoiding repeated UPDATEs on pg_attribute, so I think
we should do that.  It seems clearly a win, with no downside
other than a small increment of complexity in pg_dump.
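
To illustrate the shape of that change (the commands below are illustrative
only; the real ones are generated by dumpTableSchema in binary-upgrade mode),
the point is to replace one UPDATE per column, each triggering its own
relcache invalidation:

UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname = 'a0';
UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname = 'a1';

with a single command covering all affected columns:

UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname IN ('a0', 'a1');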

However, that's probably not sufficient to mark this issue
as closed.  It seems likely that there are other patterns
that would cause backend memory bloat.  One case that I found
is tables with a lot of inherited constraints (not partitions,
but old-style inheritance).  For example, load the output of
this Perl script into a database:

-
for (my $i = 0; $i < 100; $i++)
{
print "CREATE TABLE test_inh_check$i (\n";
for (my $j = 0; $j < 1000; $j++)
{
print "a$j float check (a$j > 10.2),\n";
}
print "b float);\n";
print "CREATE TABLE test_inh_check_child$i() 
INHERITS(test_inh_check$i);\n";
}
-

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in
ruleutils.c, and pg_upgrade is worse --- and leaks memory too in
HEAD/v17.  The slowness was there before, so I think the lack of
field complaints indicates that this isn't a real-world use case.
Still, it's bad if pg_upgrade fails when it would not have before,
and there may be other similar issues.

So I'm forced to the conclusion that we'd better make the transaction
size adaptive as per Alexander's suggestion.

In addition to the patches attached, I experimented with making
dumpTableSchema fold all the ALTER TABLE commands for a single table
into one command.  That's do-able without too much effort, but I'm now
convinced that we shouldn't.  It would break the semicolon-counting
hack for detecting that tables like these involve extra work.
I'm also not very confident that the backend won't have trouble with
ALTER TABLE commands containing hundreds of subcommands.  That's
something we ought to work on probably, but it's not a project that
I want to condition v17 pg_upgrade's stability on.

Anyway, proposed patches attached.  0001 is some trivial cleanup
that I noticed while working on the failed single-ALTER-TABLE idea.
0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
and 0003 is Alexander's suggestion.

regards, tom lane

From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 13:02:27 -0400
Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command.

The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema
all use ONLY, except for this one.  I think it's not a live bug
because we don't permit foreign tables to have children, but
it's still inconsistent.  I happened across this while refactoring
dumpTableSchema to merge all its ALTERs into one; while I'm not
certain we should actually do that, this seems like good cleanup.
---
 src/bin/pg_dump/pg_dump.c| 2 +-
 src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b8b1888bd3..7cd6a7fb97 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			if (tbinfo->relkind == RELKIND_FOREIGN_TABLE &&
 tbinfo->attfdwoptions[j][0] != '\0')
 appendPQExpBuffer(q,
-  "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n"
+  "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n"
   "%s\n"
   ");\n",
   qualrelname,
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d3dd8784d6..5bcc2244d5 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1154,7 +1154,7 @@ my %tests = (
 
 	'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => {
 		regexp => qr/^
-			\QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
+			\QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
 			\s+\Qcolumn_name 'col1'\E\n
 			\Q);\E\n
 			/xm,
-- 
2.43.5

From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 16:19:48 -0400
Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for
 binary upgrade.

Avoid issuing a separate SQL UPDATE command for each column when
directly manipulating pg_attribute contents in binary upgrade mode.
With the separate updates, we triggered a relcache invalidation with
each update.  For a table with N columns, that causes O(N^2) relcache
bloat in txn_size mode because the 

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread David Rowley
On Mon, 29 Jul 2024 at 04:31, Tom Lane  wrote:
> It is not clear to me exactly which of these pointers should be
> presumed to be possibly-null, but certainly testing ident after
> storing through it is pretty pointless.  Maybe what was intended
> was
>
> -   if (ident && strcmp(*name, "dynahash") == 0)
> +   if (*name && strcmp(*name, "dynahash") == 0)

It should be *ident. I just missed adding the pointer dereference when
moving that code to a function.

Thanks for the report. I'll fix shortly.

David




Re: Pluggable cumulative statistics

2024-07-28 Thread Dmitry Dolgov
> On Sun, Jul 28, 2024 at 10:20:45PM GMT, Michael Paquier wrote:
> I would like to apply this new infrastructure stuff and move on to the
> problems related to the scability of pg_stat_statements.  So, are
> there any objections with all that?

So far I've got nothing against :)




Re: Speed up collation cache

2024-07-28 Thread Jeff Davis
On Sun, 2024-07-28 at 00:14 +0200, Andreas Karlsson wrote:
> But even without that extra optimization I think this patch is worth 
> merging and the patch is small, simple and clean and easy to
> understand 
> and a just a clear speed up. Feels like a no brainer. I think that it
> is 
> ready for committer.

Committed, thank you.

> And then we can discuss after committing if an additional cache of
> the 
> last locale is worth it or not.

Yeah, I'm holding off on that until refactoring in the area settles,
and we'll see if it's still worth it.

Regards,
Jeff Davis





Re: why is pg_upgrade's regression run so slow?

2024-07-28 Thread Andrew Dunstan



On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours.

Interesting.  Maybe meson is over-aggressively trying to run these
test suites in parallel?




Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
HEAD for fairywren and drongo - fairywren has just gone green, and I
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Dean Rasheed
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed  wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06  -- HEAD
> call rate=3.883459e+06  -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06  -- HEAD
> call rate=3.7965685e+06  -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06   -- HEAD
> call rate=3.39948e+06 -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs
>

I spent some more time hacking on this, to try to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation, which is very
slow -- every compiler/platform I looked at on godbolt.org turns this
into a call to a slow function like __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_small() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might be different (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice though, this
should be extremely unlikely, due to the fact that any difference
would have to propagate through MUL_GUARD_DIGITS NBASE digits (8
decimal digits), and it doesn't show up in any of the existing tests.
IMO a very small chance of different results on different platforms is
probably acceptable, but it wouldn't be acceptable to make the
threshold a runtime configurable parameter that could lead to
different results on the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build


Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
1 |1 |  5.973068e+06 |  6.873059e+06 | +15.07%
2 |2 |  5.646598e+06 | 6.6456665e+06 | +17.69%
3 |3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
4 |4 |  5.545298e+06 | 6.9701605e+06 | +25.69%
5 |5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
6 |6 | 4.9276335e+06 |  6.478698e+06 | +31.48%
7 |7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
8 |8 |  4.898536e+06 | 5.1599975e+06 | +5.34%
9 |9 |  4.537171e+06 |  4.830987e+06 | +6.48%
   10 |   10 |  4.308139e+06 | 4.6029265e+06 | +6.84%
   11 |   11 |  4.092612e+06 |   4.37966e+06 | +7.01%
   12 |   12 | 3.9345035e+06 |  4.213998e+06 | +7.10%
   13 |   13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
   14 |   14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
   15 |   15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
   16 |   16 |  3.657139e+06 | 3.9067975e+06 | +6.83%
   17 |   17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
   18 |   18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
   19 |   19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
   20 |   20 | 2.9012185e+06 | 3.0496352e+06 

Re: UUID v7

2024-07-28 Thread Andrey M. Borodin


> On 24 Jul 2024, at 04:09, Sergey Prokhorenko  
> wrote:
> 
> Implementations MAY alter the actual timestamp.

Hmm… looks like we slightly misinterpreted words about clock source.
Well, that’s great, let’s get offset back.
PFA version accepting offset interval.
It works like this:
postgres=# select uuidv7(interval '-2 months');
 018fc02f-0996-7136-aeb4-8936b5a516a1


postgres=# select uuid_extract_timestamp(uuidv7(interval '-2 months'));
 2024-05-28 22:11:15.71+05
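
A quick sanity check of the offset variant (a sketch; uuidv7() taking an
interval argument only exists with this patch applied). Since a version-7
UUID starts with its big-endian timestamp, a UUID generated with an earlier
offset should compare as smaller:

postgres=# select uuidv7(interval '-2 months') < uuidv7();
 t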

What do you think?


Best regards, Andrey Borodin.


v26-0001-Implement-UUID-v7.patch
Description: Binary data


Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
>> Is it time to worry yet?  If this were HEAD only, I'd not be too
>> concerned; but two of these three are on allegedly-stable branches.
>> And we have releases coming up fast.

> I don't know; neither decision feels terrible to me.

Yeah, same here.  Obviously, it'd be better to spend effort on getting
the bug fix committed than to spend effort on some cosmetic
workaround.

The fact that the failure is in the isolation tests not the core
regression tests reduces my level of concern somewhat about shipping
it this way.  I think that packagers typically run the core tests
not check-world during package verification, so they won't hit this.

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Noah Misch
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop 
> > a
> > separate change to stabilize the test.  One flake in three weeks is low 
> > enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread Tom Lane
David Rowley  writes:
> I ended up fixing that another way as the above seems to be casting
> away the const for those variables. Instead, I changed the signature
> of the function to:
> static void get_memory_context_name_and_ident(MemoryContext context,
> const char **const name,  const char **const ident);
> which I think takes into account for the call site variables being
> defined as "const char *".

I did not check the history to see quite what happened here,
but Coverity thinks the end result is rather confused,
and I agree:

*** CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/mcxtfuncs.c: 58 in 
get_memory_context_name_and_ident()
52  *ident = context->ident;
53 
54  /*
55   * To be consistent with logging output, we label dynahash contexts with
56   * just the hash table name as with MemoryContextStatsPrint().
57   */
>>> CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
>>> Null-checking "ident" suggests that it may be null, but it has already 
>>> been dereferenced on all paths leading to the check.
58  if (ident && strcmp(*name, "dynahash") == 0)
59  {
60  *name = *ident;
61  *ident = NULL;
62  }
63 }

It is not clear to me exactly which of these pointers should be
presumed to be possibly-null, but certainly testing ident after
storing through it is pretty pointless.  Maybe what was intended
was

-   if (ident && strcmp(*name, "dynahash") == 0)
+   if (*name && strcmp(*name, "dynahash") == 0)

?

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
>> A recent buildfarm test failure [1] showed that the
>> intra-grant-inplace-db.spec test added with 0844b3968 may fail
>> on a slow machine

>> But as the test going to be modified by the inplace110-successors-v8.patch
>> and the modified test (with all three latest patches applied) passes
>> reliably in the same conditions, maybe this failure doesn't deserve a
>> deeper exploration.

> Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> separate change to stabilize the test.  One flake in three weeks is low enough
> to make that okay.

It's now up to three similar failures in the past ten days: in
addition to

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

I see

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37

Is it time to worry yet?  If this were HEAD only, I'd not be too
concerned; but two of these three are on allegedly-stable branches.
And we have releases coming up fast.

(BTW, I don't think taipan qualifies as a slow machine.)

regards, tom lane




Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-28 Thread Alexander Cheshev
Hi Emanuel,

> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>

We have the following grammar:

notify_channel:
   ColId
   { $$ = $1; }
   | notify_channel '.' ColId
   { $$ = psprintf("%s.%s", $1, $3); }

And ColId is truncated in core scanner:

 ident = downcase_truncate_identifier(yytext, yyleng, true);

So each level is truncated independently. For this reason we observe the
behaviour which you described above.
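
A concrete illustration, assuming the default NAMEDATALEN of 64 (a 63-byte
identifier limit); both levels below are 64 bytes long, so each one is
truncated independently and raises its own NOTICE:

LISTEN aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb;
NOTICE:  identifier "aaa..." will be truncated to "aaa..."
NOTICE:  identifier "bbb..." will be truncated to "bbb..."

The stored channel name then has two 63-byte levels, 127 bytes in total.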

> Another observation, probably not strictly related to this patch itself but
> the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason why it isn't in the specs?
>

I believe it is simply because not all functionality is covered by tests.
But I have noticed a very interesting test, "channel name too long":

SELECT
pg_notify('notify_async_channel_name_too_long__','sample_message1');
ERROR:  channel name too long

But the behaviour is inconsistent with the NOTIFY command:

NOTIFY notify_async_channel_name_too_long__
NOTICE:  identifier "notify_async_channel_name_too_long__" will be truncated to ...

I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
>
Well, I believe that we can forbid overly long channel names altogether. That
provides consistent behaviour among the different ways we can send
notifications, and I agree with you that "there won't be possible to notify
it anyway". I created a patch for that and attached it to this email. In the
patch I relocated truncation from the core scanner to the parser. And since
the same core scanner is also used in PL/pgSQL, I added three lines of code
to its scanner to truncate overly long identifiers there. Here is an
example of the new behaviour:

-- Should fail. Too long channel names
NOTIFY notify_async_channel_name_too_long_._;
ERROR:  channel name too long
LISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
UNLISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
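
For reference, NOTIFY already rejects such names with a hard error in
async.c; a minimal sketch of the same kind of length check, which the patch
extends to the LISTEN/UNLISTEN path (the exact errcode is an assumption):

if (strlen(channel) >= NAMEDATALEN)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("channel name too long")));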

Regards,
Alexander Cheshev


On Sun, 21 Jul 2024 at 21:36, Emanuel Calvo <3man...@gmail.com> wrote:

>
> Hi Alexander,
>
> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>
> When the outer level exceeds the 64 characters limitation, the outer level
> of the
> channel name is truncated, but leaves the inner levels in the channel name
> due
> that isn't parsed in the same way.
>
> Also, even if the outer level isn't truncated, it is allowed to add
> channels names
> that exceeds the allowed identifier size.
>
> It can be reproduced just by:
>
>   # LISTEN a.a.a.a.a.lot.of.levels..; -- this doesn't fail at LISTEN,
> but fails in NOTIFY due to channel name too long
>
> In the following, the outer level is truncated, but it doesn't cut out the
> inner levels. This leaves
> listening channels that cannot receive any notifications in the queue:
>
>   # LISTEN
> notify_async_channel_name_too_long.a.a.
> ...
>   NOTICE: identifier  will be truncated
>
>   # select substring(c.channel,0,66), length(c.channel) from
> pg_listening_channels() c(channel) where length(c.channel) > 64;
>   substring | notify_async_channel_name_too_long_.a
>   length    | 1393
>
>
> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
> Another observation, probably not strictly related to this patch itself
> but the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason on why isn't in the specs?
>
>
> Regards,
>
>
> El lun, 15 jul 2024 a las 12:59, Alexander Cheshev (<
> alex.ches...@gmail.com>) escribió:
>
>> Hi Emanuel,
>>
>> Changed implementation of the function Exec_UnlistenCommit. v2 of the
>> patch contained a bug in the function Exec_UnlistenCommit (added a test case
>> for that) 

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-07-28 Thread Junwang Zhao
Hi Sutou,

On Wed, Jul 24, 2024 at 4:31 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In <9172d4eb-6de0-4c6d-beab-8210b7a22...@enterprisedb.com>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Mon, 22 Jul 2024 14:36:40 +0200,
>   Tomas Vondra  wrote:
>
> > Thanks for the summary/responses. I still think it'd be better to post a
> > summary as a separate message, not as yet another post responding to
> > someone else. If I was reading the thread, I would not have noticed this
> > is meant to be a summary. I'd even consider putting a "THREAD SUMMARY"
> > title on the first line, or something like that. Up to you, of course.
>
> It makes sense. I'll do it as a separate e-mail.
>
> > My suggestions would be to maintain this as a series of patches, making
> > incremental changes, with the "more complex" or "more experimental"
> > parts larger in the series. For example, I can imagine doing this:
> >
> > 0001 - minimal version of the patch (e.g. current v17)
> > 0002 - switch existing formats to the new interface
> > 0003 - extend the interface to add bits needed for columnar formats
> > 0004 - add DML to create/alter/drop custom implementations
> > 0005 - minimal patch with extension adding support for Arrow
> >
> > Or something like that. The idea is that we still have a coherent story
> > of what we're trying to do, and can discuss the incremental changes
> > (easier than looking at a large patch). It's even possible to commit
> > earlier parts before the later parts are quite cleanup up for commit.
> > And some changes may not even be meant for commit (e.g. the
> > extension) but as guidance / validation for the earlier parts.
>
> OK. I attach the v18 patch set:
>
> 0001: add a basic feature (Copy{From,To}Routine)
>   (same as the v17 but it's based on the current master)
> 0002: use Copy{From,To}Routine for the existing formats
>   (this may not be committed because there is a
>   profiling related concern)
> 0003: add support for specifying custom format by "COPY
>   ... WITH (format 'my-format')"
>   (this also has a test)
> 0004: export Copy{From,To}StateData
>   (but this isn't enough to implement custom COPY
>   FROM/TO handlers as an extension)
> 0005: add opaque member to Copy{From,To}StateData and export
>   some functions to read the next data and flush the buffer
>   (we can implement a PoC Apache Arrow COPY FROM/TO
>   handler as an extension with this)
>
> https://github.com/kou/pg-copy-arrow is a PoC Apache Arrow
> COPY FROM/TO handler as an extension.
>
>
> Notes:
>
> * 0002: We use "static inline" and "constant argument" for
>   optimization.
> * 0002: This hides NextCopyFromRawFields() in a public
>   header because it's not used in PostgreSQL and we want to
>   use "static inline" for it. If it's a problem, we can keep
>   it and create an internal function for "static inline".
> * 0003: We use "CREATE FUNCTION" to register a custom COPY
>   FROM/TO handler. It's the same approach as tablesample.
> * 0004 and 0005: We could mix them but this patch set splits
>   them for ease of review. 0004 just moves the existing
>   code. It doesn't change the existing code.
> * PoC: I provide it as a separate repository instead of a
>   patch because an extension exists as a separate project
>   in general. If it's a problem, I can provide it as a patch
>   for contrib/.
> * This patch set still has minimal Copy{From,To}Routine. For
>   example, custom COPY FROM/TO handlers can't process their
>   own options with this patch set. We may add more callbacks
>   to Copy{From,To}Routine later based on real world use-cases.
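
To make the tablesample analogy in those notes concrete, registering and
using a custom format might look roughly like this; the copy_handler
pseudo-type and the handler name are assumptions based on the patch
description, not the final API:

-- Hypothetical registration, mirroring the tablesample-handler approach:
CREATE FUNCTION my_format(internal)
    RETURNS copy_handler
    AS 'MODULE_PATHNAME', 'my_format'
    LANGUAGE C STRICT;

COPY my_table TO '/dev/null' WITH (format 'my_format');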
>
> > Unfortunately, there's not much information about what exactly the tests
> > did, context (hardware, ...). So I don't know, really. But if you share
> > enough information on how to reproduce this, I'm willing to take a look
> > and investigate.
>
> Thanks. Here is related information based on the past
> e-mails from Michael:
>
> * Use -O2 for optimization build flag
>   ("meson setup --buildtype=release" may be used)
> * Use tmpfs for PGDATA
> * Disable fsync
> * Run on scissors (what is "scissors" in this context...?)
>   
> https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c
> * Unlogged table may be used
> * Use a table that has 30 integer columns (*1)
> * Use 5M rows (*2)
> * Use '/dev/null' for COPY TO (*3)
> * Use blackhole_am for COPY FROM (*4)
>   https://gi

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-28 Thread Masahiko Sawada
On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman
 wrote:
>
> On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada  wrote:
> >
> > +   CREATE TABLE ${table1}(col1 int)
> > +   WITH (autovacuum_enabled=false, fillfactor=10);
> > +   INSERT INTO $table1 VALUES(7);
> > +   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> > +   CREATE INDEX on ${table1}(col1);
> > +   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> > +   INSERT INTO $table1 VALUES(7);
> >
> > These queries make sense to me; these make the radix tree wide and use
> > more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
> > The $table1 has 18182 blocks and the statistics of radix tree shows:
> >
> > max_val = 65535
> > num_keys = 18182
> > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
> >
> > Which means that the height of the tree is 2 and we use the maximum
> > size node for all nodes except for 1 node.
>
> Do you have some kind of tool that prints this out for you? That would
> be really handy.

You can add '#define RT_DEBUG' for the radix tree used in TidStore and
then call RT_STATS (e.g., local_ts_stats()).
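
Roughly, that is (a sketch; the define must be visible where tidstore.c
instantiates the radix tree template):

/* In tidstore.c, before the template include: */
#define RT_DEBUG
#define RT_PREFIX local_ts
/* ... the other RT_* defines ... */
#include "lib/radixtree.h"

/* Then, wherever a local TidStore's tree is in scope: */
local_ts_stats(tree);  /* prints max_val, num_keys, height, n4/n16/n32/n64/n256 */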

>
> > I don't have any great idea to substantially reduce the total number
> > of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> > to make garbage tuples (although I'm not sure it's okay for this
> > test). Which reduces the amount of WAL records from 11MB to 4MB and
> > would reduce the time to catch up. But I'm not sure how much it would
> > help. There might be ideas to trigger a two-round index vacuum with
> > fewer tuples but if the tests are too optimized for the current
> > TidStore, we will have to re-adjust them if the TidStore changes in
> > the future. So I think it's better and reliable to allow
> > maintenance_work_mem to be a lower value or use injection points
> > somehow.
>
> I think we can make improvements in overall time on master and 17 with
> the examples John provided later in the thread. However, I realized
> you are right about using a DELETE instead of an UPDATE. At some point
> in my development, I needed the UPDATE to satisfy some other aspect of
> the test. But that is no longer true. A DELETE works just as well as
> an UPDATE WRT the dead items and, as you point out, much less WAL is
> created and replay is much faster.
>
> I also realized I forgot to add 043_vacuum_horizon_floor.pl to
> src/test/recovery/meson.build in 16. I will post a patch here this
> weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
> test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
> to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
> to save myself any weekend buildfarm embarrassment.
>
> As for 17 and master, I'm going to try out John's examples and see if
> it seems like it will be fast enough to commit to 17/master without
> lowering the maintenance_work_mem lower bound.

+1. Thanks.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_attribute.atttypmod for interval type

2024-07-28 Thread Chapman Flack
On 07/27/24 00:32, Tom Lane wrote:
> Interval typmods include a fractional-seconds-precision field as well
> as a bitmask indicating the allowed interval fields (per the SQL
> standard's weird syntax such as INTERVAL DAY TO SECOND).  Looking at
> the source code for intervaltypmodout() might be helpful:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/timestamp.c;h=69fe7860ede062fc8be42e7545b35e69c3e068c4;hb=HEAD#l1136

Also, for this kind of question, an overview of a type modifier's
contents can be found in the javadoc for the WIP PL/Java 1.7, which is
intended to model such things accurately.[0]

The model is aimed at the logical level, that is, to represent
what information is present in the typmod, the precise semantics, what
combinations are allowed/disallowed, and so on, but not the way PostgreSQL
physically packs the bits. So, for this case, what you would find there
is essentially what Tom already said, about what's logically present; it
doesn't save you the effort of looking in the PostgreSQL source if you want
to independently implement unpacking the bits.
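
For instance, without unpacking any bits, format_type() will render an
interval typmod back into the standard syntax:

CREATE TABLE t (d interval day to second(3));

SELECT atttypmod, format_type(atttypid, atttypmod)
  FROM pg_attribute
 WHERE attrelid = 't'::regclass AND attname = 'd';
-- atttypmod comes back as an opaque packed integer, while format_type()
-- renders it as "interval day to second(3)"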

For possible future typmod questions, it may serve as a quick way to
get that kind of logical-level description at moments when Tom is away
from the keyboard.

Regards,
-Chap


[0]
https://tada.github.io/pljava/preview1.7/pljava-api/apidocs/org.postgresql.pljava/org/postgresql/pljava/adt/Timespan.Interval.Modifier.html

I just noticed a nit in that javadoc: it says the field combination
must be "one of the named constants in this interface" but you don't
find them in the Interval.Modifier interface; they're in the containing
interface Interval itself.




Re: Pluggable cumulative statistics

2024-07-28 Thread Michael Paquier
On Sat, Jul 27, 2024 at 03:49:42PM +0200, Dmitry Dolgov wrote:
> Agree, looks good. I've tried to quickly sketch out such a fixed
> statistic for another extension, and everything was fine and pretty
> straightforward.

That's my hope.  Thanks a lot for the feedback.

> One question, why don't you use
> pgstat_get_custom_shmem_data & pgstat_get_custom_snapshot_data outside
> of the injection points code? There seems to be a couple of possible
> places in pgstats itself.

Because these two helper routines are only able to fetch the fixed
data area in the snapshot and the control shmem structures for the
custom kinds, not the in-core ones.  We could, but the current code is
OK as well.  My point was just to ease the pluggability effort.
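
For example, an extension with a fixed-numbered custom kind might fetch its
shared area like this (a sketch; the kind ID and struct are hypothetical,
assuming the helper takes the custom kind ID as described above):

/* Hypothetical custom stats kind; the ID and struct are illustrative only. */
#define PGSTAT_KIND_MY_EXT  130

typedef struct MyExtStats
{
    PgStat_Counter calls;
} MyExtStats;

static void
my_ext_count_call(void)
{
    MyExtStats *stats =
        (MyExtStats *) pgstat_get_custom_shmem_data(PGSTAT_KIND_MY_EXT);

    stats->calls++;     /* locking elided for brevity */
}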

I would like to apply this new infrastructure stuff and move on to the
problems related to the scalability of pg_stat_statements.  So, are
there any objections to all that?
--
Michael




Re: Fix overflow in pg_size_pretty

2024-07-28 Thread David Rowley
On Sun, 28 Jul 2024 at 16:30, Joseph Koshakow  wrote:
> Attached is an updated patch with your approach. I removed the 0 from
> the negative case because I think it was unnecessary, but happy to add
> it back in if I missed something.

I made a few adjustments and pushed this.  I did keep the 0 - part as
some compilers don't seem to like not having the 0, e.g. MSVC gives:

../src/backend/utils/adt/dbsize.c(578): warning C4146: unary minus
operator applied to unsigned type, result still unsigned
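
For anyone following along, the shape of this kind of fix is the usual
overflow-safe absolute value (a sketch; the helper name is mine):

#include <stdint.h>

/* Negating INT64_MIN as a signed value is undefined behaviour, so negate
 * in unsigned arithmetic; the "0 -" form sidesteps MSVC warning C4146
 * about unary minus applied to an unsigned type. */
static uint64_t
u64_abs(int64_t v)
{
    return (v < 0) ? 0 - (uint64_t) v : (uint64_t) v;
}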

I thought a bit about if it was really worth keeping the regression
test or not and in the end decided it was likely worthwhile keeping
it, so I expanded it slightly to cover both PG_INT64_MIN and
PG_INT64_MAX values. It looks slightly less like we're earmarking the
fact that there was a bug that way, and also seems to be of some
additional value.

PG15 did see quite a significant rewrite of the pg_size_pretty code.
The bug does still exist in PG14 and earlier, but on looking at what
it would take to fix it there I got a bit unexcited at the risk to
reward ratio of adjusting that code and just left it alone. I've
backpatched only as far as PG15. I'm sure someone else will feel I
should have done something else there, but that's the judgement call I
made.

Thanks for the patch.

David




Re: Protocol question regarding Portal vs Cursor

2024-07-28 Thread Dave Cramer
On Sat, 27 Jul 2024 at 19:06, Tatsuo Ishii  wrote:

> > Yes, sorry, I should have said one cannot create a WITH HOLD portal
> > using the BIND command.
>
> Ok.
>
> It would be possible to add a new parameter to the BIND command to
> create such a portal. But it needs some changes to the existing
> protocol definition and requires a protocol version bump, which is a
> major pain.
>

I'm trying to add WITH HOLD to the JDBC driver, and currently I would have to:
1) rewrite the query
2) issue a new query ... declare .. and bind variables to that statement
3) execute fetch

vs

1) bind variables to the statement
2) execute fetch

The second can be done much lower in the code.
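
Concretely, option 1 means the driver must rewrite the statement and emit
something like this on the wire (cursor name hypothetical):

-- Rewrite of "SELECT * FROM t WHERE id = $1" into a holdable cursor:
DECLARE jdbc_c1 CURSOR WITH HOLD FOR SELECT * FROM t WHERE id = $1;
FETCH FORWARD 50 FROM jdbc_c1;
CLOSE jdbc_c1;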

However, as you mentioned, this would require a new protocol version, which
is unlikely to happen.

Dave


Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alena Rybakina

On 27.07.2024 13:56, Alexander Korotkov wrote:

On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
  wrote:

To be honest, I have found a big problem in this patch - we try to perform the
transformation every time we examine a column:

for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
    ...
}

I have fixed it and moved the transformation before going through the loop.

What makes you think there is a problem?


To be honest, I was bothered by the fact that we need to go through
expressions several times even though they obviously will not match under
the other conditions.
Just yesterday I thought it would be worthwhile to create a list of
candidates: expressions that did not fit because the column did not
match the index (!match_index_to_operand(nconst_expr, indexcol, index)).


Another problem, related to the first one, is that the boolexpr could
contain expressions referring to different operands, for example both x
and y. That is, the optimal "ANY" expression may not be used, because the
expression with x may come earlier and the loop may end before reaching
the others.


alena@postgres=# create table b (x int, y int);
CREATE TABLE
alena@postgres=# insert into b select id, id from generate_series(1,1000) as id;
INSERT 0 1000
alena@postgres=# create index x_idx on b(x);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9 or y=1;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9) OR (y = 1))
(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
   Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
(2 rows)

Furthermore, expressions can be stored in a different order.
For example, an "AND" expr may come first, followed by a group of "OR"
exprs that we could convert to an "ANY" expr, but we won't do so because
we exit the loop early, according to this condition:

if (!IsA(sub_rinfo->clause, OpExpr))
    return NULL;

or it may occur due to other conditions.

alena@postgres=# create index x_y_idx on b(x,y);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                             QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
   Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)

For these reasons, I tried to keep both transformations together for each
column and to analyze for each expr separately which method would be
optimal.



Do you have a test case
illustrating a slow planning time?

No, I didn't have time to measure it, sorry about that. I'll do it.

When v27 performs the transformation for a particular column, it simply
stops at the first unmatched OR entry.  So,
match_orclause_to_indexcol() examines just the first OR entry for all
the columns except at most one.  So, the check
match_orclause_to_indexcol() does is not much slower than what the other
match_*_to_indexcol() functions do.

I actually think this could help performance in many cases, not hurt
it.  At least we get rid of O(n^2) complexity over the number of OR
entries, which could be very many.


I agree with you that there is an overhead and your patch fixes this
problem, but the optimizer needs a good ordering of expressions in order
to apply them.


I think we can try to move the transformation to another place where
there is already a loop pass, save the two options ("OR" expr and
"ANY" expr) in one place (through BoolExpr, like the find_duplicate_ors
function does), and teach the optimizer to determine which option is
better, for example as match_orclause_to_indexcol() does now.


What do you think about it?

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: improve ssl error code, 2147483650

2024-07-28 Thread Peter Eisentraut

On 25.07.24 11:36, Daniel Gustafsson wrote:

On 24 Jul 2024, at 15:32, Peter Eisentraut  wrote:

On 25.06.24 16:21, Tom Lane wrote:

Peter Eisentraut  writes:

On 21.06.24 16:53, Tom Lane wrote:

Most of libpq gets at strerror_r via SOCK_STRERROR for Windows
portability.  Is that relevant here?

Looking inside the OpenSSL code, it makes no effort to translate
between winsock error codes and standard error codes, so I don't think
our workaround/replacement code needs to do that either.

Fair enough.

Btw., our source code comments say something like
"ERR_reason_error_string randomly refuses to map system errno values."
The reason it doesn't is exactly that it can't do it while maintaining
thread-safety.

Ah.  Do you want to improve that comment while you're at it?


Here is a patch that fixes the strerror() call and updates the comments a bit.


LGTM.


This ought to be backpatched like the original fix; ideally for the next minor 
releases in about two weeks.


Agreed.


done




