Re: psql: Add leakproof field to \dAo+ meta-command results

2024-07-29 Thread Erik Wienhold
On 2024-07-01 15:08 +0200, Yugo NAGATA wrote:
> I would like to propose adding a new field to psql's \dAo+ meta-command
> to show whether the underlying function of an operator is leak-proof.

+1 for making that info easily accessible.

> This idea is inspired by [1], which claims some indexes use non-LEAKPROOF
> functions under the associated operators; as a result, they cannot be selected
> for queries with security_barrier views or row-level security policies.
> The original proposal was to add a query over system catalogs for looking up
> non-leakproof operators to the documentation, but I thought it more useful
> to improve the \dAo results rather than putting such a query in the docs.
> 
> The attached patch adds the field to \dAo+ and also a description that
> explains the relation between indexes and security quals, referencing the
> \dAo+ meta-command.
> 
> [1] 
> https://www.postgresql.org/message-id/raw/5af3bf0c-5e0c-4128-81dc-084c5258b1af%40code406.com

\dAo+ output looks good.

But this patch fails regression tests in src/test/regress/sql/psql.sql
(\dAo+ btree float_ops) because of the new leak-proof column.  I think
this could even be changed to "\dAo+ btree array_ops|float_ops" to also
cover operators that are not leak-proof.

+
+For example, an index scan can not be selected for queries with

I checked the docs and "cannot" is more commonly used than "can not".

+security_barrier views or row-level security policies if an
+operator used in the WHERE clause is associated with the
+operator family of the index, but its underlying function is not marked
+LEAKPROOF. The psql program's
+\dAo+ meta-command is useful for listing the operators
+with associated operator families and whether it is leak-proof.
+

I think the last sentence can be improved.  How about: "Use psql's \dAo+
command to list operator families and tell which of their operators are
marked as leak-proof."?  Should something similar be added to [1] which
also talks about leak-proof operators?

The rest is just formatting nitpicks:

+ ", ofs.opfname AS \"%s\"\n,"

The trailing comma should come before the newline.

+ "  CASE\n"
+ "   WHEN p.proleakproof THEN '%s'\n"
+ "   ELSE '%s'\n"
+ " END AS \"%s\"\n",

WHEN/ELSE/END should be indented with one additional space to be
consistent with the other CASE expressions in this query.
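
Taken together, the fixed fragment would then look something like:

  ", ofs.opfname AS \"%s\",\n"
  "  CASE\n"
  "    WHEN p.proleakproof THEN '%s'\n"
  "    ELSE '%s'\n"
  "  END AS \"%s\"\n",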

[1] https://www.postgresql.org/docs/devel/planner-stats-security.html

-- 
Erik




Re: Remove last traces of HPPA support

2024-07-29 Thread Heikki Linnakangas

On 30/07/2024 00:50, Thomas Munro wrote:

On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:

Thomas Munro  writes:

Here are some experimental patches to try out some ideas mentioned
upthread, that are approximately unlocked by that cleanup.


FWIW, I'm good with getting rid of --disable-spinlocks and
--disable-atomics.  That's a fair amount of code and needing to
support it causes problems, as you say.  I am very much less
excited about ripping out our spinlock and/or atomics code in favor
of <stdatomic.h>; I just don't see the gain there, and I do see risk
in ceding control of the semantics and performance of those
primitives.


OK, <stdatomic.h> part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].


Looks good to me.


I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_memory_barrier(), no?


Agreed, arch-x86.h is quite clear on that.
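
For reference, the mappings in question are roughly (paraphrasing
arch-x86.h from memory, so double-check the exact macro names):

    /*
     * x86 doesn't reorder loads with other loads, or stores with other
     * stores, so read/write barriers only need to stop the *compiler*
     * from reordering; no fence instruction is required.
     */
    #define pg_read_barrier_impl()   pg_compiler_barrier_impl()
    #define pg_write_barrier_impl()  pg_compiler_barrier_impl()

Only a full pg_memory_barrier() needs to generate a real instruction there.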


Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too? 


Aha, yes I think that's it. Apparently, __x86_64__ is not defined on 
MSVC. To prove that, I added garbage to the "#ifdef __x86_64__" guarded 
block in atomics.h. The compilation passes on MSVC, but not on other 
platforms: https://cirrus-ci.com/build/6310061188841472.


That means that we're not getting the x86-64 instructions in 
src/port/pg_crc32c_sse42.c on MSVC either.


I think we should do:

#ifdef _M_AMD64
#define __x86_64__
#endif

somewhere, perhaps in src/include/port/win32.h.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Restart pg_usleep when interrupted

2024-07-29 Thread Sami Imseih



> On Jul 26, 2024, at 3:27 AM, Bertrand Drouvot  
> wrote:
> 
> Hi,
> 
> On Thu, Jul 25, 2024 at 05:27:15PM -0500, Sami Imseih wrote:
>> I am attaching v3 of the patch which addresses the comments made
>> earlier by Bertrand about the comment in the patch [1].
> 
> Thanks!
> 
> Looking at it:
> 
> 1 ===
> 
> +   struct instr_time start_time;
> 
> I think we can get rid of the "struct" keyword here.
> 
> 2 ===
> 
> +   struct instr_time current_time;
> +   struct instr_time elapsed_time;
> 
> Same as above.

Will fix those 2.

> 
> 3 ===
> 
> I gave more thoughts and I think it can be simplified a bit to reduce the
> number of operations in the while loop.
> 
> What about relying on a "absolute" time that way:
> 
>    instr_time absolute;
>    absolute.ticks = start_time.ticks + msec * 100;
> 
> and then in the while loop:
> 
>    while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
>    {
>        instr_time current_time;
>        INSTR_TIME_SET_CURRENT(current_time);
> 
>        if (current_time.ticks > absolute.ticks)
>        {
>            break;

While I agree this code is cleaner, my hesitation is that we don't
have any other place in which we access .ticks directly, and the
common practice is to use the instr_time.h APIs.
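
Something that stays within those APIs would look roughly like this (just
a sketch of the shape, not the patch itself):

    void
    pg_usleep(long microsec)
    {
        struct timespec delay;
        struct timespec remain;
        instr_time  start_time;

        delay.tv_sec = microsec / 1000000L;
        delay.tv_nsec = (microsec % 1000000L) * 1000;

        INSTR_TIME_SET_CURRENT(start_time);

        while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
        {
            instr_time  elapsed;

            INSTR_TIME_SET_CURRENT(elapsed);
            INSTR_TIME_SUBTRACT(elapsed, start_time);

            /* stop once the originally requested time has elapsed */
            if (INSTR_TIME_GET_NANOSEC(elapsed) >= (int64) microsec * 1000)
                break;

            /* otherwise continue with what nanosleep() says remains */
            delay = remain;
        }
    }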


What do you think?


Regards,

Sami 






Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 1:51 PM Daniel Gustafsson  wrote:
> Together with a colleague we found the Azure provider uses "verification_url"
> rather than xxx_uri.

Yeah, I think that's originally a Google-ism. (As far as I can tell
they helped author the spec for this and then didn't follow it. :/ ) I
didn't recall Azure having used it back when I was testing against it,
though, so that's good to know.

> Another discrepancy is that it uses a string for the
> interval (ie: "interval":"5").

Oh, that's a new one. I don't remember needing to hack around that
either; maybe iddawc handled it silently?

> One can of course argue that Azure is wrong and
> should feel bad, but I fear that virtually all (major) providers will have
> differences like this, so we will have to deal with it in an extensible 
> fashion
> (compile time, not runtime configurable).

Such is life... verification_url we will just have to deal with by
default, I think, since Google does/did it too. Not sure about
interval -- but do we want to make our distribution maintainers deal
with a compile-time setting for libpq, just to support various OAuth
flavors? To me it seems like we should just hold our noses and support
known (large) departures in the core.

> I was toying with making the json_field name member an array, to allow
> variations.  That won't help with the field type differences though, so another
> train of thought was to have some form of REQUIRED_XOR where fields can be tied
> together.  What do you think about something along these lines?

If I designed it right, just adding alternative spellings directly to
the fields list should work. (The "required" check is by struct
member, not name, so both spellings can point to the same
destination.) The alternative typing on the other hand might require
something like a new sentinel "type" that will accept both... I hadn't
expected that.
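
To illustrate with the device authz fields (same reconstruction as the
snippet you quoted; treat it as a sketch rather than the final hunk):

  {"verification_uri", JSON_TOKEN_STRING, {&authz->verification_uri}, REQUIRED},
  {"verification_url", JSON_TOKEN_STRING, {&authz->verification_uri}, REQUIRED},

Both names target the same struct member, and since the REQUIRED bookkeeping
is keyed on the member rather than the JSON name, whichever spelling the
provider sends satisfies it.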

> Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
> even if we don't use them to ensure that the JSON is wellformed?  If the JSON
> we get is malformed in any way it seems like the safe/conservative option to
> error out.

Good, I was hoping to have a conversation about that. I am fine with
either option in principle. In practice I expect to add code to use
`expires_in` (so that we can pass it to custom OAuth hook
implementations) and `scope` (to check if the server has changed it on
us).

That leaves the provider... Forcing the provider itself to implement
unused stuff in order to interoperate seems like it could backfire on
us, especially since IETF standardized an alternate .well-known URI
[1] that changes some of these REQUIRED things into OPTIONAL. (One way
for us to interpret this: those fields may be required for OpenID, but
your OAuth provider might not be an OpenID provider, and our code
doesn't require OpenID.) I think we should probably tread lightly in
that particular case. Thoughts on that?

Thanks!
--Jacob

[1] https://www.rfc-editor.org/rfc/rfc8414.html




Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Masahiko Sawada
Hi,

On Mon, Jul 29, 2024 at 12:12 AM Anthonin Bonnefoy
 wrote:
>
> On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  
> wrote:
> > An alternative idea would
> > be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> > write its messages there. This is somewhat similar to what we do in
> > the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> > much but I think it could be better than writing logs in the single
> > format.
>

I have one comment on 0001 patch:

 /*
  * Calculate the difference in the Page Hit/Miss/Dirty that
  * happened as part of the analyze by subtracting out the
  * pre-analyze values which we saved above.
  */
-AnalyzePageHit = VacuumPageHit - AnalyzePageHit;
-AnalyzePageMiss = VacuumPageMiss - AnalyzePageMiss;
-AnalyzePageDirty = VacuumPageDirty - AnalyzePageDirty;
+memset(&bufferusage, 0, sizeof(BufferUsage));
+BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
+
+total_blks_hit = bufferusage.shared_blks_hit +
+    bufferusage.local_blks_hit;
+total_blks_read = bufferusage.shared_blks_read +
+    bufferusage.local_blks_read;
+total_blks_dirtied = bufferusage.shared_blks_dirtied +
+    bufferusage.local_blks_dirtied;

The comment should also be updated or removed.

And here are some comments on 0002 patch:

-   TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
+   delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

I think that this change is to make vacuum code consistent with
analyze code, particularly the following part:

/*
 * We do not expect an analyze to take > 25 days and it simplifies
 * things a bit to use TimestampDifferenceMilliseconds.
 */
delay_in_ms = TimestampDifferenceMilliseconds(starttime, endtime);

However, as the above comment says, delay_in_ms can only represent
durations of up to about 25 days. I guess that is not a problem for
analyze, but it could be for vacuum, since vacuum can sometimes run for a
very long time; I've seen vacuums run for almost a month. So I think we
should keep this part.
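
(For context: TimestampDifferenceMilliseconds() clamps its result to
INT_MAX milliseconds, which is where the ~25 days comes from:

    2147483647 ms / (1000 * 60 * 60 * 24) ~= 24.86 days

so a longer-running vacuum would simply report a capped duration.)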

---
 /* measure elapsed time iff autovacuum logging requires it */
-if (AmAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
+if (instrument)

The comment should also be updated.

---
Could you split the 0002 patch into two patches? One would have the
ANALYZE command (with the VERBOSE option) write the buffer usage, and the
second one would add WAL usage to both the ANALYZE command and
autoanalyze. I think adding WAL usage to ANALYZE could be
controversial, as it should not be a WAL-intensive operation, so I'd like
to keep them separate.


> I've tested this approach, it definitely looks better. I've added a
> logbuf StringInfo to AcquireSampleRowsFunc which will receive the
> logs. elevel was removed as it is not used anymore. Since everything
> is in the same log line, I've removed the relation name in the acquire
> sample functions.
>
> For partitioned tables, I've also added which child table is currently
> being sampled. The output will look like:
>
> INFO:  analyze of table "postgres.public.test_partition"
> Sampling rows from child "public.test_partition_1"
> pages: 5 of 5 scanned
> tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
> estimated total tuples
> Sampling rows from child "public.test_partition_2"
> pages: 5 of 5 scanned
> tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
> estimated total tuples
> avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
> ...
>
> For a file_fdw, the output will be:
>
> INFO:  analyze of table "postgres.public.pglog"
> tuples: 60043 tuples; 3 tuples in sample
> avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
> ...

Thank you for updating the patch. With your patch, I got the following
logs for when executing ANALYZE VERBOSE on a partitioned table:

postgres(1:3971560)=# analyze (verbose) p;
INFO:  analyzing "public.p" inheritance tree
INFO:  analyze of table "postgres.public.p"
Sampling rows from child "public.c1"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
Sampling rows from child "public.c2"
pages: 1 of 14750 scanned
tuples: 226 live tuples, 0 are dead; 1 tuples in sample,
500 estimated total tuples
Sampling rows from child "public.c3"
pages: 1 of 14750 scanned
tuples: 2259833 live tuples, 0 are dead; 1 tuples in sample,
254 estimated total tuples
avg read rate: 335.184 MB/s, avg write rate: 0.031 MB/s
buffer usage: 8249 hits, 21795 reads, 2 dirtied
WAL usage: 6 records, 1 

Re: Incremental View Maintenance, take 2

2024-07-29 Thread Kirill Reshke
On Sat, 27 Jul 2024 at 13:26, Kirill Reshke  wrote:
>
> Hi!
> Cloudberry DB (a Greenplum fork) uses the IMMV feature for its AQUMV (auto
> query use matview) feature, so I got interested in how it is implemented.
>
> On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA  wrote:
> >
> > I updated the patch to bump up the version numbers in psql and pg_dump codes
> > from 17 to 18.
>
> Few suggestions:
>
> 1) The `Add-relisivm-column-to-pg_class-system-catalog` commit message
> should be fixed; there is "isimmv" in the last line.
> 2) I don't get why `Add-Incremental-View-Maintenance-support.patch`
> goes after 0005 & 0004. Shouldn't we first implement the feature
> server-side, and only then the client side (psql & pg_dump)?
> 3) Can we provide regression tests for each function separately? Test
> for the main feature in the main patch, test for DISTINCT support in
> v34-0007-Add-DISTINCT-support-for-IVM.patch, etc.? This way the patchset
> will be easier to review, and can be committed separately.
> 4) v34-0006-Add-Incremental-View-Maintenance-support.patch no longer
> applies due to 4b74ebf726d444ba820830cad986a1f92f724649. After
> resolving issues manually, it does not compile, because
> 4b74ebf726d444ba820830cad986a1f92f724649 also removes
> save_userid/save_sec_context fields from ExecCreateTableAs.
>
> >   if (RelationIsIVM(matviewRel) && stmt->skipData)
> Now this function accepts skipData param.
>
> 5) For DISTINCT support patch uses hidden __ivm* columns. Is this
> design discussed anywhere? I wonder if this is a necessity (only
> solution) or if there are alternatives.
> 6) What are the caveats of supporting some simple cases for aggregate
> functions, like in this example?
> ```
> regress=# CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm_2 AS SELECT
> sum(j) + sum(i) from mv_base_a;
> ERROR:  expression containing an aggregate in it is not supported on
> incrementally maintainable materialized view
> ```
> I can see some difficulties with division, e.g. CREATE IMMV AS SELECT
> 1/sum(i) from mv_base_a; (the sum(i) == 0 case), but addition &
> multiplication should be OK, shouldn't they?
>
>
> Overall, the patchset looks mature; however, it is far from being
> committable due to lack of testing/feedback/discussion. There is only
> one way to fix this... Test and discuss it!
>
>
> [1] https://github.com/cloudberrydb/cloudberrydb

Hi! Small update: I tried to run the regression tests and all
IMMV-related tests failed on my VM. Maybe I'm doing something wrong; I
will try to investigate.

Another suggestion: support for the \d and \d+ commands in psql. With the
v34 patchset applied, psql does not show anything IMMV-related in \d output.

```
reshke=# \d m1
   Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 i  | integer |   |  |
Distributed by: (i)


reshke=# \d+ m1
 Materialized view "public.m1"
 Column |  Type   | Collation | Nullable | Default | Storage |
Compression | Stats target | Description
+-+---+--+-+-+-+--+-
 i  | integer |   |  | | plain   |
|  |
View definition:
 SELECT t1.i
   FROM t1;
Distributed by: (i)
Access method: heap

```

Output should be 'Incrementally materialized view "public.m1"' IMO.
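
In describe.c that would presumably be something like this (hypothetical
sketch; I'm guessing at the tableinfo member the patchset exposes for
relisivm):

    if (tableinfo.relkind == RELKIND_MATVIEW && tableinfo.isivm)
        printfPQExpBuffer(&title,
                          _("Incrementally materialized view \"%s.%s\""),
                          schemaname, relationname);
    else if (tableinfo.relkind == RELKIND_MATVIEW)
        printfPQExpBuffer(&title, _("Materialized view \"%s.%s\""),
                          schemaname, relationname);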




Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Tom Lane
Dean Rasheed  writes:
> On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
>> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
>> the benefits of simpler code.

> Looking at that other thread that you found [1], I think it's entirely
> possible that there are people who care about 32-bit systems, which
> means that we might well get complaints, if we make it slower for
> them. Unfortunately, I don't have any way to test that (I doubt that
> running a 32-bit executable on my x86-64 system is a realistic test).

I think we've already done things that might impact 32-bit systems
negatively (5e1f3b9eb for instance), and not heard a lot of pushback.
I would argue that anyone still running PG on 32-bit must have pretty
minimal performance requirements, so that they're unlikely to care if
numeric_mul gets slightly faster or slower.  Obviously a *big*
performance drop might get pushback.

regards, tom lane




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 5:02 AM Peter Eisentraut  wrote:
> We should take the check for exit() calls from libpq and expand it to
> cover the other libraries as well.  Maybe there are other problems like
> this?

Seems reasonable, yeah.

> But under what circumstances does "the linker doesn't strip out" happen?
>   If this happens accidentally, then we should have seen some buildfarm
> failures or something?

On my machine, for example, I see differences with optimization
levels. Say you inadvertently call pfree() in a _shlib build, as I did
multiple times upthread. By itself, that shouldn't actually be a
problem (it eventually redirects to free()), so it should be legal to
call pfree(), and with -O2 the build succeeds. But with -Og, the
exit() check trips, and when I disassemble I see that pg_malloc() et
al. have infected the shared object. After all, we did tell the linker
to put that object file in, and we don't ask it to garbage-collect
sections.

> Also, one could look further and notice that restricted_token.c and
> sprompt.c both a) are not needed by libpq and b) can trigger exit()
> calls.  Then it's not clear why those are not affected.

I think it's easier for the linker to omit whole object files rather
than partial ones. If libpq doesn't use any of those APIs there's not
really a reason to trip over it.

(Maybe the _shlib variants should just contain the minimum objects
required to compile.)

> I'm reminded of thread [0].  I think there is quite a bit of confusion
> about the pqexpbuffer vs. stringinfo APIs, and they are probably used
> incorrectly quite a bit.  There are now also programs that use both of
> them!  This patch now introduces another layer on top of them.  I fear,
> at the end, nobody is going to understand any of this anymore.

"anymore"? :)

In all seriousness -- I agree that this isn't sustainable. At the
moment the worst pain (the new layer) is isolated to jsonapi.c, which
seems like an okay place to try something new, since there aren't that
many clients. But to be honest I'm not excited about deciding the Best
Way Forward based on a sample size of JSON.

> Also,
> changing all the programs to link in libpq for pqexpbuffer seems like
> the opposite direction from what was suggested in [0].

(I don't really want to keep that new libpq dependency. We'd just have
to decide where PQExpBuffer is going to go if we're not okay with it.)

> I think we need to do some deeper thinking here about how we want the
> memory management on the client side to work.  Maybe we could just use
> one API but have some flags or callbacks to control the out-of-memory
> behavior.

Any src/common code that needs to handle both in-band and out-of-band
failure modes will still have to decide whether it's going to 1)
duplicate code paths or 2) just act as if in-band failures can always
happen. I think that's probably essential complexity; an ideal API
might make it nicer to deal with but it can't abstract it away.
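
To illustrate the two failure styles as they exist today (handle_oom() is
a stand-in for whatever the caller would do):

    /* In-band: OOM is recorded in the buffer; the caller must check. */
    PQExpBuffer buf = createPQExpBuffer();

    appendPQExpBufferStr(buf, "hello");
    if (PQExpBufferBroken(buf))
        handle_oom();

    /* Out-of-band: the append can't fail from the caller's perspective;
     * on OOM the underlying allocator throws ERROR (backend) or exits
     * (frontend). */
    StringInfoData si;

    initStringInfo(&si);
    appendStringInfoString(&si, "hello");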

Thanks!
--Jacob




Re: ecdh support causes unnecessary roundtrips

2024-07-29 Thread Daniel Gustafsson
> On 17 Jun 2024, at 19:56, Andres Freund  wrote:
> On 2024-06-17 19:51:45 +0200, Daniel Gustafsson wrote:

>> Changing the default of the ecdh GUC would perhaps be doable?
> 
> I was wondering whether we could change the default so that it accepts both
> x25519 and secp256r1. Unfortunately that seems to require changing what we
> use to set the parameter...

Right.  The patch in https://commitfest.postgresql.org/48/5025/ does allow for
accepting both but that's a different discussion.

Changing, and backpatching, the default to at least keep new installations from
extra roundtrips doesn't seem that far off in terms of scope from what
860fe27ee1e2 backpatched.  Maybe it can be an option.

>> Amending the documentation is the one thing we certainly can do but 99.9% of
>> affected users won't know they are affected so won't look for that section.
> 
> Yea. It's also possible that some other bindings changed their default to
> match ours...

There is that possibility, though I think we would've heard something about
that by now if that had happened.

--
Daniel Gustafsson





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed
On Mon, 29 Jul 2024 at 21:39, Joel Jacobson  wrote:
>
> I like these simplifications, how `var2ndigits` is used instead of 
> `res_ndigits`:
> -   for (int i = res_ndigits - 3; i >= 1; i--)
> +   for (int i = var2ndigits - 1; i >= 1; i--)
>
> But I wonder why does `case 1:`  not follow the same pattern?
> for (int i = res_ndigits - 2; i >= 0; i--)
>

Ah yes, that should be made the same. (I think I did do that at one
point, but then accidentally reverted it during a code refactoring.)

> * v3-0002
>
> I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
> using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
> the benefits of simpler code.
>
> You brought up the question if 32-bit systems should be regarded
> as legacy previously in this thread.
>
> Unfortunately, we didn't get any feedback, so I'm starting a separate
> thread, with subject "Is fast 32-bit code still important?", hoping to get
> more input to help us make judgement calls.
>

Looking at that other thread that you found [1], I think it's entirely
possible that there are people who care about 32-bit systems, which
means that we might well get complaints, if we make it slower for
them. Unfortunately, I don't have any way to test that (I doubt that
running a 32-bit executable on my x86-64 system is a realistic test).

Regards,
Dean

[1] https://postgr.es/m/0a71b43129fb447988f152941e1db...@nidsa.net




Re: Remove last traces of HPPA support

2024-07-29 Thread Thomas Munro
On Wed, Jul 3, 2024 at 8:09 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Here are some experimental patches to try out some ideas mentioned
> > upthread, that are approximately unlocked by that cleanup.
>
> FWIW, I'm good with getting rid of --disable-spinlocks and
> --disable-atomics.  That's a fair amount of code and needing to
> support it causes problems, as you say.  I am very much less
> excited about ripping out our spinlock and/or atomics code in favor
of <stdatomic.h>; I just don't see the gain there, and I do see risk
> in ceding control of the semantics and performance of those
> primitives.

OK, <stdatomic.h> part on ice for now.  Here's an update of the rest,
this time also removing the barrier fallbacks as discussed in the LTO
thread[1].

I guess we should also consider reimplementing the spinlock on the
atomic API, but I can see that Andres is poking at spinlock code right
now so I'll keep out of his way...

Side issue: I noticed via CI failure when I tried to require
read/write barriers to be provided (a choice I backed out of), that on
MSVC we seem to be using the full memory barrier fallback for those.
Huh?  For x86, I think they should be using pg_compiler_barrier() (no
code gen, just prevent reordering), not pg_memory_barrier(), no?
Perhaps I'm missing something but I suspect we might be failing to
include arch-x86.h on that compiler when we should... maybe it needs
to detect _M_AMD64 too?  For ARM, from a quick look, the only way to
reach real acquire/release barriers seems to be to use the C11
interface (which would also be fine on x86 where it should degrade to
a no-op compiler barrier or signal fence as the standard calls it),
but IIRC the Windows/ARM basics haven't gone in yet anyway.

[1] 
https://www.postgresql.org/message-id/flat/721bf39a-ed8a-44b0-8b8e-be3bd81db748%40technowledgy.de#66ba381b05e8ee08b11503b846acc4a1
From 33533817949052f7af423aaee0ef6e737031effb Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 2 Jul 2024 14:47:59 +1200
Subject: [PATCH v2 1/4] Remove --disable-spinlocks.

A later change will require atomic support, so it wouldn't make sense
for a new system not to be able to implement true spinlocks.

Discussion: https://postgr.es/m/3351991.1697728588%40sss.pgh.pa.us
---
 configure   |  40 --
 configure.ac|  13 --
 doc/src/sgml/installation.sgml  |  37 +
 meson.build |   6 -
 src/backend/port/atomics.c  |  26 
 src/backend/port/posix_sema.c   |   3 +-
 src/backend/port/sysv_sema.c|   3 +-
 src/backend/postmaster/launch_backend.c |   8 --
 src/backend/storage/ipc/ipci.c  |  10 --
 src/backend/storage/lmgr/Makefile   |   1 -
 src/backend/storage/lmgr/meson.build|   1 -
 src/backend/storage/lmgr/s_lock.c   |   2 +-
 src/backend/storage/lmgr/spin.c | 180 
 src/include/pg_config.h.in  |   3 -
 src/include/pg_config_manual.h  |  15 --
 src/include/port/atomics.h  |   4 +-
 src/include/port/atomics/fallback.h |   4 +-
 src/include/storage/s_lock.h|  39 +
 src/include/storage/spin.h  |  18 +--
 src/test/regress/regress.c  |  86 ---
 20 files changed, 13 insertions(+), 486 deletions(-)
 delete mode 100644 src/backend/storage/lmgr/spin.c

diff --git a/configure b/configure
index ea5514fab1..f8deaa8d78 100755
--- a/configure
+++ b/configure
@@ -836,7 +836,6 @@ enable_integer_datetimes
 enable_nls
 with_pgport
 enable_rpath
-enable_spinlocks
 enable_atomics
 enable_debug
 enable_profiling
@@ -1529,7 +1528,6 @@ Optional Features:
   enable Native Language Support
   --disable-rpath do not embed shared library search path in
   executables
-  --disable-spinlocks do not use spinlocks
   --disable-atomics   do not use atomic operations
   --enable-debug  build with debugging symbols (-g)
   --enable-profiling  build with profiling enabled
@@ -3266,33 +3264,6 @@ fi
 
 
 
-#
-# Spinlocks
-#
-
-
-# Check whether --enable-spinlocks was given.
-if test "${enable_spinlocks+set}" = set; then :
-  enableval=$enable_spinlocks;
-  case $enableval in
-yes)
-  :
-  ;;
-no)
-  :
-  ;;
-*)
-  as_fn_error $? "no argument expected for --enable-spinlocks option" "$LINENO" 5
-  ;;
-  esac
-
-else
-  enable_spinlocks=yes
-
-fi
-
-
-
 #
 # Atomic operations
 #
@@ -12185,17 +12156,6 @@ fi
 
 fi
 
-if test "$enable_spinlocks" = yes; then
-
-$as_echo "#define HAVE_SPINLOCKS 1" >>confdefs.h
-
-else
-  { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
-*** Not using spinlocks will cause poor performance." >&5
-$as_echo "$as_me: WARNING:
-*** Not using spinlocks will cause poor performance." >&2;}
-fi
-
 if test "$enable_atomics" = yes; then
 
 $as_echo "#define HAVE_ATOMICS 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index 

Re: xid_wraparound tests intermittent failure.

2024-07-29 Thread Masahiko Sawada
On Sat, Jul 27, 2024 at 1:06 PM Andrew Dunstan  wrote:
>
>
> On 2024-07-26 Fr 1:46 PM, Masahiko Sawada wrote:
> > On Thu, Jul 25, 2024 at 6:52 PM Andrew Dunstan  wrote:
> >>
> >> On 2024-07-25 Th 3:40 PM, Masahiko Sawada wrote:
> >>
> >> On Thu, Jul 25, 2024 at 11:06 AM Masahiko Sawada  
> >> wrote:
> >>
> >> On Thu, Jul 25, 2024 at 10:56 AM Andrew Dunstan  
> >> wrote:
> >>
> >> On 2024-07-23 Tu 6:59 PM, Masahiko Sawada wrote:
> >>
> >> See 
> >> 
> >>
> >>
> >> The failure logs are from a run where both tests 1 and 2 failed.
> >>
> >> Thank you for sharing the logs.
> >>
> >> I think that the problem seems to match what Alexander Lakhin
> >> mentioned[1]. Probably we can fix such a race condition somehow but
> >> I'm not sure it's worth it as setting autovacuum = off and
> >> autovacuum_max_workers = 1 (or a low number) is an extremely rare
> >> case. I think it would be better to stabilize these tests. One idea is
> >> to turn the autovacuum GUC parameter on while setting
> >> autovacuum_enabled = off for each table. That way, we can ensure that
> >> autovacuum workers are launched. And I think it seems to align real
> >> use cases.
> >>
> >> Regards,
> >>
> >> [1] 
> >> https://www.postgresql.org/message-id/02373ec3-50c6-df5a-0d65-5b9b1c0c86d6%40gmail.com
> >>
> >>
> >> OK, do you want to propose a patch?
> >>
> >> Yes, I'll prepare and share it soon.
> >>
> >> I've attached the patch. Could you please test if the patch fixes the
> >> instability you observed?
> >>
> >> Since we turn off autovacuum on all three tests and we wait for
> >> autovacuum to complete processing databases, these tests potentially
> >> have a similar (but lower) risk. So I modified these tests to turn it
> >> on so we can ensure the autovacuum runs periodically.
> >>
> >>
> >> I assume you actually meant to remove the "autovacuum = off" in 
> >> 003_wraparound.pl. With that change in your patch I retried my test, but 
> >> on iteration 100 out of 100 it failed on test 002_limits.pl.
> >>
> > I think we need to remove the "autovacuum = off" also in 002_limits.pl
> > as it waits for autovacuum to process both template0 and template1
> > databases. Just to be clear, the failure happened even without
> > "autovacuum = off"?
> >
>
> The attached patch, a slight modification of yours, removes "autovacuum
> = off" for all three tests, and given that a set of 200 runs was clean
> for me.

Oh, I missed that I had left "autovacuum = off" in the 002 test for some
reason. Thank you for attaching the patch; it looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: speed up a logical replica setup

2024-07-29 Thread Euler Taveira
On Wed, Jul 17, 2024, at 11:37 PM, Amit Kapila wrote:
> I am thinking of transactions between restart_lsn and "consistent
> point lsn" (aka the point after which all commits will be replicated).
> Your conclusion seems correct to me that such transactions won't be
> replicated by streaming replication and would be skipped by logical
> replication. Now, if we can avoid that anyway, we can consider that.

Upon reflection, what I proposed [1] seems more complex and possibly more
error-prone than other available solutions. The recovery step is slow
if the server is idle (which is the case for the test). An idle server
usually doesn't write another WAL record after creating the replication
slots. Since pg_createsubscriber uses the LSN returned by the last
replication slot as recovery_target_lsn, this LSN is ahead of the
current WAL position, and recovery waits until something writes a
WAL record that reaches the target and ends the recovery.

Hayato already mentioned one of the solution in a previous email [2].
AFAICS we can use any solution that creates a WAL record. I tested the
following options: 

\timing
select * from pg_create_logical_replication_slot('pg_createsubscriber', 
'pgoutput', true);
select pg_logical_emit_message(false, 'pg_createsubscriber', 'dummy');
select pg_log_standby_snapshot();
select pg_create_restore_point('pg_createsubscriber');

that results in the following output:

      slot_name      |    lsn
---------------------+-----------
 pg_createsubscriber | 0/942DD28
(1 row)

Time: 200.571 ms
 pg_logical_emit_message
-------------------------
 0/942DD78
(1 row)

Time: 0.938 ms
 pg_log_standby_snapshot
-------------------------
 0/942DDB0
(1 row)

Time: 0.741 ms
 pg_create_restore_point
-------------------------
 0/942DE18
(1 row)

Time: 0.870 ms

and generates the following WAL records:

rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DCF0, prev 0/0942DCB8, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: LogicalMessage len (rec/tot): 75/75, tx: 0, lsn: 0/0942DD28, prev 0/0942DCF0, desc: MESSAGE non-transactional, prefix "pg_createsubscriber"; payload (5 bytes): 64 75 6D 6D 79
rmgr: Standby        len (rec/tot): 50/50, tx: 0, lsn: 0/0942DD78, prev 0/0942DD28, desc: RUNNING_XACTS nextXid 3939 latestCompletedXid 3938 oldestRunningXid 3939
rmgr: XLOG           len (rec/tot): 98/98, tx: 0, lsn: 0/0942DDB0, prev 0/0942DD78, desc: RESTORE_POINT pg_createsubscriber

The options are:

(a) temporary replication slot: requires an additional replication slot.
Small payload. Extremely slow in comparison with the other options.
(b) logical message: can be consumed by logical replication when/if that
is supported some day. Big payload. Fast.
(c) snapshot of running txn: small payload. Fast.
(d) named restore point: biggest payload. Fast.

I don't have a strong preference, but if I had to pick one I would
choose option (c) or option (d). Option (a) is out of the question.

Opinions?

[1] 
https://www.postgresql.org/message-id/b1f0f8c7-8f01-4950-af77-339df3dc4684%40app.fastmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Daniel Gustafsson
Thanks for working on this patchset. I'm looking over 0004 and 0005 but came
across one thing I wanted to bring up sooner rather than waiting for the full
review. In parse_device_authz we have this:

  {"user_code", JSON_TOKEN_STRING, {>user_code}, REQUIRED},
  {"verification_uri", JSON_TOKEN_STRING, {>verification_uri}, REQUIRED},

  /*
   * The following fields are technically REQUIRED, but we don't use
   * them anywhere yet:
   *
   * - expires_in
   */

  {"interval", JSON_TOKEN_NUMBER, {>interval_str}, OPTIONAL},

Together with a colleague we found the Azure provider uses "verification_url"
rather than xxx_uri.  Another discrepancy is that it uses a string for the
interval (ie: "interval":"5").  One can of course argue that Azure is wrong and
should feel bad, but I fear that virtually all (major) providers will have
differences like this, so we will have to deal with it in an extensible fashion
(compile time, not runtime configurable).

I was toying with making the json_field name member an array, to allow
variations.  That won't help with the field type differences though, so another
train of thought was to have some form of REQUIRED_XOR where fields can be tied
together.  What do you think about something along these lines?

Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
even if we don't use them to ensure that the JSON is wellformed?  If the JSON
we get is malformed in any way it seems like the safe/conservative option to
error out.

--
Daniel Gustafsson





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 22:01, Dean Rasheed wrote:
> On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>>
>> Thanks to v3-0002, they are all still significantly faster when both patches 
>> have been applied,
>> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
>> bit slower?
>>
>
> There's no obvious reason why 0001 would make those cases slower, but
> the fact that, together with 0002, it's a significant net win, and the
> gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Yes, I agree, I just thought it was noteworthy, but not a problem per se.

> Something I did notice in my tests was that if ndigits was a small
> multiple of 8, the old code was disproportionately faster, which can
> be explained by the fact that the computation fits exactly into a
> whole number of XMM register operations, with no remaining digits to
> process. For example, these results from above:
>
>  ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
> --+--+---+---+--
>15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
>16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
>17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%
>
>23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
>24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
>25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%
>
>31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
>32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
>33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%
>
> (Note how 16x16 was much faster than 15x15, for example.)
>
> The patched code seems to do a better job at levelling out and coping
> with arbitrary-sized inputs, not just those that fit exactly into a
> whole number of loops using SSE2 operations.

That's nice.

> Something else I noticed was that the relative gains for large numbers
> of digits were much higher with clang than with gcc:
>
> gcc 13.3.0:
>
> 16383 |16383 | 21.629467 |  73.58552 | +240.21%
>
> clang 15.0.7:
>
> 16383 |16383 | 11.562384 |  73.00517 | +531.40%
>
> That seems to be because clang doesn't do a good job of generating
> efficient SSE2 code in the old case of 16-bit x 16-bit
> multiplications. Looking on godbolt.org, it generates
> overly-complicated code using PMULUDQ, which actually does 32-bit x
> 32-bit multiplications. Gcc, on the other hand, generates a much more
> compact loop, using PMULHW and PMULLW, which is much faster. With the
> patch, they both generate the same SSE2 code, so the results are
> pretty consistent.

Very nice.

I've now also had an initial look at the actual code of the patches:

* v3-0001

Looks pretty straightforward, and the PRODSUM macros really
improved readability a lot.
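
(For anyone reading along without the patch applied: the macros compute
the anti-diagonal partial-product sums of schoolbook multiplication,
something of this shape -- my paraphrase, not the patch's exact
definitions:

    #define PRODSUM1(v1, i1, v2, i2) \
        ((uint64) (v1)[i1] * (v2)[i2])
    #define PRODSUM2(v1, i1, v2, i2) \
        ((uint64) (v1)[i1] * (v2)[(i2) + 1] + \
         (uint64) (v1)[(i1) + 1] * (v2)[i2])

which is what makes the main loops read so cleanly.)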

I like these simplifications, how `var2ndigits` is used instead of 
`res_ndigits`:
-   for (int i = res_ndigits - 3; i >= 1; i--)
+   for (int i = var2ndigits - 1; i >= 1; i--)

But I wonder why does `case 1:`  not follow the same pattern?
for (int i = res_ndigits - 2; i >= 0; i--)

* v3-0002

I think it's non-obvious if the separate code paths for 32-bit and 64-bit,
using `#if SIZEOF_DATUM < 8`, to get *fast* 32-bit support, outweigh
the benefits of simpler code.

You brought up the question if 32-bit systems should be regarded
as legacy previously in this thread.

Unfortunately, we didn't get any feedback, so I'm starting a separate
thread, with subject "Is fast 32-bit code still important?", hoping to get
more input to help us make judgement calls.

/Joel




Re: 040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Tom Lane
Robert Haas  writes:
> On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
>> ... However, I added a new open item about how the
>> 040_pg_createsubscriber.pl test is slow and still unstable.

> But that said, I see no commits in the commit history which purport to
> improve performance, so I guess the performance is probably still not
> what you want, though I am not clear on the details.

My concern is described at [1]:

>> I have a different but possibly-related complaint: why is
>> 040_pg_createsubscriber.pl so miserably slow?  On my machine it
>> runs for a bit over 19 seconds, which seems completely out of line
>> (for comparison, 010_pg_basebackup.pl takes 6 seconds, and the
>> other test scripts in this directory take much less).  It looks
>> like most of the blame falls on this step:
>> 
>> [12:47:22.292](14.534s) ok 28 - run pg_createsubscriber on node S
>> 
>> AFAICS the amount of data being replicated is completely trivial,
>> so that it doesn't make any sense for this to take so long --- and
>> if it does, that suggests that this tool will be impossibly slow
>> for production use.  But I suspect there is a logic flaw causing
>> this.  Speculating wildly, perhaps that is related to the failure
>> Alexander spotted?

The followup discussion in that thread made it sound like there's
some fairly fundamental deficiency in how wait_for_end_recovery()
detects end-of-recovery.  I'm not too conversant with the details
though, and it's possible that pg_createsubscriber is just falling
foul of a pre-existing infelicity.

If the problem can be correctly described as "pg_createsubscriber
takes 10 seconds or so to detect end-of-stream", then it's probably
only an annoyance for testing and not something that would be fatal
in the real world.  I'm not quite sure if that's accurate, though.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/2377319.1719766794%40sss.pgh.pa.us#bba9f5ee0efc73151cc521a6bd5182ed




Re: Refactoring postmaster's code to cleanup after child exit

2024-07-29 Thread Heikki Linnakangas

On 06/07/2024 22:01, Heikki Linnakangas wrote:
Reading through postmaster code, I spotted some refactoring 
opportunities to make it slightly more readable.


Currently, when a child process exits, the postmaster first scans 
through BackgroundWorkerList to see if it was a bgworker process. If not 
found, it scans through the BackendList to see if it was a backend 
process (which it really should be then). That feels a bit silly, 
because every running background worker process also has an entry in 
BackendList. There's a lot of duplication between 
CleanupBackgroundWorker and CleanupBackend.


Before commit 8a02b3d732, we used to create Backend entries only for 
background worker processes that connected to a database, not for other 
background worker processes. I think that's why we have the code 
structure we have. But now that we have a Backend entry for all bgworker 
processes, it's more natural to have a single function to deal with both 
regular backends and bgworkers.


So I came up with the attached patches. This doesn't make any meaningful 
user-visible changes, except for some incidental changes in log messages 
(see commit message for details).


New patch version attached. Fixed conflicts with recent commits, no 
other changes.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 84ca49efab16cc2699f8446684a5ebe63dad1c38 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:16 +0300
Subject: [PATCH v2 1/4] Fix outdated comment; all running bgworkers are in
 BackendList

Before commit 8a02b3d732, only bgworkers that connected to a database
had an entry in the Backendlist. Commit 8a02b3d732 changed that, but
forgot to update this comment.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/include/postmaster/bgworker_internals.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/include/postmaster/bgworker_internals.h b/src/include/postmaster/bgworker_internals.h
index 9106a0ef3f..61ba54117a 100644
--- a/src/include/postmaster/bgworker_internals.h
+++ b/src/include/postmaster/bgworker_internals.h
@@ -26,14 +26,14 @@
 /*
  * List of background workers, private to postmaster.
  *
- * A worker that requests a database connection during registration will have
- * rw_backend set, and will be present in BackendList.  Note: do not rely on
- * rw_backend being non-NULL for shmem-connected workers!
+ * All workers that are currently running will have rw_backend set, and will
+ * be present in BackendList.
  */
 typedef struct RegisteredBgWorker
 {
 	BackgroundWorker rw_worker; /* its registry entry */
-	struct bkend *rw_backend;	/* its BackendList entry, or NULL */
+	struct bkend *rw_backend;	/* its BackendList entry, or NULL if not
+								 * running */
 	pid_t		rw_pid;			/* 0 if not running */
 	int			rw_child_slot;
 	TimestampTz rw_crashed_at;	/* if not 0, time it last crashed */
-- 
2.39.2

From e541a6d2c8481cd9f9c75fb2328e8e6031cddac6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 29 Jul 2024 23:13:56 +0300
Subject: [PATCH v2 2/4] Minor refactoring of assign_backendlist_entry()

Make assign_backendlist_entry() responsible just for allocating the
Backend struct. Linking it to the RegisteredBgWorker is the caller's
responsibility now. Seems more clear that way.

Discussion: https://www.postgresql.org/message-id/835232c0-a5f7-4f20-b95b-5b56ba57d...@iki.fi
---
 src/backend/postmaster/postmaster.c | 25 -
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 02442a4b85..a3e9e8fdc0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -416,7 +416,7 @@ static void TerminateChildren(int signal);
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
-static bool assign_backendlist_entry(RegisteredBgWorker *rw);
+static Backend *assign_backendlist_entry(void);
 static void maybe_start_bgworkers(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(BackendType type);
@@ -4028,6 +4028,7 @@ MaxLivePostmasterChildren(void)
 static bool
 do_start_bgworker(RegisteredBgWorker *rw)
 {
+	Backend    *bn;
 	pid_t		worker_pid;
 
 	Assert(rw->rw_pid == 0);
@@ -4042,11 +4043,14 @@ do_start_bgworker(RegisteredBgWorker *rw)
 	 * tried again right away, most likely we'd find ourselves hitting the
 	 * same resource-exhaustion condition.
 	 */
-	if (!assign_backendlist_entry(rw))
+	bn = assign_backendlist_entry();
+	if (bn == NULL)
 	{
 		rw->rw_crashed_at = GetCurrentTimestamp();
 		return false;
 	}
+	rw->rw_backend = bn;
+	rw->rw_child_slot = bn->child_slot;
 
 	ereport(DEBUG1,
 			(errmsg_internal("starting background worker process \"%s\"",
@@ -4119,12 +4123,10 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
  * 

Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Andres Freund
On 2024-07-26 13:55:29 +0200, Daniel Gustafsson wrote:
> Thanks for review, I've applied this backpatched all the way.

Thanks for working on this!




Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Dean Rasheed
On Mon, 29 Jul 2024 at 18:57, Joel Jacobson  wrote:
>
> Thanks to v3-0002, they are all still significantly faster when both patches 
> have been applied,
> but I wonder if it is expected or not, that v3-0001 temporarily made them a 
> bit slower?
>

There's no obvious reason why 0001 would make those cases slower, but
the fact that, together with 0002, it's a significant net win, and the
gains for 5 and 6-digit inputs make it worthwhile, in my opinion.

Something I did notice in my tests was that if ndigits was a small
multiple of 8, the old code was disproportionately faster, which can
be explained by the fact that the computation fits exactly into a
whole number of XMM register operations, with no remaining digits to
process. For example, these results from above:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
--+--+---+---+--
   15 |   15 | 3.7595882e+06 | 5.0751355e+06 | +34.99%
   16 |   16 | 4.3353435e+06 |  4.970363e+06 | +14.65%
   17 |   17 | 3.9258755e+06 |  4.935394e+06 | +25.71%

   23 |   23 | 2.7975982e+06 | 4.5065035e+06 | +61.08%
   24 |   24 | 3.2456168e+06 | 4.4578115e+06 | +37.35%
   25 |   25 | 2.9515055e+06 | 4.0208335e+06 | +36.23%

   31 |   31 |  2.169437e+06 | 3.7209152e+06 | +71.52%
   32 |   32 | 2.5022498e+06 | 3.6609378e+06 | +46.31%
   33 |   33 |   2.27133e+06 |  3.435459e+06 | +51.25%

(Note how 16x16 was much faster than 15x15, for example.)

The patched code seems to do a better job at levelling out and coping
with arbitrary-sized inputs, not just those that fit exactly into a
whole number of loops using SSE2 operations.

Something else I noticed was that the relative gains for large numbers
of digits were much higher with clang than with gcc:

gcc 13.3.0:

16383 |16383 | 21.629467 |  73.58552 | +240.21%

clang 15.0.7:

16383 |16383 | 11.562384 |  73.00517 | +531.40%

That seems to be because clang doesn't do a good job of generating
efficient SSE2 code in the old case of 16-bit x 16-bit
multiplications. Looking on godbolt.org, it generates
overly-complicated code using PMULUDQ, which actually does 32-bit x
32-bit multiplications. Gcc, on the other hand, generates a much more
compact loop, using PMULHW and PMULLW, which is much faster. With the
patch, they both generate the same SSE2 code, so the results are
pretty consistent.

Regards,
Dean




Re: tiny step toward threading: reduce dependence on setlocale()

2024-07-29 Thread Peter Eisentraut

On 27.07.24 21:03, Andreas Karlsson wrote:

On 7/26/24 10:35 PM, Jeff Davis wrote:

database_ctype_is_c refers to the LC_CTYPE environment of the database
-- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
database's default collation.

Confusing, I know, but it matters for a few things that still depend on
the LC_CTYPE, such as tsearch and maybe a few extensions. See
f413941f41.


Ah, right! That was a thinko on my part.

The set of patches looks good to me now. There is further refactoring 
that can be done in this area (and should be done, given all the calls e.g. to 
isalpha()) but I think this set of patches improves code readability 
while moving us away from setlocale().


And even if we take a tiny performance hit here, which your tests did 
not measure, I would say it is worth it both due to code clarity and due 
to not relying on thread-unsafe state.


I do not see these patches in the commitfest app but if they were I 
would have marked them as ready for committer.


Here: https://commitfest.postgresql.org/48/5023/

I have also re-reviewed the patches and I agree they are good to go.





Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 21:10, Robert Haas wrote:

On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:

Prior to PG16, postmaster children would manually detach from shared memory
if it was not needed. However, this behavior was removed in fork mode in
commit aafc05d.


Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.


Detaching shared memory when it is no longer needed is beneficial, as
postmaster children (like syslogger) don't wish to take any risk of
accidentally corrupting shared memory. Additionally, any panic in these
processes will not reset shared memory.


+1.


The attached patch addresses this issue by detaching shared memory after
fork_process().


I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.
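
If it was inadvertent, I'd guess the fix is roughly this shape in the fork
path (field names are my guess; launch_backend.c's child-type table knows
whether a given child kind needs shared memory):

    pid = fork_process();
    if (pid == 0)               /* in the child */
    {
        /* ... detach from postmaster resources ... */

        /*
         * Children that never touch shared memory (e.g. syslogger) should
         * detach from it, so a stray write or a PANIC in them can't
         * corrupt or reset it.
         */
        if (!child_process_kinds[child_type].shmem_attach)
        {
            dsm_detach_all();
            PGSharedMemoryDetach();
        }
    }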


Good catch, it was not intentional. The patch looks good to me, so 
committed. Thanks Rui!


--
Heikki Linnakangas
Neon (https://neon.tech)





040_pg_createsubscriber.pl is slow and unstable (was Re: speed up a logical replica setup)

2024-07-29 Thread Robert Haas
On Sun, Jun 30, 2024 at 2:40 PM Tom Lane  wrote:
> In hopes of moving things along as we approach the v18 branch,
> I went ahead and pushed Kuroda-san's patches (with a bit of
> further editorialization).  AFAICS that allows closing out
> the concerns raised by Noah, so I've marked that open item
> done.  However, I added a new open item about how the
> 040_pg_createsubscriber.pl test is slow and still unstable.

This open item is now the only open item. It seems to have been open
for a month with no response from Peter which is, from my point of
view, far from ideal. However, another thing that is not ideal is that
we've been using the same thread to discuss every issue related to
this patch for 2.5 years. The thread spans hundreds of messages and it
is by no means obvious to what extent the messages posted after this
one addressed the underlying concern. Perhaps it would have been an
idea to start new threads when we started discussing post-commit
issues, instead of just tagging onto the same one.

But that said, I see no commits in the commit history which purport to
improve performance, so I guess the performance is probably still not
what you want, though I am not clear on the details. And as far as
stability is concerned, I peered through the dense haze of
027_stream_regress-related buildfarm failures for long enough to
discover that the stability issues with 040_pg_createsubscriber aren't
fixed either, because we have these recent buildfarm reports:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo=2024-07-26%2016%3A02%3A40
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder=2024-07-26%2009%3A20%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=canebrake=2024-07-25%2002%3A39%3A02
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tamandua=2024-07-22%2002%3A31%3A32

So, Peter, as the committer responsible for pg_createsubscriber, what
are we going to do about this?

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

Partially replying here to an email on -committers [1].

On 2024-07-29 13:57:02 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > However, I still don't think it's a problem to assert that the lock is held
> > in the unlock "routine". As mentioned before, the spinlock implementation
> > itself has never followed the "just straight line code" rule that users of
> > spinlocks are supposed to follow.
>
> Yeah, that's fair.

Cool.


On 2024-07-29 13:56:05 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > b) Instead define the spinlock to have 1 as the unlocked state and 0 as the
> >locked state. That makes it a bit harder to understand that 
> > initialization
> >is missing, compared to a dedicated state, as the first use of the 
> > spinlock
> >just blocks.
>
> This option works for me.

> > To make 1) b) easier to understand it might be worth changing the slock_t
> > typedef to be something like
>
> > typedef struct slock_t
> > {
> > char is_free;
> > } slock_t;
>
> +1

Cool. I've attached a prototype.


I just realized there's a nice little advantage to the "inverted"
representation - it detects missing initialization even in optimized builds.


> How much of this would we change across platforms, and how much
> would be x86-only?  I think there are enough people developing on
> ARM (e.g. Mac) now to make it worth covering that, but maybe we
> don't care so much about anything else.

Not sure. Right now I've only hacked up x86-64 (not even touching i386), but
it shouldn't be hard to change at least some additional platforms.

My current prototype requires S_UNLOCK, S_LOCK_FREE, S_INIT_LOCK to be
implemented for x86-64 instead of using the "generic" implementation. That'd
be mildly annoying duplication if we did so for a few more platforms.


It'd be more palatable to just change all platforms if we made more of them
use __sync_lock_test_and_set (or some other intrinsic(s))...

Greetings,

Andres Freund

[1] https://postgr.es/m/2812376.1722275765%40sss.pgh.pa.us
From 06bd992f4be2f46586bfc7bab3135b3a2ca84e72 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 09:48:26 -0700
Subject: [PATCH v2 1/2] Detect unlocking an unlocked spinlock

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/spin.h | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/include/storage/spin.h b/src/include/storage/spin.h
index c0679c59992..7ec32cd816a 100644
--- a/src/include/storage/spin.h
+++ b/src/include/storage/spin.h
@@ -61,7 +61,22 @@
 
 #define SpinLockAcquire(lock) S_LOCK(lock)
 
-#define SpinLockRelease(lock) S_UNLOCK(lock)
+static inline void
+SpinLockRelease(volatile slock_t *lock)
+{
+	/*
+	 * Check that this backend actually held the spinlock. Implemented as a
+	 * static inline to avoid multiple-evaluation hazards.
+	 *
+	 * Can't assert when using the semaphore fallback implementation - it
+	 * doesn't implement S_LOCK_FREE() - but the fallback triggers failures in
+	 * the case of double-release on its own.
+	 */
+#ifdef HAVE_SPINLOCKS
+	Assert(!S_LOCK_FREE(lock));
+#endif
+	S_UNLOCK(lock);
+}
 
 #define SpinLockFree(lock)	S_LOCK_FREE(lock)
 
-- 
2.45.2.746.g06e570c0df.dirty

From 574d02d3d4d7f38b764ca6a2e4d2617a417ab8a8 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Mon, 29 Jul 2024 10:55:11 -0700
Subject: [PATCH v2 2/2] Detect many uses of uninitialized spinlocks on x86-64

On most platforms we use 0 for an unlocked spinlock and 1 for a locked one. As
we often place spinlocks in zero-initialized memory, missing calls to
SpinLockInit() aren't noticed.

To address that, swap the meaning of the spinlock state and use 1 for unlocked
and 0 for locked. That way using an uninitialized spinlock blocks the first
time a lock is acquired. A dedicated state for initialized locks would allow
for a nicer assertion, but ends up with slightly less dense code (a three byte
cmp with an immediate, instead of a 2 byte test).

To make the swapped meaning of the slock_t state easier to understand when
debugging, change the slock_t to be a struct on x86-64, with an "is_free"
member.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/20240729164026.yurg37tej34o7...@awork3.anarazel.de
Backpatch:
---
 src/include/storage/s_lock.h | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 02c68513a53..18291b284b0 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -205,7 +205,10 @@ spin_delay(void)
 #ifdef __x86_64__		/* AMD Opteron, Intel EM64T */
 #define HAS_TEST_AND_SET
 
-typedef unsigned char slock_t;
+typedef struct slock_t
+{
+	char	is_free;
+} slock_t;
 
 #define TAS(lock) tas(lock)
 
@@ -217,21 +220,27 @@ typedef unsigned char slock_t;
  * and IA32, by Michael Chynoweth and Mary R. Lee. As of this writing, it is
  * 

Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

On 2024-07-29 21:00:35 +0300, Heikki Linnakangas wrote:
> On 29/07/2024 20:48, Andres Freund wrote:
> > On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> > > Heikki Linnakangas  writes:
> > > > Yeah I'm not worried about that at all. Also, the assert is made when
> > > > you have already released the spinlock; you are already out of the
> > > > critical section.
> > > 
> > > Not in the patch Andres posted.
> > 
> > Which seems fairly fundamental - once outside of the critical section, we
> > can't actually assert that the lock isn't acquired, somebody else *validly*
> > might have acquired it by then.
> 
> You could do:
> 
> bool was_free = S_LOCK_FREE(lock);
> 
> S_UNLOCK(lock);
> Assert(!was_free);

I don't really see the point - we're about to crash with an assertion failure,
why would we want to do that outside of the critical section? If anything that
will make it harder to debug the issue in a core dump, because other backends
might "destroy evidence" due to being able to acquire the spinlock.


> Depending on the underlying implementation, you could also use
> compare-and-exchange.

That'd scale a lot worse, at least on x86-64, as it requires the unlock to be
an atomic op, whereas today it's a simple store (+ compiler barrier).
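To make the cost difference concrete, a rough sketch of the two release
styles, using the is_free representation from the patch upthread
(illustrative only, not the actual s_lock.h code):

/* today: a compiler barrier plus a plain byte store */
static inline void
release_store(volatile slock_t *lock)
{
    __asm__ __volatile__("" ::: "memory");
    lock->is_free = 1;
}

/* CAS variant: catches double-release, but is a full atomic RMW */
static inline void
release_cas(slock_t *lock)
{
    char        expected = 0;   /* caller must hold the lock */

    if (!__atomic_compare_exchange_n(&lock->is_free, &expected, 1,
                                     false, __ATOMIC_RELEASE,
                                     __ATOMIC_RELAXED))
        abort();                /* was already free: double release */
}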

I've experimented with replacing all spinlocks with lwlocks, and the fact that
you need an atomic op for an rwlock release is one of the two major reasons
they have a higher overhead (the remainder is boring stuff like the overhead
of external function calls and ownership management).

Greetings,

Andres Freund




Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 5:57 AM Rui Zhao  wrote:
> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.

Oh. The commit message makes no mention of that. I wonder whether it
was inadvertent.

> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.

+1.

> The attached patch addresses this issue by detaching shared memory after
> fork_process().

I don't know whether this is the correct approach or not, but
hopefully Heikki can comment.
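For readers not following the patch, the shape of the proposed change is
roughly this (a sketch only; PGSharedMemoryDetach() and fork_process() are
the existing routines, the predicate is hypothetical):

    pid = fork_process();
    if (pid == 0)               /* in the child */
    {
        /*
         * A child such as syslogger never touches shared memory, so unmap
         * it immediately: a stray write or a PANIC here then cannot
         * corrupt shared memory or force it to be reset.
         */
        if (!child_needs_shmem) /* hypothetical flag */
            PGSharedMemoryDetach();
    }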

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Interrupts vs signals

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 1:24 PM Tom Lane  wrote:
> Robert Haas  writes:
> > I wonder how this works right now. Is there something that limits the
> > number of authentication requests that can be in flight concurrently,
> > or is it completely uncapped (except by machine resources)?
>
> The former.  IIRC, the postmaster won't spawn more than 2X max_connections
> subprocesses (don't recall the exact limit, but it's around there).

Hmm. Not to sidetrack this thread too much, but multiplying by two
doesn't really sound like the right idea to me. The basic idea
articulated in the comment for canAcceptConnections() makes sense:
some backends might fail authentication, or might be about to exit, so
it makes sense to allow for some slop. But 2X is a lot of slop even on
a machine with the default max_connections=100, and people with
connection management problems are likely to be running with
max_connections=500 or max_connections=900 or even (insanely)
max_connections=2000. Continuing with a connection attempt because we
think that hundreds or thousands of connections that are ahead of us
in the queue might clear out of the way before we need a PGPROC is not
a good bet.

I wonder if we ought to restrict this to a small, flat value, like say
50, or by a new GUC that defaults to such a value if a constant seems
problematic. Maybe it doesn't really matter. I'm not sure how much
work we'd save by booting out the doomed connection attempt earlier.
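As a strawman, the "small, flat value" version could look like this
(entirely hypothetical; the current code allows roughly twice
max_connections worth of children):

#define CONNECTION_SLOP 50      /* arbitrary constant, per the above */

static int
MaxLivePostmasterChildren(void)
{
    /* instead of scaling the slop with max_connections */
    return MaxConnections + Min(MaxConnections, CONNECTION_SLOP);
}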

The unlimited number of dead-end backends doesn't sound too great
either. I don't have another idea, but presumably resisting DDOS
attacks and/or preserving resources for things that still have a
chance of working ought to take priority over printing a nicer error
message from a connection that's doomed to fail anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Detect double-release of spinlock

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 20:48, Andres Freund wrote:

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:

Heikki Linnakangas  writes:

Yeah I'm not worried about that at all. Also, the assert is made when
you have already released the spinlock; you are already out of the
critical section.


Not in the patch Andres posted.


Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.


You could do:

bool was_free = S_LOCK_FREE(lock);

S_UNLOCK(lock);
Assert(!was_free);

Depending on the underlying implementation, you could also use 
compare-and-exchange. That makes the assertion-enabled instructions a 
little different than without assertions though.



However, I still don't think it's a problem to assert that the lock is held
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.


Agreed.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 16:42, Joel Jacobson wrote:
> New results with less noise below.
>
> Pardon the exceeding of 80 chars line width,
> but felt important to include commit hash and relative delta.
>
>
> ndigits    |    rate    |  change   |   accum   | commit  |  summary
> -----------+------------+-----------+-----------+---------+----------

I've reviewed the benchmark results, and it looks like v3-0001 made some cases 
a bit slower:

 (32,32)   |  1.786e+06 | -13.27 %  | -11.26 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (32,64)   |  1.119e+06 | -16.72 %  | -20.45 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (32,128)  |  7.242e+05 | -13.55 %  | -9.24 %   | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (64,64)   |  5.515e+05 | -22.34 %  | -24.47 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (64,128)  |  3.204e+05 | -14.83 %  | -12.44 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (128,128) |  1.750e+05 | -16.01 %  | -15.24 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co

Thanks to v3-0002, they are all still significantly faster once both patches
have been applied, but I wonder whether it is expected that v3-0001
temporarily made them a bit slower?

Same cases with v3-0002 applied:

 (32,32)   |  3.408e+06 | +90.80 %  | +69.32 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (32,64)   |  2.356e+06 | +110.63 % | +67.56 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (32,128)  |  1.393e+06 | +92.39 %  | +74.61 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (64,64)   |  1.432e+06 | +159.69 % | +96.14 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (128,128) |  5.567e+05 | +218.07 % | +169.60 % | v3-0002 | Optimise numeric multiplication using base-NBASE^2

/Joel




Re: Detect double-release of spinlock

2024-07-29 Thread Tom Lane
Andres Freund  writes:
> However, I still don't think it's a problem to assert that the lock is held
> in the unlock "routine". As mentioned before, the spinlock implementation
> itself has never followed the "just straight line code" rule that users of
> spinlocks are supposed to follow.

Yeah, that's fair.

regards, tom lane




Re: Detect double-release of spinlock

2024-07-29 Thread Andres Freund
Hi,

On 2024-07-29 13:25:22 -0400, Tom Lane wrote:
> Heikki Linnakangas  writes:
> > Yeah I'm not worried about that at all. Also, the assert is made when 
> > you have already released the spinlock; you are already out of the 
> > critical section.
> 
> Not in the patch Andres posted.

Which seems fairly fundamental - once outside of the critical section, we
can't actually assert that the lock isn't acquired, somebody else *validly*
might have acquired it by then.

However, I still don't think it's a problem to assert that the lock is held
in the unlock "routine". As mentioned before, the spinlock implementation
itself has never followed the "just straight line code" rule that users of
spinlocks are supposed to follow.

Greetings,

Andres Freund




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.


+1.  Seems to have been my oversight in 4c3c8c048d.


Removed.

(Aleksander, you forgot to CC the mailing list, but thanks for your 
review too.)


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Detect double-release of spinlock

2024-07-29 Thread Tom Lane
Heikki Linnakangas  writes:
> Yeah I'm not worried about that at all. Also, the assert is made when 
> you have already released the spinlock; you are already out of the 
> critical section.

Not in the patch Andres posted.

regards, tom lane




Re: Interrupts vs signals

2024-07-29 Thread Tom Lane
Robert Haas  writes:
> I wonder how this works right now. Is there something that limits the
> number of authentication requests that can be in flight concurrently,
> or is it completely uncapped (except by machine resources)?

The former.  IIRC, the postmaster won't spawn more than 2X max_connections
subprocesses (don't recall the exact limit, but it's around there).

regards, tom lane




Re: Interrupts vs signals

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 19:56, Robert Haas wrote:

On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:

a) assign every child process a PGPROC entry, and make postmaster
responsible for assigning them like you suggest. We'll need more PGPROC
entries, because currently a process doesn't reserve one until
authentication has happened. Or we change that behavior.


I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?



My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.


Yes, there's a limit, roughly 2x max_connections; see
MaxLivePostmasterChildren().


There's another issue with that that I was about to post in another 
thread, but here goes: the MaxLivePostmasterChildren() limit is shared 
by all regular backends, bgworkers and autovacuum workers. If you open a 
lot of TCP connections to postmaster and don't send anything to the 
server, you exhaust those slots, and the server won't be able to start 
any autovacuum workers or background workers either. That's not great. I 
started to work on approach b), with separate pools of slots for 
different kinds of child processes, which fixes that. Stay tuned for a 
patch.


In addition to that, you can have an unlimited number of "dead-end" 
backends, which are doomed to just respond with a "sorry, too many 
clients" error. The only limit on those is the amount of resources 
needed for all the processes and a little memory to track them.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Detect double-release of spinlock

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 19:51, Andres Freund wrote:

On 2024-07-29 09:40:26 -0700, Andres Freund wrote:

On 2024-07-29 12:33:13 -0400, Tom Lane wrote:

Andres Freund  writes:

On 2024-07-29 11:31:56 -0400, Tom Lane wrote:

There was some recent discussion about getting rid of
--disable-spinlocks on the grounds that nobody would use
hardware that lacked native spinlocks.  But now I wonder
if there is a testing/debugging reason to keep it.



Seems it'd be a lot more straightforward to just add an assertion to the
x86-64 spinlock implementation verifying that the spinlock isn't already free?


FWIW, I quickly hacked that up, and it indeed quickly fails with 0393f542d72^
and passes with 0393f542d72.


+1. Thanks!


Other context from this discussion:
I dunno, is that the only extra check that the --disable-spinlocks
implementation is providing?


I think it also provides the (valuable!) check that spinlocks were actually
initialized. But that again seems like something we'd be better off adding
more general infrastructure for - nobody runs --disable-spinlocks locally, we
shouldn't need to run this on the buildfarm to find problems like this.


Note that the "check" for double-release with the fallback 
implementation wasn't an explicit check either. It just incremented the 
underlying semaphore, which caused very weird failures later in 
completely unrelated code. An explicit assert would be much nicer.


+1 for removing --disable-spinlocks, but let's add this assertion first.


I'm kind of allergic to putting Asserts into spinlocked code segments,
mostly on the grounds that it violates the straight-line-code precept.
I suppose it's not really that bad for tests that you don't expect
to fail, but still ...


I don't think the spinlock implementation itself is really affected by that
rule - after all, the --disable-spinlocks implementation actually consists
of several layers of external function calls (including syscalls in some
cases!).


Yeah I'm not worried about that at all. Also, the assert is made when 
you have already released the spinlock; you are already out of the 
critical section.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Interrupts vs signals

2024-07-29 Thread Robert Haas
On Wed, Jul 24, 2024 at 8:58 AM Heikki Linnakangas  wrote:
> a) assign every child process a PGPROC entry, and make postmaster
> responsible for assigning them like you suggest. We'll need more PGPROC
> entries, because currently a process doesn't reserve one until
> authentication has happened. Or we change that behavior.

I wonder how this works right now. Is there something that limits the
number of authentication requests that can be in flight concurrently,
or is it completely uncapped (except by machine resources)?

My first thought when I read this was that it would be bad to have to
put a limit on something that's currently unlimited. But then I
started thinking that, even if it is currently unlimited, that might
be a bug rather than a feature. If you have hundreds of pending
authentication requests, that just means you're using a lot of machine
resources on something that doesn't really help anybody. A machine
with hundreds of authentication-pending connections is possibly
getting DDOS'd and probably getting buried. You'd be better off
focusing the machine's limited resources on the already-established
connections and a more limited number of new connection attempts. If
you accept so many connection attempts that you don't actually have
enough memory/CPU/kernel scheduling firepower to complete the
authentication process with any of them, it does nobody any good.

I'm not sure what's best to do here; just thinking out loud.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Dagfinn Ilmari Mannsåker
Nathan Bossart  writes:

> Committed.

Thanks!

- ilmari




Re: CREATE MATERIALIZED VIEW

2024-07-29 Thread Nathan Bossart
Committed.

-- 
nathan




Re: Incremental backup from a streaming replication standby fails

2024-07-29 Thread Robert Haas
On Fri, Jul 26, 2024 at 4:13 PM Alexander Korotkov  wrote:
> Great!  Should we mark the corresponding v17 open item as closed?

Done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Assertion failure with summarize_wal enabled during pg_createsubscriber

2024-07-29 Thread Robert Haas
On Fri, Jul 26, 2024 at 4:10 PM Alexander Korotkov  wrote:
> 0002 could also use pg_indent and pgperltidy.  And I have couple other
> notes regarding 0002.

As Tom said elsewhere, we don't have a practice of requiring perltidy
for every commit, and I normally don't bother. The tree is
pgindent-clean currently so I believe I got that part right.

> >In the process of fixing these bugs, I realized that the logic to wait
> >for WAL summarization to catch up was spread out in a way that made
> >it difficult to reuse properly, so this code refactors things to make
> >it easier.
>
> It would be nice to split refactoring out of material logic changed.
> This way it would be easier to review and would look cleaner in the
> git history.

I support that idea in general but felt it was overkill in this case:
it's new code, and there was only one existing caller of the function
that got refactored, and I'm not a huge fan of cluttering the git
history with a bunch of tiny little refactoring commits to fix a
single bug. I might have changed it if I'd seen this note before
committing, though.

> >To make this fix work, also teach the WAL summarizer that after a
> >promotion has occurred, no more WAL can appear on the previous
> >timeline: previously, the WAL summarizer wouldn't switch to the new
> >timeline until we actually started writing WAL there, but that meant
> >that when the startup process was waiting for the WAL summarizer, it
> >was waiting for an action that the summarizer wasn't yet prepared to
> >take.
>
> I think this is a pretty long sentence, and I'm not sure I can
> understand it correctly.  Does startup process wait for the WAL
> summarizer without this patch?  If not, it's not clear for me that
> word "previously" doesn't distribute to this part of sentence.
> Breaking this into multiple sentences could improve the clarity for
> me.

Yes, I think that phrasing is muddled. It's too late to amend the
commit message, but for posterity: I initially thought that I could
fix this bug by just teaching the startup process to wait for WAL
summarization before performing the .partial renaming, but that was
not enough by itself. The problem is that the WAL summarizer would
read all the records that were present in the final WAL file on the
old timeline, but it wouldn't actually write out a summary file,
because that only happens when it reaches XLOG_CHECKPOINT_REDO or a
timeline switch point. Since no WAL had appeared on the new timeline
yet, it didn't view the end of the old timeline as a switch point, so
it just sat there waiting, without ever writing out a file; and the
startup process sat there waiting for it. So the second part of the
fix is to make the WAL summarizer realize that once the startup
process has chosen a new timeline ID, no more WAL is going to appear
on the old timeline, and thus it can (and should) write out the final
summary file for the old timeline and prepare to read WAL from the new
timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Robert Haas
On Mon, Jul 29, 2024 at 5:57 AM Daniel Gustafsson  wrote:
> I'm sure there are more interactions with OpenSSL, and TLS in general, which
> warrants better comments but the attached takes a stab at the two examples in
> question here to get started (to avoid perfect get in the way of progress).

+1.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Andreas Karlsson

On 7/29/24 5:00 PM, Alexander Lakhin wrote:

I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?


I was actually wondering about those yesterday and they should probably 
be removed (or fixed if anyone can see a use for them). As they are 
right now they do not seem very useful, especially with the current 
selection of locales: de_DE.ISO8859-1, gr_GR.ISO8859-7 and koi8-r.


Andreas




Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Alexander Lakhin

Hello Tom and Heikki,

29.07.2024 17:15, Tom Lane wrote:

Heikki Linnakangas  writes:

These programs in src/backend/utils/mb/ are unused, and have been unused
and unusable since 2003:
iso.c
win1251.c
win866.c
Attached patch removes them. See commit message for a little more
detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.


I also wonder whether src/test/locale/ still makes sense; does anyone
run those tests (I could not run a single one on a quick attempt)?

(As far as I can tell, KOI8-R has fallen out of mainstream usage in Russia
twenty years ago...)

Best regards,
Alexander




Re: Virtual generated columns

2024-07-29 Thread Peter Eisentraut

On 22.07.24 12:53, jian he wrote:

another bug?
drop table gtest12v;
CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
insert into gtest12v (a,b) values (11,  22147483647);
table gtest12v;

insert ok, but select error:
ERROR:  integer out of range

should insert fail?


I think this is the correct behavior.

There has been a previous discussion: 
https://www.postgresql.org/message-id/2e3d5147-16f8-af0f-00ab-4c72cafc896f%402ndquadrant.com



CREATE TABLE gtest12v (a int PRIMARY KEY, b bigint, c int GENERATED
ALWAYS AS (b * 2) VIRTUAL);
CREATE SEQUENCE sequence_testx OWNED BY gtest12v.c;

seems to work. But I am not sure if there are any corner cases that
make it not work.
just want to raise this issue.


I don't think this matters.  You can make a sequence owned by any 
column, even if that column doesn't have a default that invokes the 
sequence.  So nonsensical setups are possible, but they are harmless.






Re: Optimize mul_var() for var1ndigits >= 8

2024-07-29 Thread Joel Jacobson
On Mon, Jul 29, 2024, at 02:23, Joel Jacobson wrote:
> Then, I've used sched_setaffinity() from  to ensure the 
> benchmark run on CPU core id 31.

I fixed a bug in my measure function: I had forgotten to reset affinity after each
benchmark, so the PostgreSQL backend continued to use the core even after
numeric_mul had finished.

New results with less noise below.

Pardon the exceeding of 80 chars line width,
but felt important to include commit hash and relative delta.


ndigits|    rate    |  change   |   accum   | commit  |  summary
-------+------------+-----------+-----------+---------+----------
 (1,1) |  1.639e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,1) |  2.248e+07 | +37.16 %  | +37.16 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,1) |  2.333e+07 | +3.77 %   | +42.32 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,1) |  2.291e+07 | -1.81 %   | +39.75 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,1) |  2.276e+07 | -0.64 %   | +38.86 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,1) |  2.256e+07 | -0.86 %   | +37.66 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,1) |  2.182e+07 | -3.32 %   | +33.09 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,2) |  1.640e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,2) |  2.202e+07 | +34.28 %  | +34.28 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,2) |  2.214e+07 | +0.58 %   | +35.06 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,2) |  2.173e+07 | -1.85 %   | +32.55 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,2) |  2.260e+07 | +3.98 %   | +37.83 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,2) |  2.233e+07 | -1.19 %   | +36.19 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,2) |  2.144e+07 | -3.97 %   | +30.79 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,3) |  1.511e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,3) |  2.179e+07 | +44.20 %  | +44.20 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,3) |  2.134e+07 | -2.05 %   | +41.24 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,3) |  2.198e+07 | +2.99 %   | +45.47 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,3) |  2.190e+07 | -0.39 %   | +44.91 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,3) |  2.164e+07 | -1.16 %   | +43.23 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,3) |  2.104e+07 | -2.79 %   | +39.24 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,4) |  1.494e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,4) |  2.132e+07 | +42.71 %  | +42.71 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,4) |  2.151e+07 | +0.91 %   | +44.00 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,4) |  2.190e+07 | +1.82 %   | +46.62 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,4) |  2.172e+07 | -0.82 %   | +45.41 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,4) |  2.112e+07 | -2.75 %   | +41.41 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,4) |  2.077e+07 | -1.67 %   | +39.05 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,8) |  1.444e+07 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,8) |  2.063e+07 | +42.85 %  | +42.85 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,8) |  1.996e+07 | -3.25 %   | +38.21 %  | 628c1d1 | Use diff's --strip-trailing-cr flag where appropri
 (1,8) |  2.039e+07 | +2.12 %   | +41.14 %  | 0dcf753 | Improve the numeric width_bucket() computation. Fo
 (1,8) |  2.020e+07 | -0.89 %   | +39.87 %  | da87dc0 | Add missing pointer dereference in pg_backend_memo
 (1,8) |  1.934e+07 | -4.28 %   | +33.89 %  | v3-0001 | Extend mul_var_short() to 5 and 6-digit inputs. Co
 (1,8) |  1.948e+07 | +0.73 %   | +34.87 %  | v3-0002 | Optimise numeric multiplication using base-NBASE^2
 (1,16)|  9.614e+06 |           |           | 42de72f | SQL/JSON: Various improvements to SQL/JSON query f
 (1,16)|  1.215e+07 | +26.37 %  | +26.37 %  | ca481d3 | Optimise numeric multiplication for short inputs.
 (1,16)|  1.223e+07 

Re: Remove dead code generation tools in src/backend/utils/mb/

2024-07-29 Thread Tom Lane
Heikki Linnakangas  writes:
> These programs in src/backend/utils/mb/ are unused, and have been unused 
> and unusable since 2003:
> iso.c
> win1251.c
> win866.c
> Attached patch removes them. See commit message for a little more 
> detailed analysis.

+1.  Seems to have been my oversight in 4c3c8c048d.

regards, tom lane




Re: Make query cancellation keys longer

2024-07-29 Thread Heikki Linnakangas

On 24/07/2024 19:12, Heikki Linnakangas wrote:

On 04/07/2024 15:20, Jelte Fennema-Nio wrote:

On Thu, 4 Jul 2024 at 12:32, Heikki Linnakangas  wrote:

We currently don't do any locking on the ProcSignal array. For query
cancellations, that's good because a query cancel packet is processed
without having a PGPROC entry, so we cannot take LWLocks. We could use
spinlocks though. In this patch, I used memory barriers to ensure that
we load/store the pid and the cancellation key in a sensible order, so
that you cannot e.g. send a cancellation signal to a backend that's just
starting up and hasn't advertised its cancellation key in ProcSignal
yet. But I think this might be simpler and less error-prone by just
adding a spinlock to each ProcSignal slot. That would also fix the
existing race condition where we might set the pss_signalFlags flag for
a slot, when the process concurrently terminates and the slot is reused
for a different process. Because of that, we currently have this caveat:
"... all the signals are such that no harm is done if they're mistakenly
fired". With a spinlock, we could eliminate that race.


I think a spinlock would make this whole concurrency stuff a
lot easier to reason about.

+   slot->pss_cancel_key_valid = false;
+   slot->pss_cancel_key = 0;

If no spinlock is added, I think these accesses should still be made
atomic writes. Otherwise data-races on those fields are still
possible, resulting in undefined behaviour. The memory barriers you
added don't prevent that afaict. With atomic operations there are
still race conditions, but no data-races.

Actually it seems like that same argument applies to the already
existing reading/writing of pss_pid: it's written/read using
non-atomic operations so data-races can occur and thus undefined
behaviour too.


Ok, here's a version with spinlocks.

I went back and forth on what exactly is protected by the spinlock. I
kept the "volatile sig_atomic_t" type for pss_signalFlags, so that it
can still be safely read without holding the spinlock in
CheckProcSignal, but all the functions that set the flags now hold the
spinlock. That removes the race condition that you might set the flag
for the wrong slot, if the target backend exits and the slot is recycled.
The race condition was harmless and there were comments to note it, but
it doesn't occur anymore with the spinlock.
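The resulting pattern when setting a flag is roughly the following
(illustrative; field names follow the patch, but the exact code differs):

    SpinLockAcquire(&slot->pss_mutex);
    if (slot->pss_pid == pid)
    {
        /* the slot still belongs to the backend we mean to signal */
        slot->pss_signalFlags[reason] = true;
    }
    SpinLockRelease(&slot->pss_mutex);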

(Note: Thomas's "Interrupts vs signals" patch will remove ps_signalFlags
altogether. I'm looking forward to that.)


-	volatile pid_t pss_pid;
+	pid_t		pss_pid;

Why remove the volatile modifier?


Because I introduced a memory barrier to ensure the reads/writes of
pss_pid become visible to other processes in right order. That makes the
'volatile' unnecessary IIUC. With the spinlock, the point is moot however.


I:
- fixed a few comments,
- fixed a straightforward failure with EXEC_BACKEND (ProcSignal needs to 
be passed down in BackendParameters now),
- put back the snippet to signal the whole process group if supported, 
which you pointed out earlier


and committed this refactoring patch.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)





RE: Conflict detection and logging in logical replication

2024-07-29 Thread Zhijie Hou (Fujitsu)
On Monday, July 29, 2024 6:59 PM Dilip Kumar  wrote:
> 
> On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> 
> I was going through v7-0001, and I have some initial comments.

Thanks for the comments !

> 
> @@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
>   ExprContext *econtext;
>   Datum values[INDEX_MAX_KEYS];
>   bool isnull[INDEX_MAX_KEYS];
> - ItemPointerData invalidItemPtr;
>   bool checkedIndex = false;
> 
>   ItemPointerSetInvalid(conflictTid);
> - ItemPointerSetInvalid(&invalidItemPtr);
> 
>   /*
>   * Get information from the result relation info structure.
> @@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
> *resultRelInfo, TupleTableSlot *slot,
> 
>   satisfiesConstraint =
>   check_exclusion_or_unique_constraint(heapRelation, indexRelation,
> - indexInfo, &invalidItemPtr,
> + indexInfo, &slot->tts_tid,
>   values, isnull, estate, false,
>   CEOUC_WAIT, true,
>   conflictTid);
> 
> What is the purpose of this change?  I mean
> 'check_exclusion_or_unique_constraint' says that 'tupleid'
> should be invalidItemPtr if the tuple is not yet inserted and
> ExecCheckIndexConstraints is called by ExecInsert before inserting the tuple.
> So what is this change?

Because the function ExecCheckIndexConstraints() is now invoked after inserting
a tuple (in the patch), we need to ignore the newly inserted tuple when
checking for conflicts in check_exclusion_or_unique_constraint().

> Would this change ExecInsert's behavior as well?

Thanks for pointing it out, I will check and reply.

> 
> 
> 
> 
> +ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
> +ConflictType type, List *recheckIndexes,
> +TupleTableSlot *slot)
> +{
> + /* Re-check all the unique indexes for potential conflicts */
> +foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
> + {
> + TupleTableSlot *conflictslot;
> +
> + if (list_member_oid(recheckIndexes, uniqueidx) &&
> + FindConflictTuple(resultRelInfo, estate, uniqueidx, slot,
> + &conflictslot)) { RepOriginId origin; TimestampTz committs;
> + TransactionId xmin;
> +
> + GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
> +ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc,
> +uniqueidx,  xmin, origin, committs, conflictslot);  }  } }
>  and
> 
> + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
> +
>   if (resultRelInfo->ri_NumIndices > 0)
>   recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
> -slot, estate, false, false,
> -NULL, NIL, false);
> +slot, estate, false,
> +conflictindexes ? true : false,
> +&conflict,
> +conflictindexes, false);
> +
> + /*
> + * Rechecks the conflict indexes to fetch the conflicting local tuple
> + * and reports the conflict. We perform this check here, instead of
> + * performing an additional index scan before the actual insertion and
> + * reporting the conflict if any conflicting tuples are found. This is
> + * to avoid the overhead of executing the extra scan for each INSERT
> + * operation, even when no conflict arises, which could introduce
> + * significant overhead to replication, particularly in cases where
> + * conflicts are rare.
> + */
> + if (conflict)
> + ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
> +recheckIndexes, slot);
> 
> 
>  This logic is confusing, first, you are calling
> ExecInsertIndexTuples() with no duplicate error for the indexes
> present in 'ri_onConflictArbiterIndexes' which means
>  the indexes returned by the function must be a subset of
> 'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
> you are again processing all the
>  indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
> is a subset of the indexes that is returned by
> ExecInsertIndexTuples().

I think that's not always true. The indexes returned by the function *may not*
be a subset of 'ri_onConflictArbiterIndexes'. Based on the comments atop
ExecInsertIndexTuples, it returns a list of index OIDs for any unique or
exclusion constraints that are deferred, and in addition to that, it will
include the indexes in 'arbiterIndexes' if noDupErr == true.

> 
>  Why are we doing that, I think we can directly use the
> 'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
> those indexes are guaranteed to be a subset of
>  ri_onConflictArbiterIndexes. No?

Based on the above, we need to filter the deferred indexes or exclusion constraints
in the 'ri_onConflictArbiterIndexes'.

Best Regards,
Hou zj



Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Peter Eisentraut
I have some comments about the first three patches, which deal with 
memory management.


v24-0001-Revert-ECPG-s-use-of-pnstrdup.patch

This looks right.

I suppose another approach would be to put a full replacement for 
strndup() into src/port/.  But since there is currently only one user, 
and most other users should be using pnstrdup(), the presented approach 
seems ok.


We should take the check for exit() calls from libpq and expand it to 
cover the other libraries as well.  Maybe there are other problems like 
this?



v24-0002-Remove-fe_memutils-from-libpgcommon_shlib.patch

I don't quite understand how this problem can arise.  The description says

"""
libpq appears to have no need for this, and the exit() references cause
our libpq-refs-stamp test to fail if the linker doesn't strip out the
unused code.
"""

But under what circumstances does "the linker doesn't strip out" happen? 
 If this happens accidentally, then we should have seen some buildfarm 
failures or something?


Also, one could look further and notice that restricted_token.c and 
sprompt.c both a) are not needed by libpq and b) can trigger exit() 
calls.  Then it's not clear why those are not affected.



v24-0003-common-jsonapi-support-libpq-as-a-client.patch

I'm reminded of thread [0].  I think there is quite a bit of confusion 
about the pqexpbuffer vs. stringinfo APIs, and they are probably used 
incorrectly quite a bit.  There are now also programs that use both of 
them!  This patch now introduces another layer on top of them.  I fear, 
in the end, nobody is going to understand any of this anymore.  Also, 
changing all the programs to link in libpq for pqexpbuffer seems like 
the opposite direction from what was suggested in [0].


I think we need to do some deeper thinking here about how we want the 
memory management on the client side to work.  Maybe we could just use 
one API but have some flags or callbacks to control the out-of-memory 
behavior.
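As a strawman for that last idea, a single buffer type whose out-of-memory
behavior is selected at initialization time instead of being baked into two
parallel APIs (all names here are invented for illustration):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum
{
    OOM_EXIT,           /* frontend stringinfo-style: report and exit */
    OOM_MARK_BROKEN     /* pqexpbuffer-style: mark the buffer broken */
} OOMBehavior;

typedef struct StrBuf
{
    char       *data;
    size_t      len;
    size_t      maxlen;
    OOMBehavior oom;    /* chosen once, at buffer creation */
    bool        broken; /* set on allocation failure, if requested */
} StrBuf;

static bool
strbuf_reserve(StrBuf *buf, size_t need)
{
    if (buf->len + need > buf->maxlen)
    {
        size_t      newmax = (buf->maxlen + need) * 2;
        char       *newdata = realloc(buf->data, newmax);

        if (newdata == NULL)
        {
            if (buf->oom == OOM_EXIT)
            {
                fprintf(stderr, "out of memory\n");
                exit(1);
            }
            buf->broken = true;
            return false;
        }
        buf->data = newdata;
        buf->maxlen = newmax;
    }
    return true;
}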


[0]: 
https://www.postgresql.org/message-id/flat/16d0beac-a141-e5d3-60e9-323da75f49bf%40eisentraut.org






Re: Conflict detection and logging in logical replication

2024-07-29 Thread shveta malik
On Mon, Jul 29, 2024 at 9:31 AM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
> >
> > On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> > >
> >
> > > One more thing we need to consider is whether we should LOG or ERROR
> > > for update/delete_differ conflicts. If we LOG as the patch is doing
> > > then we are intentionally overwriting the row when the user may not
> > > expect it. OTOH, without a patch anyway we are overwriting, so there
> > > is an argument that logging by default is what the user will expect.
> > > What do you think?
> >
> > I was under the impression that in this patch we do not intend to
> > change behaviour of HEAD and thus only LOG the conflict wherever
> > possible.
> >
>
> Earlier, I thought it was good to keep LOGGING the conflict where
> there is no chance of wrong data update but for cases where we can do
> something wrong, it is better to ERROR out. For example, for an
> update_differ case where the apply worker can overwrite the data from
> a different origin, it is better to ERROR out. I thought this case was
> comparable to an existing ERROR case like a unique constraint
> violation. But I see your point as well that one might expect the
> existing behavior where we are silently overwriting the different
> origin data. The one idea to address this concern is to suggest users
> set the detect_conflict subscription option as off but I guess that
> would make this feature unusable for users who don't want to ERROR out
> for different origin update cases.
>
> > And in the next patch of resolver, based on the user's input
> > of error/skip/or resolve, we take the action. I still think it is
> > better to stick to the said behaviour. Only if we commit the resolver
> > patch in the same version where we commit the detection patch, then we
> > can take the risk of changing this default behaviour to 'always
> > error'. Otherwise users will be left with conflicts arising but no
> > automatic way to resolve those. But for users who really want their
> > application to error out, we can provide an additional GUC in this
> > patch itself which changes the behaviour to 'always ERROR on
> > conflict'.
> >
>
> I don't see a need of GUC here, even if we want we can have a
> subscription option such conflict_log_level. But users may want to
> either LOG or ERROR based on conflict type. For example, there won't
> be any data inconsistency in two node replication for delete_missing
> case as one is trying to delete already deleted data, so LOGGING such
> a case should be sufficient whereas update_differ could lead to
> different data on two nodes, so the user may want to ERROR out in such
> a case.
>
> We can keep the current behavior as default for the purpose of
> conflict detection but can have a separate patch to decide whether to
> LOG/ERROR based on conflict_type.

+1 on the idea of giving an option to the user to choose either ERROR
or LOG for each conflict type separately.
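For instance, the apply worker could map each conflict type to a level
along these lines (a sketch; only CT_INSERT_EXISTS appears in the posted
patch, the other enum names are made up to match the conflict names above):

static int
conflict_elevel(ConflictType type)
{
    switch (type)
    {
        case CT_UPDATE_DIFFER:  /* can silently diverge the two nodes */
            return ERROR;
        case CT_DELETE_MISSING: /* deleting already-deleted data */
            return LOG;
        case CT_INSERT_EXISTS:
            return ERROR;
        default:
            return LOG;
    }
}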

thanks
Shveta




Re: Detach shared memory in Postmaster child if not needed

2024-07-29 Thread Aleksander Alekseev
Hi Rui,

> Prior to PG16, postmaster children would manually detach from shared memory
> if it was not needed. However, this behavior was removed in fork mode in
> commit aafc05d.
>
> Detaching shared memory when it is no longer needed is beneficial, as
> postmaster children (like syslogger) don't wish to take any risk of
> accidentally corrupting shared memory. Additionally, any panic in these
> processes will not reset shared memory.
>
> The attached patch addresses this issue by detaching shared memory after
> fork_process().

Thanks for the patch. How do you estimate its performance impact?

Note the comments for postmaster_child_launch(). This function is
exposed to third-party code and guarantees that shared memory is
attached. I doubt that there is much third-party code in existence to
break, but you should update the comment.

-- 
Best regards,
Aleksander Alekseev




Re: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-29 Thread Amit Kapila
On Thu, Jul 25, 2024 at 5:56 PM Kirill Reshke  wrote:
>
> > +/*
> > + * If the tuple to be modified could not be found, a log message is 
> > emitted.
> > + */
> > +static void
> > +report_tuple_not_found(CmdType cmd, Relation targetrel, bool is_partition)
> > +{
> > + Assert(cmd == CMD_UPDATE || cmd == CMD_DELETE);
> > +
> > + /* XXX should this be promoted to ereport(LOG) perhaps? */
> > + elog(DEBUG1,
> > + "logical replication did not find row to be %s in replication target 
> > relation%s \"%s\"",
> > + cmd == CMD_UPDATE ? "updated" : "deleted",
> > + is_partition ? "'s partition" : "",
> > + RelationGetRelationName(targetrel));
> > +}
>
> Encapsulating report logic inside function is ok,
>

This could even be a separate patch, as it is not directly related to other
parts of the 0002 patch. BTW, the problem statement for 0002 is not
explicitly stated, i.e. which part of the code we want to optimize by
removing duplication. Also, as proposed, the name
apply_handle_tuple_routing() for the function doesn't seem suitable as
it no longer remains similar to other apply_handle_* functions where
we perform the required operation like insert or update the tuple. How
about naming it as apply_tuple_routing()?

-- 
With Regards,
Amit Kapila.




Re: Fix smgrtruncate code comment.

2024-07-29 Thread Heikki Linnakangas

On 29/07/2024 13:50, Kirill Reshke wrote:

Today I was busy rebasing my Greenplum-related Extendable SMGR
patches on Cloudberry, and I faced some conflicts.
While resolving them I noticed a code & comment inconsistency in smgr.c's
smgrtruncate function, which traces back to
c5315f4f44843c20ada876fdb0d0828795dfbdf5. In that commit, the
smgr_fsm_nblocks & smgr_vm_nblocks fields were removed, but the
comments were not fixed accordingly.

So I suggest fixing this, PFA.


Applied, thanks!

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Conflict detection and logging in logical replication

2024-07-29 Thread Dilip Kumar
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>

I was going through v7-0001, and I have some initial comments.

@@ -536,11 +542,9 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,
  ExprContext *econtext;
  Datum values[INDEX_MAX_KEYS];
  bool isnull[INDEX_MAX_KEYS];
- ItemPointerData invalidItemPtr;
  bool checkedIndex = false;

  ItemPointerSetInvalid(conflictTid);
- ItemPointerSetInvalid(&invalidItemPtr);

  /*
  * Get information from the result relation info structure.
@@ -629,7 +633,7 @@ ExecCheckIndexConstraints(ResultRelInfo
*resultRelInfo, TupleTableSlot *slot,

  satisfiesConstraint =
  check_exclusion_or_unique_constraint(heapRelation, indexRelation,
- indexInfo, &invalidItemPtr,
+ indexInfo, &slot->tts_tid,
  values, isnull, estate, false,
  CEOUC_WAIT, true,
  conflictTid);

What is the purpose of this change?  I mean
'check_exclusion_or_unique_constraint' says that 'tupleid'
should be invalidItemPtr if the tuple is not yet inserted and
ExecCheckIndexConstraints is called by ExecInsert
before inserting the tuple.  So what is this change? Would this change
ExecInsert's behavior as well?




+ReCheckConflictIndexes(ResultRelInfo *resultRelInfo, EState *estate,
+ConflictType type, List *recheckIndexes,
+TupleTableSlot *slot)
+{
+ /* Re-check all the unique indexes for potential conflicts */
+ foreach_oid(uniqueidx, resultRelInfo->ri_onConflictArbiterIndexes)
+ {
+ TupleTableSlot *conflictslot;
+
+ if (list_member_oid(recheckIndexes, uniqueidx) &&
+ FindConflictTuple(resultRelInfo, estate, uniqueidx, slot, &conflictslot))
+ {
+ RepOriginId origin;
+ TimestampTz committs;
+ TransactionId xmin;
+
+ GetTupleCommitTs(conflictslot, &xmin, &origin, &committs);
+ ReportApplyConflict(ERROR, type, resultRelInfo->ri_RelationDesc, uniqueidx,
+ xmin, origin, committs, conflictslot);
+ }
+ }
+}
 and

+ conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes;
+
  if (resultRelInfo->ri_NumIndices > 0)
  recheckIndexes = ExecInsertIndexTuples(resultRelInfo,
-slot, estate, false, false,
-NULL, NIL, false);
+slot, estate, false,
+conflictindexes ? true : false,
+&conflict,
+conflictindexes, false);
+
+ /*
+ * Rechecks the conflict indexes to fetch the conflicting local tuple
+ * and reports the conflict. We perform this check here, instead of
+ * performing an additional index scan before the actual insertion and
+ * reporting the conflict if any conflicting tuples are found. This is
+ * to avoid the overhead of executing the extra scan for each INSERT
+ * operation, even when no conflict arises, which could introduce
+ * significant overhead to replication, particularly in cases where
+ * conflicts are rare.
+ */
+ if (conflict)
+ ReCheckConflictIndexes(resultRelInfo, estate, CT_INSERT_EXISTS,
+recheckIndexes, slot);


 This logic is confusing, first, you are calling
ExecInsertIndexTuples() with no duplicate error for the indexes
present in 'ri_onConflictArbiterIndexes' which means
 the indexes returned by the function must be a subset of
'ri_onConflictArbiterIndexes' and later in ReCheckConflictIndexes()
you are again processing all the
 indexes of 'ri_onConflictArbiterIndexes' and checking if any of these
is a subset of the indexes that is returned by
ExecInsertIndexTuples().

 Why are we doing that, I think we can directly use the
'recheckIndexes' which is returned by ExecInsertIndexTuples(), and
those indexes are guaranteed to be a subset of
 ri_onConflictArbiterIndexes. No?



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Add crc32(text) & crc32(bytea)

2024-07-29 Thread Aleksander Alekseev
Hi Nathan,

> I don't think adding crc32c() would sufficiently increase the scope.  We'd
> use the existing implementations for both crc32() and crc32c().  And
> besides, this could be useful for adding tests for that code.
>
> +crc32 ( text )
>
> Do we need a version of the function that takes a text input?  It's easy
> enough to cast to a bytea.
>
> +text
>
> My first reaction is that we should just have this return bytea like the
> SHA ones do, if for no other reason than commit 10cfce3 seems intended to
> move us away from returning text for these kinds of functions.  Upthread,
> you mentioned the possibility of returning a bigint, too.  I think I'd
> still prefer bytea in case we want to add, say, crc64() or crc16() in the
> future.  That would allow us to keep all of these functions consistent
> instead of returning different types for each.  However, I understand that
> returning the numeric types might be more convenient.  I'm curious what
> others think about this.
>
> +Computes the CRC32 hash of
> +the binary string, with the result written in hexadecimal.
>
> I'm not sure we should call the check values "hashes."  Wikipedia does
> include them in the "List of hash functions" page [0], but it seems to
> deliberately avoid calling them hashes in the CRC page [1].  I'd suggest
> calling them "CRC32 values" instead.

Thanks for the code review. Here is the updated patch.

-- 
Best regards,
Aleksander Alekseev


v2-0001-Add-crc32-bytea-crc32c-bytea.patch
Description: Binary data


Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Andrew Dunstan



On 2024-07-29 Mo 4:00 AM, Alexander Lakhin wrote:

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on 
HEAD for fairywren and drongo - fairywren has just gone green, and I 
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l

17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.




Well, that last fact explains the discrepancy I originally complained 
about. How the heck did that happen? It looks like we just ignored its 
use in Makefile.global.in :-(



cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Logical Replication of sequences

2024-07-29 Thread vignesh C
On Thu, 25 Jul 2024 at 15:41, shveta malik  wrote:
>
> On Thu, Jul 25, 2024 at 12:08 PM shveta malik  wrote:
> >
> > On Thu, Jul 25, 2024 at 9:06 AM vignesh C  wrote:
> > >
> > > The attached v20240725 version patch has the changes for the same.
> >
> > Thank You for addressing the comments. Please review below issues:
> >
> > 1) Sub ahead of pub due to wrong initial sync of last_value for
> > non-incremented sequences. Steps at [1]
> > 2) Sequence's min value is not honored on sub during replication. Steps at 
> > [2]
>
> One more issue:
> 3)  Sequence datatype's range is not honored on sub during
> replication, while it is honored for tables.
>
>
> Behaviour for tables:
> -
> Pub: create table tab1( i integer);
> Sub: create table tab1( i smallint);
>
> Pub: insert into tab1 values(generate_series(1, 32768));
>
> Error on sub:
> 2024-07-25 10:38:06.446 IST [178680] ERROR:  value "32768" is out of
> range for type smallint
>
> -
> Behaviour for sequences:
> -
>
> Pub:
> CREATE SEQUENCE myseq_i as integer INCREMENT 10000 START 1;
>
> Sub:
> CREATE SEQUENCE myseq_i as smallint INCREMENT 10000 START 1;
>
> Pub:
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i');
> SELECT nextval('myseq_i'); -->brings value to 40001
>
> Sub:
> ALTER SUBSCRIPTION sub1 REFRESH PUBLICATION SEQUENCES;
> SELECT * from pg_sequences;  --> last_value has reached 40001, while the
> range only extends to 32767.

This issue is addressed in the v20240730_2 version patch attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3%2BXzHAbgyn8gmbBLK5goyv_uyGgHEsTQxRZ8bVk6nAEg%40mail.gmail.com

Regards,
Vignesh




Re: Remove duplicate table scan in logical apply worker and code refactoring

2024-07-29 Thread Amit Kapila
On Thu, Jul 25, 2024 at 4:00 PM Zhijie Hou (Fujitsu)
 wrote:
>
> When reviewing the code in logical/worker.c, I noticed that when applying a
> cross-partition update action, it scans the old partition twice.
> I am attaching the patch 0001 to remove this duplicate table scan.
>
> The test shows that it brings noticeable improvement:
>
> Steps
> -
> Pub:
> create table tab (a int not null, b int);
> alter table tab replica identity full;
> insert into tab select 1,generate_series(1, 100, 1);
>
> Sub:
> create table tab (a int not null, b int) partition by range (b);
> create table tab_1 partition of tab for values from (minvalue) to (500);
> create table tab_2 partition of tab for values from (500) to (maxvalue);
> alter table tab replica identity full;
>
>
> Test query:
> update tab set b = 600 where b > 00; -- UPDATE 100
>
> Results (The time spent by apply worker to apply the all the UPDATEs):
> Before  14s
> After   7s
> -
>

The idea sounds good to me. BTW, we don't need the following comment
in the 0001 patch:
We
+ * don't call apply_handle_delete_internal() here to avoid
+ * repeating some work already done above to find the
+ * local tuple in the partition.

It is implied by the change and we already follow the same for the update.

-- 
With Regards,
Amit Kapila.




Re: Add ALL_CANDIDATES option to EXPLAIN

2024-07-29 Thread Ashutosh Bapat
On Fri, Jul 26, 2024 at 10:47 PM Robert Haas  wrote:
>
> On Fri, Jul 26, 2024 at 12:59 PM Anthonin Bonnefoy
>  wrote:
> > I have a prototype for an ALL_CANDIDATES option for EXPLAIN. The goal
> > of this option is to print all plan candidates instead of only the
> > cheapest plan. It will output the plans from the most expensive at the
> > top to the cheapest. Here's an example:
>
> It's difficult for me to understand how this can work. Either it's not
> really going to print out all candidates, or it's going to print out
> gigabytes of output for moderately complex queries.
>
> I've thought about trying to figure out some way of identifying and
> printing out plans that are "interestingly different" from the chosen
> plan, with the costs they would have had, but I haven't been able to
> come up with a good algorithm. Printing out absolutely everything
> doesn't seem viable, because planning would be slow and use amazing
> amounts of memory and the output would be so large as to be useless.

If we print the path forest as a forest as against individual path
trees, we will be able to cut down on the size but it will still be
huge. Irrespective of that even with slightly non-trivial queries it's
going to be difficult to analyze these paths. The way I think of it is
dumping this information in the form of tables. Roughly something like
a table containing RelOptInfo id and RelOptInfo itself and another
containing all the paths identified by id and RelOptInfo id. The path
linkages are stored as path ids. That's a minimum. We will need more
tables to store query, and other metadata. If we do so we can use SQL
to carry out investigations.
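
To make this concrete, here is a minimal sketch of what such a dump schema
and an investigation query could look like. Every name below is invented
purely for illustration; nothing like this exists in the patch or in core:

-- Hypothetical schema for a dumped path forest (all names invented).
CREATE TABLE planner_reloptinfo (
    reloptinfo_id int PRIMARY KEY,
    relids        text,   -- textual form of the RelOptInfo's relid set
    info          jsonb   -- remaining RelOptInfo metadata
);

CREATE TABLE planner_path (
    path_id       int PRIMARY KEY,
    reloptinfo_id int REFERENCES planner_reloptinfo,
    pathtype      text,
    startup_cost  float8,
    total_cost    float8,
    subpath_ids   int[]   -- path linkages stored as path ids
);

-- Example investigation: all paths within 10% of the cheapest path
-- for each RelOptInfo.
SELECT p.reloptinfo_id, p.path_id, p.pathtype, p.total_cost
FROM planner_path p
WHERE p.total_cost <= 1.1 * (SELECT min(total_cost)
                             FROM planner_path
                             WHERE reloptinfo_id = p.reloptinfo_id)
ORDER BY p.reloptinfo_id, p.total_cost;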

-- 
Best Wishes,
Ashutosh Bapat




Re: tls 1.3: sending multiple tickets

2024-07-29 Thread Daniel Gustafsson
> On 26 Jul 2024, at 20:29, Robert Haas  wrote:

> One of my chronic complaints about comments is
> that they should say why we're doing things, not what we're doing.

Agreed.

> I feel like any
> place where we are doing X because of some property of a non-PG code
> base with which a particular reader might not be familiar, we should
> have a comment explaining why we're doing it. And especially if it's
> security-relevant.

I'm sure there are more interactions with OpenSSL, and TLS in general, that
warrant better comments, but the attached takes a stab at the two examples in
question here to get started (to avoid letting perfect get in the way of progress).

--
Daniel Gustafsson



openssl_comments.diff
Description: Binary data


Re: Conflict detection and logging in logical replication

2024-07-29 Thread Amit Kapila
On Mon, Jul 29, 2024 at 11:44 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, July 26, 2024 7:34 PM Amit Kapila  wrote:
> >
> > On Thu, Jul 25, 2024 at 4:12 PM Amit Kapila 
> > wrote:
> > > >
> > A few more comments:
>
> Thanks for the comments.
>
> > 1.
> > For duplicate key, the patch reports conflict as following:
> > ERROR:  conflict insert_exists detected on relation "public.t1"
> > 2024-07-26 11:06:34.570 IST [27800] DETAIL:  Key (c1)=(1) already exists in
> > unique index "t1_pkey", which was modified by origin 1 in transaction 770 at
> > 2024-07-26 09:16:47.79805+05:30.
> > 2024-07-26 11:06:34.570 IST [27800] CONTEXT:  processing remote data for
> > replication origin "pg_16387" during message type "INSERT" for replication
> > target relation "public.t1" in transaction 742, finished at 0/151A108
> >
> > In detail, it is better to display the origin name instead of the origin 
> > id. This will
> > be similar to what we do in CONTEXT information.
>
>
> Agreed. Before modifying this, I'd like to confirm the message style in the
> cases where origin id may not have a corresponding origin name (e.g., if the
> data was modified locally (id = 0), or if the origin that modified the data 
> has
> been dropped). I thought of two styles:
>
> 1)
> - for local change: "xxx was modified by a different origin \"(local)\" in 
> transaction 123 at 2024.."
> - for dropped origin: "xxx was modified by a different origin \"(unknown)\" 
> in transaction 123 at 2024.."
>
> One issue for this style is that user may create an origin with the same name
> here (e.g. "(local)" and "(unknown)").
>
> 2)
> - for local change: "xxx was modified locally in transaction 123 at 2024.."
>

This sounds good.

> - for dropped origin: "xxx was modified by an unknown different origin 1234 
> in transaction 123 at 2024.."
>

For this one, how about: "xxx was modified by a non-existent origin in
transaction 123 at 2024.."?

Also, please add comments in the code explaining when each of these two cases can occur.

-- 
With Regards,
Amit Kapila.




Re: why is pg_upgrade's regression run so slow?

2024-07-29 Thread Alexander Lakhin

Hello Andrew,

28.07.2024 22:43, Andrew Dunstan wrote:


Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on HEAD for fairywren and drongo - fairywren has
just gone green, and I expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.



I'm observing the same here (using a Windows 10 VM).

With no TEMP_CONFIG set, `meson test` gives me these numbers:
test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 72.34s

test: postgresql:regress / regress/regress
duration: 41.98s

But with debug_parallel_query=regress in TEMP_CONFIG, I see:
test: postgresql:regress / regress/regress
duration: 50.08s

test: postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
duration: 398.45s

With log_min_messages=DEBUG2 added to TEMP_CONFIG, I can see how many
parallel workers were started during the test:
...\postgresql\build>grep "starting background worker process" -r 
testrun/pg_upgrade | wc -l
17532

And another interesting fact is that TEMP_CONFIG is apparently ignored by
`meson test regress/regress`. For example, with temp.config containing
invalid settings, `meson test pg_upgrade/002_pg_upgrade` fails, but
`meson test regress/regress` passes just fine.

Best regards,
Alexander




Re: Visibility bug with prepared transaction with subtransactions on standby

2024-07-29 Thread a.kozhemyakin

Hi,
I'm trying to run REL_13_STABLE recovery tests for Windows and I get the
error


waiting for server to shut down.. failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Stale postmaster.pid file for node "paris": PID 1868 no longer exists
Bail out! pg_ctl stop failed

I noticed that on the buildfarm recovery tests are disabled for Windows; was
this done intentionally?


'invocation_args' => [
'--config',
'./bowerbird.conf',
'--skip-steps',
'recovery-check',
'--verbose',
'REL_13_STABLE'
],

REL_13_STABLE (071e19a36) - test passed
REL_13_STABLE (e9c8747ee) - test failed

28.06.2024 1:35, Heikki Linnakangas wrote:

On 20/06/2024 17:10, Heikki Linnakangas wrote:

On 20/06/2024 16:41, Heikki Linnakangas wrote:

Attached is a patch to fix this, with a test case.


The patch did not compile, thanks to a last-minute change in a field
name. Here's a fixed version.


All I heard is crickets, so committed and backported to all supported 
versions.







Re: Pgoutput not capturing the generated columns

2024-07-29 Thread Peter Smith
Thanks for the patch updates.

Here are my review comments for v21-0001.

I think this patch is mostly OK now except there are still some
comments about the TAP test.

==
Commit Message

0.
Using Create Subscription:
CREATE SUBSCRIPTION sub2_gen_to_gen CONNECTION '$publisher_connstr' PUBLICATION
pub1 WITH (include_generated_columns = true, copy_data = false)"

If you are going to give an example, I think a gen-to-nogen example
would be a better choice. That's because the gen-to-gen (as you have
here) is not going to replicate anything due to the subscriber-side
column being generated.

==
src/test/subscription/t/011_generated.pl

1.
+$node_subscriber2->safe_psql('postgres',
+ "CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a
* 22) STORED, c int)"
+);

The subscriber2 node was intended only for all the tables where we
need include_generated_columns to be true. Mostly that is the
combination tests. (tab_gen_to_nogen, tab_nogen_to_gen, etc) OTOH,
table 'tab1' already existed. I don't think we need to bother
subscribing to tab1 from subscriber2 because every combination is
already covered by the combination tests. Let's leave this one alone.


~~~

2.
Huh? Where is the "tab_nogen_to_gen" combination test that I sent to
you off-list?

~~~

3.
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab_order (c int GENERATED ALWAYS AS (a * 22) STORED,
a int, b int)"
+);

Maybe you can test 'tab_order' on both subscription nodes but I think
it is overkill. IMO it is enough to test it on subscription2.

~~~

4.
+$node_subscriber->safe_psql('postgres',
+ "CREATE TABLE tab_alter (a int, b int, c int GENERATED ALWAYS AS (a
* 22) STORED)"
+);

Ditto above. Maybe you can test 'tab_alter' on both subscription nodes,
but I think it is overkill. IMO it is enough to test it on
subscription2.

~~~

5.
Don't forget to add initial data for the missing nogen_to_gen table/test.

~~~

6.
 $node_publisher->safe_psql('postgres',
- "CREATE PUBLICATION pub1 FOR ALL TABLES");
+ "CREATE PUBLICATION pub1 FOR TABLE tab1, tab_gen_to_gen,
tab_gen_to_nogen, tab_gen_to_missing, tab_missing_to_gen, tab_order");
+
 $node_subscriber->safe_psql('postgres',
  "CREATE SUBSCRIPTION sub1 CONNECTION '$publisher_connstr' PUBLICATION pub1"
 );

It is not a bad idea to reduce the number of publications as you have
done, but IMO jamming all the tables into 1 publication is too much
because it makes it less understandable instead of simpler.

How about this:
- leave the 'pub1' just for 'tab1'.
- have a 'pub_combo' for publication all the gen_to_nogen,
nogen_to_gen etc combination tests.
- and a 'pub_misc' for any other misc tables like tab_order.
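
In SQL terms the split could look something like this (publication names as
suggested above; the exact table list for pub_combo depends on the final set
of combination tests):

CREATE PUBLICATION pub1 FOR TABLE tab1;
CREATE PUBLICATION pub_combo FOR TABLE tab_gen_to_gen, tab_gen_to_nogen,
    tab_nogen_to_gen, tab_gen_to_missing, tab_missing_to_gen;
CREATE PUBLICATION pub_misc FOR TABLE tab_order;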

~~~

7.
+#
 # Wait for initial sync of all subscriptions
+#

I think you should write a note here that you have deliberately set
copy_data = false because COPY and include_generated_columns are not
allowed at the same time for patch 0001. And that is why all expected
results on subscriber2 will be empty. Also, say this limitation will
be changed in patch 0002.

~~~

(I didn't yet check 011_generated.pl file results beyond this point...
I'll wait for v22-0001 to review further)

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Use pgBufferUsage for block reporting in analyze

2024-07-29 Thread Anthonin Bonnefoy
On Sat, Jul 27, 2024 at 12:35 AM Masahiko Sawada  wrote:
> An alternative idea would
> be to pass StringInfo to AcquireSampleRowsFunc() so that callback can
> write its messages there. This is somewhat similar to what we do in
> the EXPLAIN command (cf, ExplainPropertyText() etc). It could be too
> much but I think it could be better than writing logs in the single
> format.

I've tested this approach, it definitely looks better. I've added a
logbuf StringInfo to AcquireSampleRowsFunc which will receive the
logs. elevel was removed as it is not used anymore. Since everything
is in the same log line, I've removed the relation name in the acquire
sample functions.

For partitioned tables, I've also included the child partition currently
being sampled. The output will look like:

INFO:  analyze of table "postgres.public.test_partition"
Sampling rows from child "public.test_partition_1"
pages: 5 of 5 scanned
tuples: 999 live tuples, 0 are dead; 999 tuples in sample, 999
estimated total tuples
Sampling rows from child "public.test_partition_2"
pages: 5 of 5 scanned
tuples: 1000 live tuples, 0 are dead; 1000 tuples in sample, 1000
estimated total tuples
avg read rate: 2.604 MB/s, avg write rate: 0.000 MB/s
...

For a file_fdw, the output will be:

INFO:  analyze of table "postgres.public.pglog"
tuples: 60043 tuples; 3 tuples in sample
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
...
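
For reference, output of this shape can be produced with something like the
following (a rough sketch with the patch applied; the row counts will differ
from the example above):

CREATE TABLE test_partition (a int, b int) PARTITION BY RANGE (b);
CREATE TABLE test_partition_1 PARTITION OF test_partition
    FOR VALUES FROM (MINVALUE) TO (500);
CREATE TABLE test_partition_2 PARTITION OF test_partition
    FOR VALUES FROM (500) TO (MAXVALUE);
INSERT INTO test_partition SELECT 1, g FROM generate_series(1, 1999) g;

-- Prints one "Sampling rows from child ..." block per partition:
ANALYZE (VERBOSE) test_partition;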

Regards,
Anthonin


v4-0002-Output-buffer-and-wal-usage-with-verbose-analyze.patch
Description: Binary data


v4-0001-Use-pgBufferUsage-for-block-reporting-in-analyze.patch
Description: Binary data


v4-0003-Pass-StringInfo-logbuf-to-AcquireSampleFunc.patch
Description: Binary data


Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 11:46, Peter Smith  wrote:
>
> Hi Vignesh,
>
> There are still pending changes from my previous review of the
> 0720-0003 patch [1], but here are some new review comments for your
> latest patch v20240525-0003.
> 2b.
> Is it better to name these returned by-ref ptrs like 'ret_log_cnt',
> and 'ret_lsn' to emphasise they are output variables? YMMV.

I felt this is OK, as we have mentioned it in the function header too.

> ==
> src/test/subscription/t/034_sequences.pl
>
> 4.
> Q. Should we be suspicious that log_cnt changes from '32' to '31', or
> is there a valid explanation? It smells like some calculation is
> off-by-one, but without debugging I can't tell if it is right or
> wrong.

It works like this: after every 33rd nextval we will get log_cnt as 0. So
for the 198th call (33 * 6) log_cnt will be 0, then for the 199th call log_cnt
will be 32 and for the 200th it will be 31. This pattern repeats, so this is ok.
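
This can be observed directly (a minimal sketch; log_cnt is read from the
sequence relation itself, and 32 is the number of values a nextval call
WAL-logs ahead):

CREATE SEQUENCE seq_demo;
SELECT nextval('seq_demo');    -- 1st call, logs 32 values ahead
SELECT log_cnt FROM seq_demo;  -- 32
SELECT nextval('seq_demo');    -- 2nd call consumes one pre-logged value
SELECT log_cnt FROM seq_demo;  -- 31
-- The 33rd call brings log_cnt to 0, and the 34th logs ahead again,
-- restarting the 32, 31, ... countdown.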

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh




Re: Logical Replication of sequences

2024-07-28 Thread vignesh C
On Fri, 26 Jul 2024 at 08:04, Peter Smith  wrote:
>
> Here are some review comments for latest patch v20240725-0002
>
> ==
> doc/src/sgml/ref/create_publication.sgml
>
> nitpick - tweak to the description of the example.
>
> ==
> src/backend/parser/gram.y
>
> preprocess_pub_all_objtype_list:
> nitpick - typo "allbjects_list"
> nitpick - reword function header
> nitpick - /alltables/all_tables/
> nitpick - /allsequences/all_sequences/
> nitpick - I think code is safe as-is because makeNode internally does
> palloc0, but OTOH adding Assert would be nicer just to remove any
> doubts.
>
> ==
> src/bin/psql/describe.c
>
> 1.
> + /* Print any publications */
> + if (pset.sversion >= 18)
> + {
> + int tuples = 0;
>
> No need to assign value 0 here, because this will be unconditionally
> assigned before use anyway.
>
> 
>
> 2. describePublications
>
>   has_pubviaroot = (pset.sversion >= 13);
> + has_pubsequence = (pset.sversion >= 18000);
>
> That's a bug! Should be 18, not 18000.
>
> ==
>
> And, please see the attached diffs patch, which implements the
> nitpicks mentioned above.

These are handled in the v20240729 version attached at [1].
[1] - 
https://www.postgresql.org/message-id/CALDaNm3SucGGLe-B-a_aqWNWQZ-yfxFTiAA0JyP-SwX4jq9Y3A%40mail.gmail.com

Regards,
Vignesh




Re: Add mention of execution time memory for enable_partitionwise_* GUCs

2024-07-28 Thread David Rowley
On Fri, 19 Jul 2024 at 17:24, Ashutosh Bapat
 wrote:
> I am fine if we want to mention that the executor may consume a large
> amount of memory when these GUCs are turned ON. Users may decide to
> turn those OFF if they can not afford to spend that much memory during
> execution. But I don't like tying execution time consumption with
> default OFF.

Would the attached address your concern about the reasons for defaulting to off?

David


partitionwise_docs.patch
Description: Binary data


Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread Bertrand Drouvot
Hi John,

On Thu, Jul 18, 2024 at 02:22:08PM -0700, John H wrote:
> Hi Bertrand,
> 
> > 1 ===
> > ...
> > That's worth additional comments in the code.
> 
> There's this comment already about caching the value already, not sure
> if you prefer something more?
> 
> /* Cache values to reduce contention on lock */

Yeah, at the same place as the static lsn[] declaration, something like:

static XLogRecPtr   lsn[NUM_SYNC_REP_WAIT_MODE]; /* cached LSNs */

but that may just be a matter of taste.

> > 3 ===
> > ...
> > NUM_SYNC_REP_WAIT_MODE is small but as the goal is the keep the lock time as
> > short as possible I wonder if it wouldn't be better to use memcpy() here 
> > instead
> > of this for loop.
> >
> 
> It results in a "Wdiscarded-qualifiers" warning, which is safe given we take
> the lock, but adds noise?
> What do you think?
> 
> "slot.c:2756:46: warning: passing argument 2 of ‘memcpy’ discards
> ‘volatile’ qualifier from pointer target type
> [-Wdiscarded-qualifiers]"

Right, we may want to cast it then, but given that the for loop is "small",
I think it's also fine to keep the for loop.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Flush pgstats file during checkpoints

2024-07-28 Thread Bertrand Drouvot
Hi,

On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:
> On Mon, Jul 22, 2024 at 07:01:41AM +, Bertrand Drouvot wrote:
> > 3 ===
> > 
> > +   /*
> > +* Read the redo LSN stored in the file.
> > +*/
> > +   if (!read_chunk_s(fpin, &file_redo) ||
> > +   file_redo != redo)
> > +   goto error;
> > 
> > I wonder if it would make sense to have dedicated error messages for
> > "file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
> > ease to diagnose as to why the stat file is discarded.
> 
> Yep.  This has been itching me quite a bit, and that's a bit more than
> just the format ID or the redo LSN: it relates to all the read_chunk()
> callers.  I've taken a shot at this with patch 0001, implemented on
> top of the rest.

Thanks! 0001 attached is 
v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attach the right one.

> Attaching a new v4 series, with all these comments addressed.

Thanks!

Looking at 0002:

1 ===

   if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+   elog(WARNING, "could not read data of stats kind %d for entry 
of type %c",
+   kind, t);

Nit: what about including the "info->shared_data_len" value in the WARNING?

2 ===

   if (!read_chunk_s(fpin, &name))
+  {
+   elog(WARNING, "could not read name of stats kind %d for entry 
of type %c",
+kind, t);
goto error;
+  }
   if (!pgstat_is_kind_valid(kind))
+  {
+   elog(WARNING, "invalid stats kind %d for entry of type %c",
+kind, t);
goto error;
+  }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?

  if (!read_chunk_s(fpin, &kind))
+ {
+  elog(WARNING, "could not read stats kind for entry of type %c", t);
   goto error;
+ }

Looking at 0003: LGTM

Looking at 0004: LGTM

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Conflict detection and logging in logical replication

2024-07-28 Thread Amit Kapila
On Fri, Jul 26, 2024 at 4:28 PM shveta malik  wrote:
>
> On Fri, Jul 26, 2024 at 3:56 PM Amit Kapila  wrote:
> >
>
> > One more thing we need to consider is whether we should LOG or ERROR
> > for update/delete_differ conflicts. If we LOG as the patch is doing
> > then we are intentionally overwriting the row when the user may not
> > expect it. OTOH, without a patch anyway we are overwriting, so there
> > is an argument that logging by default is what the user will expect.
> > What do you think?
>
> I was under the impression that in this patch we do not intend to
> change behaviour of HEAD and thus only LOG the conflict wherever
> possible.
>

Earlier, I thought it was good to keep LOGGING the conflict where
there is no chance of wrong data update but for cases where we can do
something wrong, it is better to ERROR out. For example, for an
update_differ case where the apply worker can overwrite the data from
a different origin, it is better to ERROR out. I thought this case was
comparable to an existing ERROR case like a unique constraint
violation. But I see your point as well that one might expect the
existing behavior where we are silently overwriting the different
origin data. The one idea to address this concern is to suggest users
set the detect_conflict subscription option as off but I guess that
would make this feature unusable for users who don't want to ERROR out
for different origin update cases.

> And in the next patch of resolver, based on the user's input
> of error/skip/or resolve, we take the action. I still think it is
> better to stick to the said behaviour. Only if we commit the resolver
> patch in the same version where we commit the detection patch, then we
> can take the risk of changing this default behaviour to 'always
> error'. Otherwise users will be left with conflicts arising but no
> automatic way to resolve those. But for users who really want their
> application to error out, we can provide an additional GUC in this
> patch itself which changes the behaviour to 'always ERROR on
> conflict'.
>

I don't see a need for a GUC here; even if we want one, we can have a
subscription option such as conflict_log_level. But users may want to
either LOG or ERROR based on conflict type. For example, there won't
be any data inconsistency in two node replication for delete_missing
case as one is trying to delete already deleted data, so LOGGING such
a case should be sufficient whereas update_differ could lead to
different data on two nodes, so the user may want to ERROR out in such
a case.

We can keep the current behavior as default for the purpose of
conflict detection but can have a separate patch to decide whether to
LOG/ERROR based on conflict_type.
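
Hypothetically, a per-conflict-type setting could look something like the
following. This is pure illustration: neither the option name nor the value
syntax exists anywhere yet.

CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres' PUBLICATION pub1
    WITH (detect_conflict = on,
          conflict_log_level = 'update_differ=error, delete_missing=log');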

-- 
With Regards,
Amit Kapila.




RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

> > IIUC, the patch which adds user_name attribute to get_connection() can be
> discussed
> > in later stage, is it right?
> 
> No, let's work on the patch at this stage :)

OK, here is a rebased patch.

- Changed the name of new API from `GetUserMappingFromOid` to 
`GetUserMappingByOid`
  to keep the name consistent with others.
- Comments and docs were updated.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch
Description:  0001-Extend-postgres_fdw_get_connections-to-return-user-n.patch


RE: [Proposal] Add foreign-server health checks infrastructure

2024-07-28 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for pushing and analyzing the failure!

> The regression.diffs shows that pgfdw_conn_check returned 0 even though
> pgfdw_conn_checkable()
> returned true. This can happen if the "revents" from poll() indicates 
> something
> other than
> POLLRDHUP. I think that "revents" could indicate POLLHUP, POLLERR, or
> POLLNVAL. Therefore,
> IMO pgfdw_conn_check() should be updated as follows. I will test this change.
> 
> -   return (input_fd.revents & POLLRDHUP) ? 1 : 0;
> +   return (input_fd.revents &
> +   (POLLRDHUP | POLLHUP | POLLERR |
> POLLNVAL)) ? 1 : 0;

I think you are right.
According to the man page, the input socket is invalid or disconnected if
revents has such bits. So such cases should also be regarded as *failure*.
After the fix, the animal said OK. Great work!

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Ashutosh Bapat
Thanks a lot Richard.

On Mon, Jul 29, 2024 at 8:56 AM Richard Guo  wrote:
>
> On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> > I've worked a bit more on the comments and commit message, and I plan
> > to push the attached soon, barring any objections or comments.
>
> Pushed.
>
> Thanks
> Richard



-- 
Best Wishes,
Ashutosh Bapat




Re: Reuse child_relids in try_partitionwise_join was Re: Assert failure on bms_equal(child_joinrel->relids, child_joinrelids)

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 5:44 PM Richard Guo  wrote:
> I've worked a bit more on the comments and commit message, and I plan
> to push the attached soon, barring any objections or comments.

Pushed.

Thanks
Richard




Re: Allow logical failover slots to wait on synchronous replication

2024-07-28 Thread shveta malik
On Fri, Jul 26, 2024 at 5:11 PM Amit Kapila  wrote:
>
> On Fri, Jul 26, 2024 at 3:28 PM shveta malik  wrote:
> >
> > On Tue, Jul 23, 2024 at 10:35 AM Amit Kapila  
> > wrote:
> > >
> > > On Tue, Jul 9, 2024 at 12:39 AM John H  wrote:
> > > >
> > > > > Out of curiosity, did you compare with 
> > > > > standby_slot_names_from_syncrep set to off
> > > > > and standby_slot_names not empty?
> > > >
> > > > I didn't think 'standby_slot_names' would impact TPS as much since
> > > > it's not grabbing the SyncRepLock but here's a quick test.
> > > > Writer with 5 synchronous replicas, 10 pg_recvlogical clients and
> > > > pgbench all running from the same server.
> > > >
> > > > Command: pgbench -c 4 -j 4 -T 600 -U "ec2-user" -d postgres -r -P 5
> > > >
> > > > Result with: standby_slot_names =
> > > > 'replica_1,replica_2,replica_3,replica_4,replica_5'
> > > >
> > > > latency average = 5.600 ms
> > > > latency stddev = 2.854 ms
> > > > initial connection time = 5.503 ms
> > > > tps = 714.148263 (without initial connection time)
> > > >
> > > > Result with: standby_slot_names_from_syncrep = 'true',
> > > > synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > > >
> > > > latency average = 5.740 ms
> > > > latency stddev = 2.543 ms
> > > > initial connection time = 4.093 ms
> > > > tps = 696.776249 (without initial connection time)
> > > >
> > > > Result with nothing set:
> > > >
> > > > latency average = 5.090 ms
> > > > latency stddev = 3.467 ms
> > > > initial connection time = 4.989 ms
> > > > tps = 785.665963 (without initial connection time)
> > > >
> > > > Again I think it's possible to improve the synchronous numbers if we
> > > > cache but I'll try that out in a bit.
> > > >
> > >
> > > Okay, so the tests done till now conclude that we won't get the
> > > benefit by using 'standby_slot_names_from_syncrep'. Now, if we
> > > increase the number of standby's in both lists and still keep ANY 3 in
> > > synchronous_standby_names then the results may vary. We should try to
> > > find out if there is a performance benefit with the use of
> > > synchronous_standby_names in the normal configurations like the one
> > > you used in the above tests to prove the value of this patch.
> > >
> >
> > I didn't fully understand the parameters mentioned above, specifically
> > what 'latency stddev' and 'latency average' represent.. But shouldn't
> > we see the benefit/value of this patch by having a setup where a
> > particular standby is slow in sending the response back to primary
> > (could be due to network lag or other reasons) and then measuring the
> > latency in receiving changes on failover-enabled logical subscribers?
> > We can perform this test with both of the below settings and say make
> > D and E slow in sending responses:
> > 1) synchronous_standby_names = 'ANY 3 (A,B,C,D,E)'
> > 2) standby_slot_names = A_slot, B_slot, C_slot, D_slot, E_slot.
> >
>
> Yes, I also expect the patch should perform better in such a scenario
> but it is better to test it. Also, irrespective of that, we should
> investigate why the reported case is slower for
> synchronous_standby_names and see if we can improve it.

+1

> BTW, for 2), I think you wanted to say synchronized_standby_slots,
> not standby_slot_names. We have recently changed the GUC name.

yes, sorry, synchronized_standby_slots it is.

thanks
Shveta




Re: Simplify create_merge_append_path a bit for clarity

2024-07-28 Thread Richard Guo
On Fri, Jul 26, 2024 at 1:28 PM Paul A Jungwirth
 wrote:
> Is there a reason you don't want to remove the required_outer
> parameter altogether? I guess because it is such a common pattern to
> pass it?

I think it's best to keep this parameter unchanged to maintain
consistency with other functions that create path nodes in pathnode.c.

> Do you think it is worth keeping this assertion?:
>
> -
> -/* All child paths must have same parameterization */
> -Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>
> I understand any failure would trigger one of the prior asserts
> instead, but it does communicate an extra requirement, and there is no
> cost.

I don't think it's a good idea to keep this Assert: with this change
it becomes redundant.

> But I'd be fine with committing this patch as-is.

I've pushed this patch.  Thanks for review.

Thanks
Richard




Re: Maybe don't process multi xmax in FreezeMultiXactId() if it is already marked as invalid?

2024-07-28 Thread Anton A. Melnikov



On 19.06.2024 21:06, Peter Geoghegan wrote:

On Wed, Jun 19, 2024 at 1:39 PM Alvaro Herrera  wrote:

FWIW I don't think HEAP_XMAX_INVALID is purely a hint.
HEAP_XMAX_COMMITTED is a hint, for sure, as is HEAP_XMIN_COMMITTED on
its own; but as far as I recall, the INVALID flags must persist once
set.


Seems we disagree on some pretty fundamental things in this area, then.


To resolve this situation, it seems necessary to agree on what exactly a
"hint bit" means and how to use it.

For example, in this way:

1) Definition. A "hint bit", if set, represents the presence of some property
of an object (e.g. a tuple).
The value of a hint bit can be derived again at any time, so it is acceptable
for a hint bit to be lost during some operations.

2) Purpose. (It has already been defined by Yura Sokolov in one of the previous letters.)
Some work (e.g. CPU + memory usage) must be done to check the property of some object.
Checking the hint bit, if it is set, saves this work.
So the purpose of the hint bit is optimization.

3) Use. By default, code that needs to check some property of an object
must first check the corresponding hint bit. If the hint is set, conclude that
the property is present. If the hint is not set, do the work to check the
property of the object and set the hint bit if the property is present.
Non-standard behavior is also allowed, where the hint bit is ignored and the
property check is performed unconditionally for some reason. In this case the
code must contain a comment explaining that reason.

And maybe, for clarity, explicitly say that a bit is a hint right in its
definition?
For instance, use HEAP_XMIN_COMMITTED_HINT instead of HEAP_XMIN_COMMITTED.


Remarks and concerns are gratefully welcome.
 


With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alexander Korotkov
On Sun, Jul 28, 2024 at 12:59 PM Alena Rybakina
 wrote:
> On 27.07.2024 13:56, Alexander Korotkov wrote:
> > On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
> >   wrote:
> >> To be honest, I have found a big problem in this patch - we try to perform 
> >> the transformation every time we examine a column:
> >>
> >> for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++) { ...
> >>
> >> }
> >>
> >> I have fixed it and moved the transformation before going through the loop.
> > What makes you think there is a problem?
>
> To be honest, I was bothered by the fact that we need to go through
> expressions several times that obviously will not fit under other
> conditions.
> Just yesterday I thought that it would be worthwhile to create a list of
> candidates - expressions that did not fit because the column did not
> match the index (!match_index_to_operand(nconst_expr, indexcol, index)).

I admit that this area probably could use some optimization,
especially for the case of many clauses and many indexes.  But in the
scope of this patch, I think this is enough to not make things worse
in this area.

> Another problem, related to the first one, is that the boolexpr could
> contain expressions referring to different operands, for example, both x
> and y. that is, we may have the problem that the optimal "ANY"
> expression may not be used, because the expression with x may come
> earlier and the loop may end earlier.
>
> alena@postgres=# create table b (x int, y int);
> CREATE TABLE
> alena@postgres=# insert into b select id, id from
> generate_series(1,1000) as id;
> INSERT 0 1000
> alena@postgres=# create index x_idx on b(x);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
> alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or
> x=6 or x = 7 or x=8 or x=9;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x =
> 8) OR (x = 9))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9 or y=1;
>QUERY PLAN
> ---
>   Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
> Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x =
> 9) OR (y = 1))
> (2 rows)
> alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x =
> 7 or x=8 or x=9;
> QUERY PLAN
> 
>   Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
> Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
> (2 rows)
>
> Furthermore expressions can be stored in a different order.
> For example, first comes "AND" expr, and then group of "OR" expr, which
> we can convert to "ANY" expr, but we won't do this due to the fact that
> we will exit the loop early, according to this condition:
>
> if (!IsA(sub_rinfo->clause, OpExpr))
> return NULL;
>
> or it may occur due to other conditions.
>
> alena@postgres=# create index x_y_idx on b(x,y);
> CREATE INDEX
> alena@postgres=# analyze;
> ANALYZE
>
> alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4
> or x=5 or x=6 or x = 7 or x=8 or x=9;
>   QUERY PLAN
> -
>   Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
> Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR
> (x = 7) OR (x = 8) OR (x = 9))
> (2 rows)
>
> Because of these reasons, I tried to save this and that transformation
> together for each column and try to analyze for each expr separately
> which method would be optimal.

Yes, with v27 of the patch, optimization wouldn't work in these cases.
However, you are using quite a small table.  If you use a larger
table or disable sequential scans, there would be bitmap plans to
handle these queries.  So, v27 doesn't make the situation worse. It
just doesn't optimize all that it could potentially optimize and
that's OK.

I've written a separate 0002 patch to address this.  Now, before
generation of paths for bitmap OR, similar OR entries are grouped
together.  When considering a group of similar entries, they are
considered both together and one-by-one.  Ideally we could consider
more sophisticated grouping, but that seems fine for now.  You can
check how this patch handles the cases of above.

Also, 0002 addresses the issue of duplicated bitmap scan conditions in
different forms. During generate_bitmap_or_paths() we need to exclude the
considered condition for other clauses.  It couldn't simply be filtered out
in the later stage, because it could reach the index in
another form.

> > Do you have a test 

Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Joel Jacobson
On Sun, Jul 28, 2024, at 21:18, Dean Rasheed wrote:
> Attachments:
> * v3-0002-Optimise-numeric-multiplication-using-base-NBASE-.patch
> * v3-0001-Extend-mul_var_short-to-5-and-6-digit-inputs.patch

Very nice.

I've done some initial benchmarks on my Intel Core i9-14900K machine.

To reduce noise, I've isolated a single CPU core, specifically CPU core id 31,
so that the kernel does not schedule any other work on it:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-116-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv 
ro quiet splash isolcpus=31 intel_pstate=disable vt.handoff=7

Then, I've used sched_setaffinity() from <sched.h> to ensure the benchmark runs
on CPU core id 31.

I've also fixed the CPU frequency to 3.20 GHz:

$ sudo cpufreq-info -c 31
...
  current CPU frequency is 3.20 GHz (asserted by call to hardware).

I've benchmarked each (var1ndigits, var2ndigits) 10 times per commit, in random 
order.

I've benchmarked all commits after "SQL/JSON: Various improvements to SQL/JSON 
query function docs"
which is the parent commit to "Optimise numeric multiplication for short 
inputs.",
including the two patches.

I've benchmarked each commit affecting numeric.c, and each such commit's parent 
commit, for comparison.

The accum_change column shows the cumulative percentage change since the
baseline commit (SQL/JSON: Various improvements).

There is at least single-digit percentage noise in the measurements,
which is apparent since the rate fluctuates even between commits
for cases we know can't be affected by the change.
Still, even with this noise level, the improvements are very visible and 
consistent.

 ndigits |    rate    | accum_change | summary
---------+------------+--------------+--------------------------------
 (1,1) |  1.702e+07 |  | SQL/JSON: Various improvements
 (1,1) |  2.201e+07 | +29.32 % | Optimise numeric multiplicatio
 (1,1) |  2.268e+07 | +33.30 % | Use diff's --strip-trailing-cr
 (1,1) |  2.228e+07 | +30.92 % | Improve the numeric width_buck
 (1,1) |  2.195e+07 | +29.01 % | Add missing pointer dereferenc
 (1,1) |  2.241e+07 | +31.68 % | Extend mul_var_short() to 5 an
 (1,1) |  2.130e+07 | +25.17 % | Optimise numeric multiplicatio
 (1,2) |  1.585e+07 |  | SQL/JSON: Various improvements
 (1,2) |  2.227e+07 | +40.49 % | Optimise numeric multiplicatio
 (1,2) |  2.140e+07 | +35.00 % | Use diff's --strip-trailing-cr
 (1,2) |  2.227e+07 | +40.51 % | Improve the numeric width_buck
 (1,2) |  2.183e+07 | +37.75 % | Add missing pointer dereferenc
 (1,2) |  2.241e+07 | +41.41 % | Extend mul_var_short() to 5 an
 (1,2) |  2.223e+07 | +40.26 % | Optimise numeric multiplicatio
 (1,3) |  1.554e+07 |  | SQL/JSON: Various improvements
 (1,3) |  2.155e+07 | +38.70 % | Optimise numeric multiplicatio
 (1,3) |  2.140e+07 | +37.74 % | Use diff's --strip-trailing-cr
 (1,3) |  2.139e+07 | +37.66 % | Improve the numeric width_buck
 (1,3) |  2.234e+07 | +43.76 % | Add missing pointer dereferenc
 (1,3) |  2.142e+07 | +37.83 % | Extend mul_var_short() to 5 an
 (1,3) |  2.066e+07 | +32.97 % | Optimise numeric multiplicatio
 (1,4) |  1.450e+07 |  | SQL/JSON: Various improvements
 (1,4) |  2.113e+07 | +45.70 % | Optimise numeric multiplicatio
 (1,4) |  2.121e+07 | +46.30 % | Use diff's --strip-trailing-cr
 (1,4) |  2.115e+07 | +45.85 % | Improve the numeric width_buck
 (1,4) |  2.166e+07 | +49.37 % | Add missing pointer dereferenc
 (1,4) |  2.053e+07 | +41.56 % | Extend mul_var_short() to 5 an
 (1,4) |  2.085e+07 | +43.82 % | Optimise numeric multiplicatio
 (1,8) |  1.440e+07 |  | SQL/JSON: Various improvements
 (1,8) |  1.963e+07 | +36.38 % | Optimise numeric multiplicatio
 (1,8) |  2.018e+07 | +40.19 % | Use diff's --strip-trailing-cr
 (1,8) |  2.045e+07 | +42.05 % | Improve the numeric width_buck
 (1,8) |  1.998e+07 | +38.79 % | Add missing pointer dereferenc
 (1,8) |  1.953e+07 | +35.68 % | Extend mul_var_short() to 5 an
 (1,8) |  1.992e+07 | +38.36 % | Optimise numeric multiplicatio
 (1,16)|  9.444e+06 |  | SQL/JSON: Various improvements
 (1,16)|  1.235e+07 | +30.75 % | Optimise numeric multiplicatio
 (1,16)|  1.232e+07 | +30.47 % | Use diff's --strip-trailing-cr
 (1,16)|  1.239e+07 | +31.18 % | Improve the numeric width_buck
 (1,16)|  1.222e+07 | +29.35 % | Add missing pointer dereferenc
 (1,16)|  1.220e+07 | +29.14 % | Extend mul_var_short() to 5 an
 (1,16)|  1.271e+07 | +34.54 % | Optimise numeric multiplicatio
 (1,32)|  5.790e+06 |  | 

Re: pg_upgrade failing for 200+ million Large Objects

2024-07-28 Thread Tom Lane
I wrote:
> Alexander Korotkov  writes:
>> J4F, I have an idea to count number of ';' sings and use it for
>> transaction size counter, since it is as upper bound estimate of
>> number of SQL commands :-)

> Hmm ... that's not a completely silly idea.  Let's keep it in
> the back pocket in case we can't easily reduce the number of
> SQL commands in some cases.

After poking at this for awhile, we can fix Justin's example
case by avoiding repeated UPDATEs on pg_attribute, so I think
we should do that.  It seems clearly a win, with no downside
other than a small increment of complexity in pg_dump.
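
To illustrate the shape of that change (the commands below are illustrative
only; the real ones are generated by dumpTableSchema in binary-upgrade mode),
the point is to replace one UPDATE per column, each triggering its own
relcache invalidation:

UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname = 'a0';
UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname = 'a1';

with a single command covering all affected columns:

UPDATE pg_catalog.pg_attribute SET attislocal = false
  WHERE attrelid = 'test_inh_check_child0'::pg_catalog.regclass
    AND attname IN ('a0', 'a1');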

However, that's probably not sufficient to mark this issue
as closed.  It seems likely that there are other patterns
that would cause backend memory bloat.  One case that I found
is tables with a lot of inherited constraints (not partitions,
but old-style inheritance).  For example, load the output of
this Perl script into a database:

-
for (my $i = 0; $i < 100; $i++)
{
print "CREATE TABLE test_inh_check$i (\n";
for (my $j = 0; $j < 1000; $j++)
{
print "a$j float check (a$j > 10.2),\n";
}
print "b float);\n";
print "CREATE TABLE test_inh_check_child$i() 
INHERITS(test_inh_check$i);\n";
}
-

pg_dump is horrendously slow on this, thanks to O(N^2) behavior in
ruleutils.c, and pg_upgrade is worse --- and leaks memory too in
HEAD/v17.  The slowness was there before, so I think the lack of
field complaints indicates that this isn't a real-world use case.
Still, it's bad if pg_upgrade fails when it would not have before,
and there may be other similar issues.

So I'm forced to the conclusion that we'd better make the transaction
size adaptive as per Alexander's suggestion.

In addition to the patches attached, I experimented with making
dumpTableSchema fold all the ALTER TABLE commands for a single table
into one command.  That's do-able without too much effort, but I'm now
convinced that we shouldn't.  It would break the semicolon-counting
hack for detecting that tables like these involve extra work.
I'm also not very confident that the backend won't have trouble with
ALTER TABLE commands containing hundreds of subcommands.  That's
something we ought to work on probably, but it's not a project that
I want to condition v17 pg_upgrade's stability on.

Anyway, proposed patches attached.  0001 is some trivial cleanup
that I noticed while working on the failed single-ALTER-TABLE idea.
0002 merges the catalog-UPDATE commands that dumpTableSchema issues,
and 0003 is Alexander's suggestion.

regards, tom lane

From 34ebed72e9029f690e5d3f0cb7464670e83caa5c Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 13:02:27 -0400
Subject: [PATCH v1 1/3] Fix missing ONLY in one dumpTableSchema command.

The various ALTER [FOREIGN] TABLE commands issued by dumpTableSchema
all use ONLY, except for this one.  I think it's not a live bug
because we don't permit foreign tables to have children, but
it's still inconsistent.  I happened across this while refactoring
dumpTableSchema to merge all its ALTERs into one; while I'm not
certain we should actually do that, this seems like good cleanup.
---
 src/bin/pg_dump/pg_dump.c| 2 +-
 src/bin/pg_dump/t/002_pg_dump.pl | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b8b1888bd3..7cd6a7fb97 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16344,7 +16344,7 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			if (tbinfo->relkind == RELKIND_FOREIGN_TABLE &&
 tbinfo->attfdwoptions[j][0] != '\0')
 appendPQExpBuffer(q,
-  "ALTER FOREIGN TABLE %s ALTER COLUMN %s OPTIONS (\n"
+  "ALTER FOREIGN TABLE ONLY %s ALTER COLUMN %s OPTIONS (\n"
   "%s\n"
   ");\n",
   qualrelname,
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d3dd8784d6..5bcc2244d5 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -1154,7 +1154,7 @@ my %tests = (
 
 	'ALTER FOREIGN TABLE foreign_table ALTER COLUMN c1 OPTIONS' => {
 		regexp => qr/^
-			\QALTER FOREIGN TABLE dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
+			\QALTER FOREIGN TABLE ONLY dump_test.foreign_table ALTER COLUMN c1 OPTIONS (\E\n
 			\s+\Qcolumn_name 'col1'\E\n
 			\Q);\E\n
 			/xm,
-- 
2.43.5

From c8f0d0252e292f276fe9631ae31e6aea11d294d2 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Sun, 28 Jul 2024 16:19:48 -0400
Subject: [PATCH v1 2/3] Reduce number of commands dumpTableSchema emits for
 binary upgrade.

Avoid issuing a separate SQL UPDATE command for each column when
directly manipulating pg_attribute contents in binary upgrade mode.
With the separate updates, we triggered a relcache invalidation with
each update.  For a table with N columns, that causes O(N^2) relcache
bloat in txn_size mode because the 

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread David Rowley
On Mon, 29 Jul 2024 at 04:31, Tom Lane  wrote:
> It is not clear to me exactly which of these pointers should be
> presumed to be possibly-null, but certainly testing ident after
> storing through it is pretty pointless.  Maybe what was intended
> was
>
> -   if (ident && strcmp(*name, "dynahash") == 0)
> +   if (*name && strcmp(*name, "dynahash") == 0)

It should be *ident. I just missed adding the pointer dereference when
moving that code to a function.

Thanks for the report. I'll fix shortly.

David




Re: Pluggable cumulative statistics

2024-07-28 Thread Dmitry Dolgov
> On Sun, Jul 28, 2024 at 10:20:45PM GMT, Michael Paquier wrote:
> I would like to apply this new infrastructure stuff and move on to the
> problems related to the scability of pg_stat_statements.  So, are
> there any objections with all that?

So far I've got nothing against :)




Re: Speed up collation cache

2024-07-28 Thread Jeff Davis
On Sun, 2024-07-28 at 00:14 +0200, Andreas Karlsson wrote:
> But even without that extra optimization I think this patch is worth 
> merging and the patch is small, simple and clean and easy to
> understand 
> and a just a clear speed up. Feels like a no brainer. I think that it
> is 
> ready for committer.

Committed, thank you.

> And then we can discuss after committing if an additional cache of
> the 
> last locale is worth it or not.

Yeah, I'm holding off on that until refactoring in the area settles,
and we'll see if it's still worth it.

Regards,
Jeff Davis





Re: why is pg_upgrade's regression run so slow?

2024-07-28 Thread Andrew Dunstan



On 2024-07-27 Sa 6:48 PM, Tom Lane wrote:

Andrew Dunstan  writes:

On 2024-07-27 Sa 10:20 AM, Tom Lane wrote:

Just to add some more fuel to the fire: I do *not* observe such an
effect on my own animals.

The culprit appears to be meson. When I tested running crake with
"using_meson => 0" I got results in line with yours.

Interesting.  Maybe meson is over-aggressively trying to run these
test suites in parallel?




Maybe, IDK. Meanwhile, I disabled "debug_parallel_query = regress" on
HEAD for fairywren and drongo - fairywren has just gone green, and I
expect drongo will when it reports in a few hours.


I'm at a loss for an explanation.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Optimize mul_var() for var1ndigits >= 8

2024-07-28 Thread Dean Rasheed
On Fri, 12 Jul 2024 at 13:34, Dean Rasheed  wrote:
>
> Then I tried compiling with -m32, and unfortunately this made the
> patch slower than HEAD for small inputs:
>
> -- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
> call rate=5.052332e+06  -- HEAD
> call rate=3.883459e+06  -- v2 patch
>
> -- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
> call rate=4.7221405e+06  -- HEAD
> call rate=3.7965685e+06  -- v2 patch
>
> -- var1ndigits1=7, var2ndigits2=7 [-m32, SIMD disabled]
> call rate=4.4638375e+06   -- HEAD
> call rate=3.39948e+06 -- v2 patch
>
> and it doesn't reach parity until around ndigits=26, which is
> disappointing. It does still get much faster for large inputs
>

I spent some more time hacking on this, to try to improve the
situation for 32-bit builds. One of the biggest factors was the 64-bit
division that is necessary during carry propagation, which is very
slow -- every compiler/platform I looked at on godbolt.org turns this
into a call to a slow function like __udivdi3(). However, since we are
dividing by a constant (NBASE^2), it can be done using the same
multiply-and-shift trick that compilers use in 64-bit builds, and that
significantly improves performance.

Unfortunately, in 32-bit builds, using 64-bit integers is still slower
for small inputs (below 20-30 NBASE digits), so in the end I figured
that it's best to retain the old 32-bit base-NBASE multiplication code
for small inputs, and only use 64-bit base-NBASE^2 multiplication when
the inputs are above a certain size threshold. Furthermore, since this
threshold is quite low, it's possible to make an additional
simplification: as long as the threshold is less than or equal to 42,
it can be shown that there is no chance of 32-bit integer overflow,
and so the "maxdig" tracking and renormalisation code is not needed.
Getting rid of that makes the outer multiplication loop very simple,
and makes quite a noticeable difference to the performance for inputs
below the threshold.

In 64-bit builds, doing 64-bit base-NBASE^2 multiplication is faster
for all inputs over 4 or 5 NBASE digits, so the threshold can be much
lower. However, for numbers near that threshold, it's a close thing,
so it makes sense to also extend mul_var_small() to cover 1-6 digit
inputs, since that gives a much larger gain for numbers of that size.
That's quite nice because it equates to inputs with up to 21-24
decimal digits, which I suspect are quite commonly used in practice.

One risk in having different thresholds in 32-bit and 64-bit builds is
that it opens up the possibility that the results from the
reduced-rscale computations used by inexact functions like ln() and
exp() might be different (actually, this may already be a
possibility, due to div_var_fast()'s use of floating point arithmetic,
but that seems considerably less likely). In practice though, this
should be extremely unlikely, due to the fact that any difference
would have to propagate through MUL_GUARD_DIGITS NBASE digits (8
decimal digits), and it doesn't show up in any of the existing tests.
IMO a very small chance of different results on different platforms is
probably acceptable, but it wouldn't be acceptable to make the
threshold a runtime configurable parameter that could lead to
different results on the same platform.

Overall, this has turned out to be more code than I would have liked,
but I think it's worth it, because the performance gains are pretty
substantial across the board.

(Here, I'm comparing with REL_17_STABLE, so that the effect of
mul_var_short() for ndigits <= 6 can be seen.)

32-bit build


Balanced inputs:

 ndigits1 | ndigits2 |   PG17 rate   |  patch rate   | % change
----------+----------+---------------+---------------+----------
1 |1 |  5.973068e+06 |  6.873059e+06 | +15.07%
2 |2 |  5.646598e+06 | 6.6456665e+06 | +17.69%
3 |3 | 5.8176995e+06 | 7.0903175e+06 | +21.87%
4 |4 |  5.545298e+06 | 6.9701605e+06 | +25.69%
5 |5 | 5.2109155e+06 | 6.6406185e+06 | +27.44%
6 |6 | 4.9276335e+06 |  6.478698e+06 | +31.48%
7 |7 | 4.6595095e+06 | 4.8514485e+06 | +4.12%
8 |8 |  4.898536e+06 | 5.1599975e+06 | +5.34%
9 |9 |  4.537171e+06 |  4.830987e+06 | +6.48%
   10 |   10 |  4.308139e+06 | 4.6029265e+06 | +6.84%
   11 |   11 |  4.092612e+06 |   4.37966e+06 | +7.01%
   12 |   12 | 3.9345035e+06 |  4.213998e+06 | +7.10%
   13 |   13 | 3.7600162e+06 | 4.0115955e+06 | +6.69%
   14 |   14 | 3.5959855e+06 | 3.8216508e+06 | +6.28%
   15 |   15 | 3.3576732e+06 | 3.6070588e+06 | +7.43%
   16 |   16 |  3.657139e+06 | 3.9067975e+06 | +6.83%
   17 |   17 | 3.3601955e+06 | 3.5651722e+06 | +6.10%
   18 |   18 | 3.1844472e+06 | 3.4542238e+06 | +8.47%
   19 |   19 | 3.0286392e+06 | 3.2380662e+06 | +6.91%
   20 |   20 | 2.9012185e+06 | 3.0496352e+06 

Re: UUID v7

2024-07-28 Thread Andrey M. Borodin


> On 24 Jul 2024, at 04:09, Sergey Prokhorenko  
> wrote:
> 
> Implementations MAY alter the actual timestamp.

Hmm… looks like we slightly misinterpreted words about clock source.
Well, that’s great, let’s get offset back.
PFA version accepting offset interval.
It works like this:
postgres=# select uuidv7(interval '-2 months');
 018fc02f-0996-7136-aeb4-8936b5a516a1


postgres=# select uuid_extract_timestamp(uuidv7(interval '-2 months'));
 2024-05-28 22:11:15.71+05
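
A quick sanity check of the offset variant (a sketch; uuidv7() taking an
interval argument only exists with this patch applied). Since a version-7
UUID starts with its big-endian timestamp, a UUID generated with an earlier
offset should compare as smaller:

postgres=# select uuidv7(interval '-2 months') < uuidv7();
 t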

What do you think?


Best regards, Andrey Borodin.


v26-0001-Implement-UUID-v7.patch
Description: Binary data


Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
>> Is it time to worry yet?  If this were HEAD only, I'd not be too
>> concerned; but two of these three are on allegedly-stable branches.
>> And we have releases coming up fast.

> I don't know; neither decision feels terrible to me.

Yeah, same here.  Obviously, it'd be better to spend effort on getting
the bug fix committed than to spend effort on some cosmetic
workaround.

The fact that the failure is in the isolation tests not the core
regression tests reduces my level of concern somewhat about shipping
it this way.  I think that packagers typically run the core tests
not check-world during package verification, so they won't hit this.

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Noah Misch
On Sun, Jul 28, 2024 at 11:50:33AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
> >> A recent buildfarm test failure [1] showed that the
> >> intra-grant-inplace-db.spec test added with 0844b3968 may fail
> 
> >> But as the test going to be modified by the inplace110-successors-v8.patch
> >> and the modified test (with all three latest patches applied) passes
> >> reliably in the same conditions, maybe this failure doesn't deserve a
> >> deeper exploration.
> 
> > Agreed.  Let's just wait for code review of the actual bug fix, not develop 
> > a
> > separate change to stabilize the test.  One flake in three weeks is low 
> > enough
> > to make that okay.
> 
> It's now up to three similar failures in the past ten days

> Is it time to worry yet?  If this were HEAD only, I'd not be too
> concerned; but two of these three are on allegedly-stable branches.
> And we have releases coming up fast.

I don't know; neither decision feels terrible to me.  A bug fix that would
address both the data corruption causes and those buildfarm failures has been
awaiting review on this thread for 77 days.  The data corruption causes are
more problematic than 0.03% of buildfarm runs getting noise failures.  Two
wrongs don't make a right, but a commit masking that level of buildfarm noise
also feels like sending the wrong message.




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-28 Thread Tom Lane
David Rowley  writes:
> I ended up fixing that another way as the above seems to be casting
> away the const for those variables. Instead, I changed the signature
> of the function to:
> static void get_memory_context_name_and_ident(MemoryContext context,
> const char **const name,  const char **const ident);
> which I think takes into account for the call site variables being
> defined as "const char *".

I did not check the history to see quite what happened here,
but Coverity thinks the end result is rather confused,
and I agree:

*** CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/mcxtfuncs.c: 58 in 
get_memory_context_name_and_ident()
52  *ident = context->ident;
53 
54  /*
55   * To be consistent with logging output, we label dynahash contexts with
56   * just the hash table name as with MemoryContextStatsPrint().
57   */
>>> CID 1615190:  Null pointer dereferences  (REVERSE_INULL)
>>> Null-checking "ident" suggests that it may be null, but it has already 
>>> been dereferenced on all paths leading to the check.
58  if (ident && strcmp(*name, "dynahash") == 0)
59  {
60  *name = *ident;
61  *ident = NULL;
62  }
63 }

It is not clear to me exactly which of these pointers should be
presumed to be possibly-null, but certainly testing ident after
storing through it is pretty pointless.  Maybe what was intended
was

-   if (ident && strcmp(*name, "dynahash") == 0)
+   if (*name && strcmp(*name, "dynahash") == 0)

?

regards, tom lane




Re: race condition in pg_class

2024-07-28 Thread Tom Lane
Noah Misch  writes:
> On Sat, Jul 20, 2024 at 11:00:00AM +0300, Alexander Lakhin wrote:
>> A recent buildfarm test failure [1] showed that the
>> intra-grant-inplace-db.spec test added with 0844b3968 may fail
>> on a slow machine

>> But as the test going to be modified by the inplace110-successors-v8.patch
>> and the modified test (with all three latest patches applied) passes
>> reliably in the same conditions, maybe this failure doesn't deserve a
>> deeper exploration.

> Agreed.  Let's just wait for code review of the actual bug fix, not develop a
> separate change to stabilize the test.  One flake in three weeks is low enough
> to make that okay.

It's now up to three similar failures in the past ten days: in
addition to

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=habu&dt=2024-07-18%2003%3A08%3A08

I see

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=urutu&dt=2024-07-22%2018%3A00%3A46

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=taipan&dt=2024-07-28%2012%3A20%3A37

Is it time to worry yet?  If this were HEAD only, I'd not be too
concerned; but two of these three are on allegedly-stable branches.
And we have releases coming up fast.

(BTW, I don't think taipan qualifies as a slow machine.)

regards, tom lane




Re: [PATCH] TODO “Allow LISTEN on patterns”

2024-07-28 Thread Alexander Cheshev
Hi Emanuel,

> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>

We have the following grammar:

notify_channel:
   ColId
   { $$ = $1; }
   | notify_channel '.' ColId
   { $$ = psprintf("%s.%s", $1, $3); }

And ColId is truncated in core scanner:

 ident = downcase_truncate_identifier(yytext, yyleng, true);

So each level is truncated independently. For this reason we observe the
behaviour which you described above.
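
A concrete illustration, assuming the default NAMEDATALEN of 64 (a 63-byte
identifier limit); both levels below are 64 bytes long, so each one is
truncated independently and raises its own NOTICE:

LISTEN aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb;
NOTICE:  identifier "aaa..." will be truncated to "aaa..."
NOTICE:  identifier "bbb..." will be truncated to "bbb..."

The stored channel name then has two 63-byte levels, 127 bytes in total.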

> Another observation, probably not strictly related to this patch itself but
> the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason why it isn't in the specs?
>

I believe it is simply because not all functionality is covered by tests.
But I have noticed a very interesting test, "channel name too long":

SELECT
pg_notify('notify_async_channel_name_too_long__','sample_message1');
ERROR:  channel name too long

But the behaviour is inconsistent with the NOTIFY command:

NOTIFY notify_async_channel_name_too_long__
NOTICE:  identifier "notify_async_channel_name_too_long__" will be truncated to ...

I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
>
Well, I believe that we can forbid overly long channel names altogether. That
provides consistent behaviour among the different ways we can send
notifications, and I agree with you that "there won't be possible to notify
it anyway". I created a patch for that and attached it to this email. In the
patch I relocated truncation from the core scanner to the parser. And since
the same core scanner is also used in PL/pgSQL, I added three lines of code
to its scanner to truncate overly long identifiers there. Here is an
example of the new behaviour:

-- Should fail. Too long channel names
NOTIFY notify_async_channel_name_too_long_._;
ERROR:  channel name too long
LISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
UNLISTEN notify_async_channel_name_too_long_%._;
ERROR:  channel name too long
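
For reference, NOTIFY already rejects such names with a hard error in
async.c; a minimal sketch of the same kind of length check, which the patch
extends to the LISTEN/UNLISTEN path (the exact errcode is an assumption):

if (strlen(channel) >= NAMEDATALEN)
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("channel name too long")));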

Regards,
Alexander Cheshev


On Sun, 21 Jul 2024 at 21:36, Emanuel Calvo <3man...@gmail.com> wrote:

>
> Hi Alexander,
>
> I did a review on the new patch version and I observed that the identifier
> passed to the LISTEN command is handled differently between outer and
> inner
> levels.
>
> When the outer level exceeds the 64 characters limitation, the outer level
> of the
> channel name is truncated, but leaves the inner levels in the channel name
> due
> that isn't parsed in the same way.
>
> Also, even if the outer level isn't truncated, it is allowed to add
> channels names
> that exceeds the allowed identifier size.
>
> It can be reproduced just by:
>
>   # LISTEN a.a.a.a.a.lot.of.levels..; -- this doesn't fail at LISTEN,
> but fails in NOTIFY due to channel name too long
>
> In the following, the outer level is truncated, but it doesn't cut out the
> inner levels. This leaves
> listening channels that cannot receive any notifications in the queue:
>
>   # LISTEN
> notify_async_channel_name_too_long.a.a.
> ...
>   NOTICE: identifier  will be truncated
>
>   # select substring(c.channel,0,66), length(c.channel) from
> pg_listening_channels() c(channel) where length(c.channel) > 64;
>   substring | notify_async_channel_name_too_long_.a
>   length    | 1393
>
>
> I guess that the expected behavior would be that if the outer level is
> truncated, the rest of the
> channel name should be ignored, as there won't be possible to notify it
> anyway.
>
> In the case of the inner levels creating a channel name too long, it may
> probably sane to just
> check the length of the entire identifier, and truncate -- ensuring that
> channel name doesn't
> end with the level separator.
>
> Another observation, probably not strictly related to this patch itself
> but the async-notify tests, is that there is no test for
> "payload too long". Probably there is a reason on why isn't in the specs?
>
>
> Regards,
>
>
> El lun, 15 jul 2024 a las 12:59, Alexander Cheshev (<
> alex.ches...@gmail.com>) escribió:
>
>> Hi Emanuel,
>>
>> Changed implementation of the function Exec_UnlistenCommit. v2 of the
>> patch contained a bug in the function Exec_UnlistenCommit (added a test case
>> for that) 

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-07-28 Thread Junwang Zhao
Hi Sutou,

On Wed, Jul 24, 2024 at 4:31 PM Sutou Kouhei  wrote:
>
> Hi,
>
> In <9172d4eb-6de0-4c6d-beab-8210b7a22...@enterprisedb.com>
>   "Re: Make COPY format extendable: Extract COPY TO format implementations" 
> on Mon, 22 Jul 2024 14:36:40 +0200,
>   Tomas Vondra  wrote:
>
> > Thanks for the summary/responses. I still think it'd be better to post a
> > summary as a separate message, not as yet another post responding to
> > someone else. If I was reading the thread, I would not have noticed this
> > is meant to be a summary. I'd even consider putting a "THREAD SUMMARY"
> > title on the first line, or something like that. Up to you, of course.
>
> It makes sense. I'll do it as a separate e-mail.
>
> > My suggestions would be to maintain this as a series of patches, making
> > incremental changes, with the "more complex" or "more experimental"
> > parts larger in the series. For example, I can imagine doing this:
> >
> > 0001 - minimal version of the patch (e.g. current v17)
> > 0002 - switch existing formats to the new interface
> > 0003 - extend the interface to add bits needed for columnar formats
> > 0004 - add DML to create/alter/drop custom implementations
> > 0005 - minimal patch with extension adding support for Arrow
> >
> > Or something like that. The idea is that we still have a coherent story
> > of what we're trying to do, and can discuss the incremental changes
> > (easier than looking at a large patch). It's even possible to commit
> > earlier parts before the later parts are quite cleanup up for commit.
> > And some changes may not even be meant for commit (e.g. the
> > extension) but as guidance / validation for the earlier parts.
>
> OK. I attach the v18 patch set:
>
> 0001: add a basic feature (Copy{From,To}Routine)
>   (same as the v17 but it's based on the current master)
> 0002: use Copy{From,To}Routine for the existing formats
>   (this may not be committed because there is a
>   profiling related concern)
> 0003: add support for specifying custom format by "COPY
>   ... WITH (format 'my-format')"
>   (this also has a test)
> 0004: export Copy{From,To}StateData
>   (but this isn't enough to implement custom COPY
>   FROM/TO handlers as an extension)
> 0005: add opaque member to Copy{From,To}StateData and export
>   some functions to read the next data and flush the buffer
>   (we can implement a PoC Apache Arrow COPY FROM/TO
>   handler as an extension with this)
>
> https://github.com/kou/pg-copy-arrow is a PoC Apache Arrow
> COPY FROM/TO handler as an extension.
>
>
> Notes:
>
> * 0002: We use "static inline" and "constant argument" for
>   optimization.
> * 0002: This hides NextCopyFromRawFields() in a public
>   header because it's not used in PostgreSQL and we want to
>   use "static inline" for it. If it's a problem, we can keep
>   it and create an internal function for "static inline".
> * 0003: We use "CREATE FUNCTION" to register a custom COPY
>   FROM/TO handler. It's the same approach as tablesample.
> * 0004 and 0005: We could mix them but this patch set splits
>   them for ease of review. 0004 just moves the existing
>   code. It doesn't change the existing code.
> * PoC: I provide it as a separate repository instead of a
>   patch because an extension exists as a separate project
>   in general. If it's a problem, I can provide it as a patch
>   for contrib/.
> * This patch set still has minimal Copy{From,To}Routine. For
>   example, custom COPY FROM/TO handlers can't process their
>   own options with this patch set. We may add more callbacks
>   to Copy{From,To}Routine later based on real world use-cases.
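
To make the tablesample analogy in those notes concrete, registering and
using a custom format might look roughly like this; the copy_handler
pseudo-type and the handler name are assumptions based on the patch
description, not the final API:

-- Hypothetical registration, mirroring the tablesample-handler approach:
CREATE FUNCTION my_format(internal)
    RETURNS copy_handler
    AS 'MODULE_PATHNAME', 'my_format'
    LANGUAGE C STRICT;

COPY my_table TO '/dev/null' WITH (format 'my_format');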
>
> > Unfortunately, there's not much information about what exactly the tests
> > did, context (hardware, ...). So I don't know, really. But if you share
> > enough information on how to reproduce this, I'm willing to take a look
> > and investigate.
>
> Thanks. Here is related information based on the past
> e-mails from Michael:
>
> * Use -O2 for optimization build flag
>   ("meson setup --buildtype=release" may be used)
> * Use tmpfs for PGDATA
> * Disable fsync
> * Run on scissors (what is "scissors" in this context...?)
>   
> https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c
> * Unlogged table may be used
> * Use a table that has 30 integer columns (*1)
> * Use 5M rows (*2)
> * Use '/dev/null' for COPY TO (*3)
> * Use blackhole_am for COPY FROM (*4)
>   https://gi

Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin

2024-07-28 Thread Masahiko Sawada
On Fri, Jul 26, 2024 at 1:27 PM Melanie Plageman
 wrote:
>
> On Mon, Jul 22, 2024 at 9:26 PM Masahiko Sawada  wrote:
> >
> > +   CREATE TABLE ${table1}(col1 int)
> > +   WITH (autovacuum_enabled=false, fillfactor=10);
> > +   INSERT INTO $table1 VALUES(7);
> > +   INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
> > +   CREATE INDEX on ${table1}(col1);
> > +   UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
> > +   INSERT INTO $table1 VALUES(7);
> >
> > These queries make sense to me; these make the radix tree wide and use
> > more nodes, instead of fattening leaf nodes (i.e. the offset bitmap).
> > The $table1 has 18182 blocks and the statistics of radix tree shows:
> >
> > max_val = 65535
> > num_keys = 18182
> > height = 1, n4 = 0, n16 = 1, n32 = 0, n64 = 0, n256 = 72, leaves = 18182
> >
> > Which means that the height of the tree is 2 and we use the maximum
> > size node for all nodes except for 1 node.
>
> Do you have some kind of tool that prints this out for you? That would
> be really handy.

You can add '#define RT_DEBUG' for the radix tree used in TidStore and
then call RT_STATS (e.g., local_ts_stats()).
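
Roughly, that is (a sketch; the define must be visible where tidstore.c
instantiates the radix tree template):

/* In tidstore.c, before the template include: */
#define RT_DEBUG
#define RT_PREFIX local_ts
/* ... the other RT_* defines ... */
#include "lib/radixtree.h"

/* Then, wherever a local TidStore's tree is in scope: */
local_ts_stats(tree);  /* prints max_val, num_keys, height, n4/n16/n32/n64/n256 */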

>
> > I don't have any great idea to substantially reduce the total number
> > of tuples in the $table1. Probably we can use DELETE instead of UPDATE
> > to make garbage tuples (although I'm not sure it's okay for this
> > test). Which reduces the amount of WAL records from 11MB to 4MB and
> > would reduce the time to catch up. But I'm not sure how much it would
> > help. There might be ideas to trigger a two-round index vacuum with
> > fewer tuples but if the tests are too optimized for the current
> > TidStore, we will have to re-adjust them if the TidStore changes in
> > the future. So I think it's better and reliable to allow
> > maintenance_work_mem to be a lower value or use injection points
> > somehow.
>
> I think we can make improvements in overall time on master and 17 with
> the examples John provided later in the thread. However, I realized
> you are right about using a DELETE instead of an UPDATE. At some point
> in my development, I needed the UPDATE to satisfy some other aspect of
> the test. But that is no longer true. A DELETE works just as well as
> an UPDATE WRT the dead items and, as you point out, much less WAL is
> created and replay is much faster.
>
> I also realized I forgot to add 043_vacuum_horizon_floor.pl to
> src/test/recovery/meson.build in 16. I will post a patch here this
> weekend which changes the UPDATE to a DELETE in 14-16 (sped up the
> test by about 20% for me locally) and adds 043_vacuum_horizon_floor.pl
> to src/test/recovery/meson.build in 16. I'll plan to push it on Monday
> to save myself any weekend buildfarm embarrassment.
>
> As for 17 and master, I'm going to try out John's examples and see if
> it seems like it will be fast enough to commit to 17/master without
> lowering the maintenance_work_mem lower bound.

+1. Thanks.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_attribute.atttypmod for interval type

2024-07-28 Thread Chapman Flack
On 07/27/24 00:32, Tom Lane wrote:
> Interval typmods include a fractional-seconds-precision field as well
> as a bitmask indicating the allowed interval fields (per the SQL
> standard's weird syntax such as INTERVAL DAY TO SECOND).  Looking at
> the source code for intervaltypmodout() might be helpful:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/utils/adt/timestamp.c;h=69fe7860ede062fc8be42e7545b35e69c3e068c4;hb=HEAD#l1136

Also, for this kind of question, an overview of a type modifier's
contents can be found in the javadoc for the WIP PL/Java 1.7, which is
intended to model such things accurately.[0]

The model is aimed at the logical level, that is, to represent
what information is present in the typmod, the precise semantics, what
combinations are allowed/disallowed, and so on, but not the way PostgreSQL
physically packs the bits. So, for this case, what you would find there
is essentially what Tom already said, about what's logically present; it
doesn't save you the effort of looking in the PostgreSQL source if you want
to independently implement unpacking the bits.
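
For instance, without unpacking any bits, format_type() will render an
interval typmod back into the standard syntax:

CREATE TABLE t (d interval day to second(3));

SELECT atttypmod, format_type(atttypid, atttypmod)
  FROM pg_attribute
 WHERE attrelid = 't'::regclass AND attname = 'd';
-- atttypmod comes back as an opaque packed integer, while format_type()
-- renders it as "interval day to second(3)"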

For possible future typmod questions, it may serve as a quick way to
get that kind of logical-level description at moments when Tom is away
from the keyboard.

Regards,
-Chap


[0]
https://tada.github.io/pljava/preview1.7/pljava-api/apidocs/org.postgresql.pljava/org/postgresql/pljava/adt/Timespan.Interval.Modifier.html

I just noticed a nit in that javadoc: it says the field combination
must be "one of the named constants in this interface" but you don't
find them in the Interval.Modifier interface; they're in the containing
interface Interval itself.




Re: Pluggable cumulative statistics

2024-07-28 Thread Michael Paquier
On Sat, Jul 27, 2024 at 03:49:42PM +0200, Dmitry Dolgov wrote:
> Agree, looks good. I've tried to quickly sketch out such a fixed
> statistic for another extension, and everything was fine and pretty
> straightforward.

That's my hope.  Thanks a lot for the feedback.

> One question, why don't you use
> pgstat_get_custom_shmem_data & pgstat_get_custom_snapshot_data outside
> of the injection points code? There seems to be a couple of possible
> places in pgstats itself.

Because these two helper routines are only able to fetch the fixed
data area in the snapshot and the control shmem structures for the
custom kinds, not the in-core ones.  We could, but the current code is
OK as well.  My point was just to ease the pluggability effort.
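
For example, an extension with a fixed-numbered custom kind might fetch its
shared area like this (a sketch; the kind ID and struct are hypothetical,
assuming the helper takes the custom kind ID as described above):

/* Hypothetical custom stats kind; the ID and struct are illustrative only. */
#define PGSTAT_KIND_MY_EXT  130

typedef struct MyExtStats
{
    PgStat_Counter calls;
} MyExtStats;

static void
my_ext_count_call(void)
{
    MyExtStats *stats =
        (MyExtStats *) pgstat_get_custom_shmem_data(PGSTAT_KIND_MY_EXT);

    stats->calls++;     /* locking elided for brevity */
}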

I would like to apply this new infrastructure stuff and move on to the
problems related to the scalability of pg_stat_statements.  So, are
there any objections to all that?
--
Michael




Re: Fix overflow in pg_size_pretty

2024-07-28 Thread David Rowley
On Sun, 28 Jul 2024 at 16:30, Joseph Koshakow  wrote:
> Attached is an updated patch with your approach. I removed the 0 from
> the negative case because I think it was unnecessary, but happy to add
> it back in if I missed something.

I made a few adjustments and pushed this.  I did keep the 0 - part as
some compilers don't seem to like not having the 0, e.g. MSVC gives:

../src/backend/utils/adt/dbsize.c(578): warning C4146: unary minus
operator applied to unsigned type, result still unsigned
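
For anyone following along, the shape of this kind of fix is the usual
overflow-safe absolute value (a sketch; the helper name is mine):

#include <stdint.h>

/* Negating INT64_MIN as a signed value is undefined behaviour, so negate
 * in unsigned arithmetic; the "0 -" form sidesteps MSVC warning C4146
 * about unary minus applied to an unsigned type. */
static uint64_t
u64_abs(int64_t v)
{
    return (v < 0) ? 0 - (uint64_t) v : (uint64_t) v;
}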

I thought a bit about if it was really worth keeping the regression
test or not and in the end decided it was likely worthwhile keeping
it, so I expanded it slightly to cover both PG_INT64_MIN and
PG_INT64_MAX values. It looks slightly less like we're earmarking the
fact that there was a bug that way, and also seems to be of some
additional value.

PG15 did see quite a significant rewrite of the pg_size_pretty code.
The bug does still exist in PG14 and earlier, but on looking at what
it would take to fix it there I got a bit unexcited at the risk to
reward ratio of adjusting that code and just left it alone. I've
backpatched only as far as PG15. I'm sure someone else will feel I
should have done something else there, but that's the judgement call I
made.

Thanks for the patch.

David




Re: Protocol question regarding Portal vs Cursor

2024-07-28 Thread Dave Cramer
On Sat, 27 Jul 2024 at 19:06, Tatsuo Ishii  wrote:

> > Yes, sorry, I should have said one cannot create a WITH HOLD portal
> > using the BIND command.
>
> Ok.
>
> It would be possible to add a new parameter to the BIND command to
> create such a portal. But it needs some changes to the existing
> protocol definition and requires a protocol version bump, which is a
> major pain.
>

I'm trying to add WITH HOLD to the JDBC driver, and currently I would have to:
1) rewrite the query
2) issue a new query ... declare .. and bind variables to that statement
3) execute fetch

vs

1) bind variables to the statement
2) execute fetch

The second can be done much lower in the code.
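
Concretely, option 1 means the driver must rewrite the statement and emit
something like this on the wire (cursor name hypothetical):

-- Rewrite of "SELECT * FROM t WHERE id = $1" into a holdable cursor:
DECLARE jdbc_c1 CURSOR WITH HOLD FOR SELECT * FROM t WHERE id = $1;
FETCH FORWARD 50 FROM jdbc_c1;
CLOSE jdbc_c1;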

However, as you mentioned, this would require a new protocol version, which
is unlikely to happen.

Dave


Re: POC, WIP: OR-clause support for indexes

2024-07-28 Thread Alena Rybakina

On 27.07.2024 13:56, Alexander Korotkov wrote:

On Thu, Jul 25, 2024 at 5:04 PM Alena Rybakina
  wrote:

To be honest, I have found a big problem in this patch - we try to perform the
transformation every time we examine a column:

for (indexcol = 0; indexcol < index->nkeycolumns; indexcol++)
{
    ...
}

I have fixed it and moved the transformation before going through the loop.

What makes you think there is a problem?


To be honest, I was bothered by the fact that we need to go through
expressions several times even though they obviously will not match under
the other conditions.
Just yesterday I thought it would be worthwhile to create a list of
candidates: expressions that did not fit because the column did not
match the index (!match_index_to_operand(nconst_expr, indexcol, index)).


Another problem, related to the first one, is that the boolexpr could
contain expressions referring to different operands, for example both x
and y. That is, the optimal "ANY" expression may not be used, because the
expression with x may come earlier and the loop may end before reaching
the others.


alena@postgres=# create table b (x int, y int);
CREATE TABLE
alena@postgres=# insert into b select id, id from generate_series(1,1000) as id;
INSERT 0 1000
alena@postgres=# create index x_idx on b(x);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where y =3 or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((y = 3) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9 or y=1;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..32.50 rows=7 width=8)
   Filter: ((x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9) OR (y = 1))
(2 rows)
alena@postgres=# explain select * from b where x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                           QUERY PLAN
-----------------------------------------------------------------
 Index Scan using x_idx on b  (cost=0.28..12.75 rows=6 width=8)
   Index Cond: (x = ANY ('{4,5,6,7,8,9}'::integer[]))
(2 rows)

Furthermore, expressions can be stored in a different order.
For example, an "AND" expr may come first, followed by a group of "OR"
exprs that we could convert to an "ANY" expr, but we won't do so because
we exit the loop early, according to this condition:

if (!IsA(sub_rinfo->clause, OpExpr))
    return NULL;

or it may occur due to other conditions.

alena@postgres=# create index x_y_idx on b(x,y);
CREATE INDEX
alena@postgres=# analyze;
ANALYZE
alena@postgres=# explain select * from b where (x = 2 and y =3) or x =4 or x=5 or x=6 or x = 7 or x=8 or x=9;
                                             QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=0.00..35.00 rows=6 width=8)
   Filter: (((x = 2) AND (y = 3)) OR (x = 4) OR (x = 5) OR (x = 6) OR (x = 7) OR (x = 8) OR (x = 9))
(2 rows)

For these reasons, I tried to keep both transformations together for each
column and to analyze for each expr separately which method would be
optimal.



Do you have a test case
illustrating a slow planning time?

No, I didn't have time to measure it, sorry about that. I'll do it.

When v27 performs the transformation for a particular column, it simply
stops at the first unmatched OR entry.  So,
match_orclause_to_indexcol() examines just the first OR entry for all
the columns except at most one.  So, the check
match_orclause_to_indexcol() does is not much slower than what the other
match_*_to_indexcol() functions do.

I actually think this could help performance in many cases, not hurt
it.  At least we get rid of O(n^2) complexity over the number of OR
entries, which could be very many.


I agree with you that there is an overhead and your patch fixes this
problem, but the optimizer needs a good ordering of expressions in order
to apply them.


I think we can try to move the transformation to another place where
there is already a loop pass, save the two options ("OR" expr and
"ANY" expr) in one place (through BoolExpr, like the find_duplicate_ors
function does), and teach the optimizer to determine which option is
better, for example as match_orclause_to_indexcol() does now.


What do you think about it?

--
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: improve ssl error code, 2147483650

2024-07-28 Thread Peter Eisentraut

On 25.07.24 11:36, Daniel Gustafsson wrote:

On 24 Jul 2024, at 15:32, Peter Eisentraut  wrote:

On 25.06.24 16:21, Tom Lane wrote:

Peter Eisentraut  writes:

On 21.06.24 16:53, Tom Lane wrote:

Most of libpq gets at strerror_r via SOCK_STRERROR for Windows
portability.  Is that relevant here?

Looking inside the OpenSSL code, it makes no effort to translate
between winsock error codes and standard error codes, so I don't think
our workaround/replacement code needs to do that either.

Fair enough.

Btw., our source code comments say something like
"ERR_reason_error_string randomly refuses to map system errno values."
The reason it doesn't is exactly that it can't do it while maintaining
thread-safety.

Ah.  Do you want to improve that comment while you're at it?


Here is a patch that fixes the strerror() call and updates the comments a bit.


LGTM.


This ought to be backpatched like the original fix; ideally for the next minor 
releases in about two weeks.


Agreed.


done




