RE: Synchronizing slots from primary to standby

2024-02-29 Thread Zhijie Hou (Fujitsu)
On Friday, March 1, 2024 2:11 PM Masahiko Sawada  wrote:
> 
> On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu)
>  wrote:
> > >
> > >
> > > Attach the V102 patch set which addressed Amit and Shveta's comments.
> > > Thanks Shveta for helping addressing the comments off-list.
> >
> > The cfbot reported a compile warning; here is the new version patch
> > which fixed it. Also removed some outdated comments in this version.
> >
> 
> I've reviewed the v102-0001 patch. Here are some comments:

Thanks for the comments!

> 
> ---
> I got a compiler warning:
> 
> walsender.c:1829:6: warning: variable 'wait_event' is used uninitialized
> whenever '&&' condition is false [-Wsometimes-uninitialized]
> if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
> ^~~~
> walsender.c:1871:7: note: uninitialized use occurs here
> if (wait_event ==
> WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL)
> ^~
> walsender.c:1829:6: note: remove the '&&' if its condition is always true
> if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
> ^~~
> walsender.c:1818:20: note: initialize the variable 'wait_event' to silence
> this warning
> uint32  wait_event;
>   ^
>    = 0
> 1 warning generated.


Thanks for reporting, it was fixed in V102_2.

> 
> ---
> +void
> +assign_standby_slot_names(const char *newval, void *extra) {
> +List  *standby_slots;
> +MemoryContext oldcxt;
> +char  *standby_slot_names_cpy = extra;
> +
> 
> Given that the newval and extra have the same data (standby_slot_names
> value), why do we not use newval instead? I think that if we use
> newval, we don't need to guc_strdup() in check_standby_slot_names(),
> we might need to do list_copy_deep() instead, though. It's not clear
> to me as there is no comment.

I think SplitIdentifierString will modify the passed-in string, so we'd better
not pass newval to it; otherwise the stored GUC string (standby_slot_names)
would be changed. I can see we are doing a similar thing in other GUC
check/assign functions as well (check_wal_consistency_checking/
assign_wal_consistency_checking, check_createrole_self_grant/
assign_createrole_self_grant, ...).
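
A minimal sketch of that pattern, mirroring the cited check hooks (error
handling abbreviated; the pstrdup() copy is what protects the stored GUC
string):

char   *rawname = pstrdup(newval);  /* scratch copy for SplitIdentifierString */
List   *namelist;

if (!SplitIdentifierString(rawname, ',', &namelist))
{
    /* SplitIdentifierString scribbles NULs into rawname, never into newval */
    GUC_check_errdetail("List syntax is invalid.");
    pfree(rawname);
    list_free(namelist);
    return false;
}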

> ---
> +/*
> + * Switch to the memory context under which GUC variables are
> allocated
> + * (GUCMemoryContext).
> + */
> +oldcxt =
> MemoryContextSwitchTo(GetMemoryChunkContext(standby_slot_names_cpy
> ));
> +standby_slot_names_list = list_copy(standby_slots);
> +MemoryContextSwitchTo(oldcxt);
> 
> Why do we not explicitly switch to GUCMemoryContext?

I think it's because the GUCMemoryContext is not exposed.

Best Regards,
Hou zj 


Re: "type with xxxx does not exist" when doing ExecMemoize()

2024-02-29 Thread Tender Wang
Hi,

While thinking about how to add a test case for the v5 patch, I wanted to
test whether the v5 patch has a memory leak.
This thread [1] provided a way to reproduce the memory leak, so I used it
to test the v5 patch. I did not find a memory leak with the
v5 patch.

But I found another interesting issue. When I changed the whereClause in [1],
the query reported the error below:

"ERROR: could not find memoization table entry"

the query:
EXPLAIN analyze
select sum(q.id_table1)
from (
SELECT t2.*
FROM table1 t1
JOIN table2 t2
ON (t2.id_table1 + t2.id_table1) = t1.id) q;

But with the v5 patch, it didn't report an error.

I guess it is for the same reason: the data in probeslot was reset in the
hash function.

I debugged the above query and got this:
before
(gdb) p *(DatumGetNumeric(mstate->probeslot->tts_values[0]))
$1 = {vl_len_ = 48, choice = {n_header = 32770, n_long = {n_sign_dscale =
32770, n_weight = 60, n_data = 0x564632ebd708}, n_short = {n_header =
32770, n_data = 0x564632ebd706}}}
after
(gdb) p *(DatumGetNumeric(mstate->probeslot->tts_values[0]))
$2 = {vl_len_ = 264, choice = {n_header = 32639, n_long = {n_sign_dscale =
32639, n_weight = 32639, n_data = 0x564632ebd6a8}, n_short = {n_header =
32639, n_data = 0x564632ebd6a6}}}

So after calling ResetExprContext() in the hash function, the data in
probeslot is corrupted. It is unclear what error will occur when executing on
corrupted data.
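
To illustrate the hazard in isolation: a by-reference Datum (such as a
numeric) points into a memory context, and resetting that context leaves the
Datum dangling. A minimal hypothetical sketch, not taken from the patch:

MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
                                          "demo", ALLOCSET_DEFAULT_SIZES);
MemoryContext old = MemoryContextSwitchTo(cxt);
Datum   d = DirectFunctionCall3(numeric_in,
                                CStringGetDatum("42"),
                                ObjectIdGetDatum(InvalidOid),
                                Int32GetDatum(-1));

MemoryContextSwitchTo(old);
MemoryContextReset(cxt);    /* DatumGetNumeric(d) now reads freed memory */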

During debugging, I learned that numeric_add doesn't have a type check like
the range type does, so the above query will not report "type with xxxx does
not exist".

And I realized that the test case added by Andrei Lepikhov in v3 is right,
so in the v6 patch I added his test case. Thanks a lot.

I think the v6 patch is complete now.



[1]
https://www.postgresql.org/message-id/83281eed63c74e4f940317186372abfd%40cft.ru

-- 
Tender Wang
OpenPie:  https://en.openpie.com/


v6-0001-Fix-wrong-used-ResetExprContext-in-ExecMemoize.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 01, 2024 at 03:22:55PM +1100, Peter Smith wrote:
> Here are some review comments for v102-0001.
> 
> ==
> doc/src/sgml/config.sgml
> 
> 1.
> +   
> +Lists the streaming replication standby server slot names that 
> logical
> +WAL sender processes will wait for. Logical WAL sender processes will
> +send decoded changes to plugins only after the specified replication
> +slots confirm receiving WAL. This guarantees that logical replication
> +slots with failover enabled do not consume changes until those 
> changes
> +are received and flushed to corresponding physical standbys. If a
> +logical replication connection is meant to switch to a physical 
> standby
> +after the standby is promoted, the physical replication slot for the
> +standby should be listed here. Note that logical replication will not
> +proceed if the slots specified in the standby_slot_names do
> not exist or
> +are invalidated.
> +   
> 
> Should this also mention the effect this GUC has on those 2 SQL
> functions? E.g. Commit message says:
> 
> Additionally, The SQL functions pg_logical_slot_get_changes and
> pg_replication_slot_advance are modified to wait for the replication
> slots mentioned in 'standby_slot_names' to catch up before returning.

I think that's also true for all the ones that rely on
pg_logical_slot_get_changes_guts(), meaning:

- pg_logical_slot_get_changes
- pg_logical_slot_peek_changes
- pg_logical_slot_get_binary_changes
- pg_logical_slot_peek_binary_changes

Not sure it's worth mentioning the "binary" ones though, as their docs mention
that they behave like their "non-binary" counterparts.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improve readability by using designated initializers when possible

2024-02-29 Thread jian he
On Fri, Mar 1, 2024 at 12:08 PM Michael Paquier  wrote:
>
> On Thu, Feb 29, 2024 at 12:41:38PM +0100, Peter Eisentraut wrote:
> > On 27.02.24 08:57, Alvaro Herrera wrote:
> >> On 2024-Feb-27, Michael Paquier wrote:
> >>> These would cause compilation failures.  Saying that, this is a very
> >>> nice cleanup, so I've fixed these and applied the patch after checking
> >>> that the one-one replacements were correct.
> >>
> >> Oh, I thought we were going to get rid of ObjectClass altogether -- I
> >> mean, have getObjectClass() return ObjectAddress->classId, and then
> >> define the OCLASS values for each catalog OID [... tries to ...]  But
> >> this(*) doesn't work for two reasons:
> >
> > I have long wondered what the point of ObjectClass is.  I find the extra
> > layer of redirection, which is used only in small parts of the code, and the
> > similarity to ObjectType confusing.  I happened to have a draft patch for
> > its removal lying around, so I'll show it here, rebased over what has
> > already been done in this thread.
>
> The elimination of getObjectClass() seems like a good end goal IMO, so
> the direction of the patch is interesting.  Would object_type_map and
> ObjectProperty follow the same idea of relying on the catalogs OID
> instead of the OBJECT_*?
>
> Note that there are still two dependencies to getObjectClass() in
> event_trigger.c and dependency.c.
> --

I refactored dependency.c, event_trigger.c based on
0001-Remove-ObjectClass.patch.
dependency.c already includes a bunch of catalog header files, but
event_trigger.c doesn't.
Now we need to include around 30 header files in event_trigger.c;
I'm not sure if that's OK or not.

0001-Remove-ObjectClass.patch
Do we also need to refactor the comment below in getObjectIdentityParts?
/*
* There's intentionally no default: case here; we want the
* compiler to warn if a new OCLASS hasn't been handled above.
*/
since OCLASS is removed.

`bool EventTriggerSupportsObjectClass(ObjectClass objclass)`
change to
`bool EventTriggerSupportsObjectClass(Oid classId)`

I think the function name should also be refactored.
I'm not sure of the new function name, so I didn't change it.
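
For instance, one possible shape (the name and the exact set of excluded
catalogs below are just a sketch, not taken from the patch):

bool
EventTriggerSupportsObject(Oid classId)
{
    switch (classId)
    {
        case DatabaseRelationId:
        case TableSpaceRelationId:
        case AuthIdRelationId:
        case AuthMemRelationId:
        case ParameterAclRelationId:
            /* no support for global objects */
            return false;
        default:
            return true;
    }
}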


v1-0001-Remove-ObjectClass-refactor-dependency.c-event.no-cfbot
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I wrote:

> Secondly, I thought about my recent work to skip checking if we first
> need to create a root node, and that has a harmless (for vacuum at
> least) but slightly untidy behavior: When RT_SET is first called, and
> the key is bigger than 255, new nodes will go on top of the root node.
> These have chunk '0'. If all subsequent keys are big enough, the
> original root node will stay empty. If all keys are deleted, there will
> be a chain of empty nodes remaining. Again, I believe this is
> harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> on this, but likely not today.

This turns out to be a lot trickier than it looked, so it seems best
to allow a trivial amount of waste, as long as it's documented
somewhere. It also wouldn't be terrible to re-add those branches,
since they're highly predictable.

I just noticed there are a lot of unused function parameters
(referring to parent slots) left over from a few weeks ago. Those are
removed in v64-0009. 0010 makes the obvious name change in those
remaining to "parent_slot". 0011 is a simplification in two places
regarding reserving slots. This should be a bit easier to read and
possibly makes it easier on the compiler.


v64-ART.tar.gz
Description: application/gzip


Re: Injection points: some tools to wait and wake

2024-02-29 Thread Bertrand Drouvot
Hi,

On Fri, Mar 01, 2024 at 11:02:01AM +0900, Michael Paquier wrote:
> On Thu, Feb 29, 2024 at 06:19:58PM +0500, Andrey M. Borodin wrote:
> > Works fine for me. Thanks!
> 
> + "timed out waiting for the backend type to wait for the injection 
> point name";
> 
> The log should provide some context about the caller failing, meaning
> that the backend type and the injection point name should be mentioned
> in these logs to help in debugging issues.

Yeah, done in v3 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From ea354b9d0aeee34b9ba315735cce1e2ff1cb6532 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 29 Feb 2024 08:31:28 +
Subject: [PATCH v3] Adding wait_for_injection_point helper

---
 src/test/perl/PostgreSQL/Test/Cluster.pm  | 36 +++
 .../recovery/t/041_checkpoint_at_promote.pl   |  8 +
 2 files changed, 37 insertions(+), 7 deletions(-)
  65.3% src/test/perl/PostgreSQL/Test/
  34.6% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 44c1bb5afd..8d92eb3858 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2742,6 +2742,42 @@ sub lsn
 
 =pod
 
+=item $node->wait_for_injection_point(injection_name, backend_type)
+
+Wait for the backend_type to wait for the injection point name.
+
+=cut
+
+sub wait_for_injection_point
+{
+	my ($self, $injection_name, $backend_type) = @_;
+	my $die_message;
+
+	if (defined($backend_type))
+	{
+		$backend_type = qq('$backend_type');
+		$die_message = "the backend type $backend_type";
+	}
+	else
+	{
+		$backend_type = 'backend_type';
+		$die_message = 'one backend';
+
+	}
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT count(*) > 0 FROM pg_stat_activity
+		WHERE backend_type = $backend_type AND wait_event = '$injection_name'
+	])
+	  or die
+	  qq(timed out waiting for $die_message to wait for the injection point '$injection_name');
+
+	return;
+}
+
+=pod
+
 =item $node->wait_for_catchup(standby_name, mode, target_lsn)
 
 Wait for the replication connection with application_name standby_name until
diff --git a/src/test/recovery/t/041_checkpoint_at_promote.pl b/src/test/recovery/t/041_checkpoint_at_promote.pl
index 47381a2c82..3d6faabc0b 100644
--- a/src/test/recovery/t/041_checkpoint_at_promote.pl
+++ b/src/test/recovery/t/041_checkpoint_at_promote.pl
@@ -79,13 +79,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 # Wait until the checkpointer is in the middle of the restart point
 # processing, relying on the custom wait event generated in the
 # wait callback used in the injection point previously attached.
-ok( $node_standby->poll_query_until(
-		'postgres',
-		qq[SELECT count(*) FROM pg_stat_activity
-   WHERE backend_type = 'checkpointer' AND wait_event = 'CreateRestartPoint' ;],
-		'1'),
-	'checkpointer is waiting in restart point'
-) or die "Timed out while waiting for checkpointer to run restart point";
+$node_standby->wait_for_injection_point('CreateRestartPoint','checkpointer');
 
 # Check the logs that the restart point has started on standby.  This is
 # optional, but let's be sure.
-- 
2.34.1



Re: Synchronizing slots from primary to standby

2024-02-29 Thread Amit Kapila
On Fri, Mar 1, 2024 at 9:53 AM Peter Smith  wrote:
>
> Here are some review comments for v102-0001.
>
>
> 7.
> + /*
> + * Return false if not all the standbys have caught up to the specified WAL
> + * location.
> + */
> + if (caught_up_slot_num != list_length(standby_slot_names_list))
> + return false;
>
> Somehow it seems complicated to have a counter of the slots as you
> process them, and then compare that counter to the list_length to determine
> if one of them was skipped.
>
> Probably simpler just to set a 'skipped' flag whenever you do
> 'continue'...
>

The other thing is: do we need to continue when we find some slot that
can't be processed? If not, then we can simply set a boolean flag,
break the loop, and return false if the boolean is set after releasing
the LWLock. The other way is to simply release the lock whenever we need
to skip a slot and just return false.
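
A rough sketch of that second, simpler shape (names borrowed from the quoted
patch; the exact skip conditions are elided):

ListCell   *lc;

LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
foreach(lc, standby_slot_names_list)
{
    char   *name = (char *) lfirst(lc);
    ReplicationSlot *slot = SearchNamedReplicationSlot(name, false);

    if (slot == NULL || slot->data.restart_lsn < wait_for_lsn)
    {
        /* this standby cannot be waited on; no need to check the rest */
        LWLockRelease(ReplicationSlotControlLock);
        return false;
    }
}
LWLockRelease(ReplicationSlotControlLock);
return true;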

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-29 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Fri, 1 Mar 2024 14:31:38 +0900,
  Michael Paquier  wrote:

> I guess so.  It does not make much of a difference, though.  The thing
> is that the dispatch caused by the custom callbacks called for each
> row is noticeable in any profiles I'm taking (not that much in the
> worst-case scenarios, still a few percents), meaning that this impacts
> the performance for all the in-core formats (text, csv, binary) as
> long as we refactor text/csv/binary to use the routines of copyapi.h.
> I don't really see a way forward, except if we don't dispatch the
> in-core formats to not impact the default cases.  That makes the code
> a bit less elegant, but equally efficient for the existing formats.

It's an option based on your profile result, but your
execution result also shows that v15 is faster than HEAD [1]:

> I am getting faster runtimes with v15 (6232ms in average)
> vs HEAD (6550ms) at 5M rows with COPY TO

[1] 
https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889

I think that a faster runtime is more beneficial to users than a mysterious
profile. So I think that we can merge v15 into
master.


Thanks,
-- 
kou




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-29 Thread Sutou Kouhei
Hi,

In <20240222.183948.518018047578925034@clear-code.com>
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Thu, 22 Feb 2024 18:39:48 +0900 (JST),
  Sutou Kouhei  wrote:

> How about adding "is_csv" to CopyReadline() and
> CopyReadLineText() too?

I tried this in my environment. This is a change for COPY
FROM, not COPY TO, but it decreases COPY TO
performance with [1]... Hmm...

master:   697.693 msec (the best case)
v15:  576.374 msec (the best case)
v15+this: 593.559 msec (the best case)

[1] COPY (SELECT 
1::int2,2::int2,3::int2,4::int2,5::int2,6::int2,7::int2,8::int2,9::int2,10::int2,11::int2,12::int2,13::int2,14::int2,15::int2,16::int2,17::int2,18::int2,19::int2,20::int2,
 generate_series(1, 100::int4)) TO '/dev/null' \watch c=15

So I think that v15 is good.


perf result of master:

# Children  Self  Command   Shared Object  Symbol
# ........  ....  ........  .............  ......
#
31.39%14.54%  postgres  postgres   [.] CopyOneRowTo
|--17.00%--CopyOneRowTo
|  |--10.61%--FunctionCall1Coll
|  |   --8.40%--int2out
|  | |--2.58%--pg_ltoa
|  | |   --0.68%--pg_ultoa_n
|  | |--1.11%--pg_ultoa_n
|  | |--0.83%--AllocSetAlloc
|  | |--0.69%--__memcpy_avx_unaligned_erms (inlined)
|  | |--0.58%--FunctionCall1Coll
|  |  --0.55%--memcpy@plt
|  |--3.25%--appendBinaryStringInfo
|  |   --0.56%--pg_ultoa_n
|   --0.69%--CopyAttributeOutText

perf result of v15:

# Children  Self  Command   Shared Object  Symbol
# ........  ....  ........  .............  ......
#
25.60%10.47%  postgres  postgres   [.] CopyToTextOneRow
|--15.39%--CopyToTextOneRow
|  |--10.44%--FunctionCall1Coll
|  |  |--7.25%--int2out
|  |  |  |--2.60%--pg_ltoa
|  |  |  |   --0.71%--pg_ultoa_n
|  |  |  |--0.90%--FunctionCall1Coll
|  |  |  |--0.84%--pg_ultoa_n
|  |  |   --0.66%--AllocSetAlloc
|  |  |--0.79%--ExecProjectSet
|  |   --0.68%--int4out
|  |--2.50%--appendBinaryStringInfo
|   --0.53%--CopyAttributeOutText


The profiles from Michael's environment [2] showed that the
CopyOneRow() percentage was increased by v15. But here, the corresponding
CopyToTextOneRow() percentage wasn't increased
by v15; it decreased instead.

[2] 
https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz#6439e6ad574f2d47cd7220e9bfed3889

So I think that v15 doesn't have a performance regression, but
my environment isn't suitable for benchmarking...


Thanks,
-- 
kou




Re: Synchronizing slots from primary to standby

2024-02-29 Thread Peter Smith
On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada  wrote:
>
...
> +/*
> + * "*" is not accepted as in that case primary will not be able to 
> know
> + * for which all standbys to wait for. Even if we have physical slots
> + * info, there is no way to confirm whether there is any standby
> + * configured for the known physical slots.
> + */
> +if (strcmp(*newval, "*") == 0)
> +{
> +GUC_check_errdetail("\"*\" is not accepted for
> standby_slot_names");
> +return false;
> +}
>
> Why is only '*' checked separately from validate_standby_slots()? I think
> the doc doesn't mention anything about '*', and '*' cannot be used
> as a replication slot name. So even without this check, there
> might be no problem.
>

Hi, a while ago I asked this same question. See [1 #28] for the response.

--
[1] 
https://www.postgresql.org/message-id/OS0PR01MB571646B8186F6A06404BD50194BDA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Synchronizing slots from primary to standby

2024-02-29 Thread Masahiko Sawada
On Fri, Mar 1, 2024 at 12:42 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu) 
>  wrote:
> >
> >
> > Attach the V102 patch set which addressed Amit and Shveta's comments.
> > Thanks Shveta for helping addressing the comments off-list.
>
> The cfbot reported a compile warning; here is the new version patch which
> fixed it. Also removed some outdated comments in this version.
>

Thank you for updating the patch!

I've reviewed the v102-0001 patch. Here are some comments:

---
I got a compiler warning:

walsender.c:1829:6: warning: variable 'wait_event' is used
uninitialized whenever '&&' condition is false
[-Wsometimes-uninitialized]
if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
^~~~
walsender.c:1871:7: note: uninitialized use occurs here
if (wait_event == WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL)
^~
walsender.c:1829:6: note: remove the '&&' if its condition is always true
if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
^~~
walsender.c:1818:20: note: initialize the variable 'wait_event' to
silence this warning
uint32  wait_event;
  ^
   = 0
1 warning generated.

---
+void
+assign_standby_slot_names(const char *newval, void *extra)
+{
+List  *standby_slots;
+MemoryContext oldcxt;
+char  *standby_slot_names_cpy = extra;
+

Given that the newval and extra have the same data (standby_slot_names
value), why do we not use newval instead? I think that if we use
newval, we don't need to guc_strdup() in check_standby_slot_names(),
we might need to do list_copy_deep() instead, though. It's not clear
to me as there is no comment.

---
+
+standby_slot_oldest_flush_lsn = min_restart_lsn;
+

IIUC we expect that standby_slot_oldest_flush_lsn never moves
backward. If so, I think it's better to have an assertion here.
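
For instance, a minimal sketch just before the assignment (assuming the
invariant holds under the lock the patch takes at this point):

/* the oldest flushed LSN must never move backwards */
Assert(XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) ||
       min_restart_lsn >= standby_slot_oldest_flush_lsn);

standby_slot_oldest_flush_lsn = min_restart_lsn;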

---
Resetting standby_slot_names doesn't work:

=# alter system set standby_slot_names to '';
ERROR:  invalid value for parameter "standby_slot_names": 
DETAIL:  replication slot "" does not exist

---
+/*
+ * Switch to the memory context under which GUC variables are allocated
+ * (GUCMemoryContext).
+ */
+oldcxt =
MemoryContextSwitchTo(GetMemoryChunkContext(standby_slot_names_cpy));
+standby_slot_names_list = list_copy(standby_slots);
+MemoryContextSwitchTo(oldcxt);

Why do we not explicitly switch to GUCMemoryContext?

---
+if (!standby_slot_names_list)
+return true;
+

Probably "standby_slot_names_list == NIL" is more consistent with
other places. The same can be applied in WaitForStandbyConfirmation().

---
+/*
+ * Return true if all the standbys have already caught up to
the passed in
+ * WAL localtion.
+ */
+

s/localtion/location/

---
I was a bit surprised by the fact that the standby_slot_names value is
handled in a different way from the similar parameter
synchronous_standby_names. For example, the following command doesn't
work unless there is a replication slot 'slot1, slot2':

=# alter system set standby_slot_names to 'slot1, slot2';
ERROR:  invalid value for parameter "standby_slot_names": ""slot1, slot2""
DETAIL:  replication slot "slot1, slot2" does not exist

Whereas "alter system set synchronous_standby_names to stb1, stb2;"
can correctly separate the string into 'stb1' and 'stb2'.

Probably it would be okay, since this behavior of standby_slot_names is
the same as other GUC parameters that accept a comma-separated string.
But I was a bit confused the first time I used it.

---
+/*
+ * "*" is not accepted as in that case primary will not be able to know
+ * for which all standbys to wait for. Even if we have physical slots
+ * info, there is no way to confirm whether there is any standby
+ * configured for the known physical slots.
+ */
+if (strcmp(*newval, "*") == 0)
+{
+GUC_check_errdetail("\"*\" is not accepted for
standby_slot_names");
+return false;
+}

Why is only '*' checked separately from validate_standby_slots()? I think
the doc doesn't mention anything about '*', and '*' cannot be used
as a replication slot name. So even without this check, there
might be no problem.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: DOCS: Avoid using abbreviation "aka"

2024-02-29 Thread Michael Paquier
On Fri, Mar 01, 2024 at 11:08:21AM +0530, Amit Kapila wrote:
> I wanted to wait for two or three days to see if any other fixes in
> docs, typos, or cosmetic stuff are reported in this functionality then
> I can combine and push them. However, there is no harm in pushing them
> separately, so if you want to go ahead please feel free to do so.

Nah, feel free to :)
--
Michael


signature.asc
Description: PGP signature


Re: Add system identifier to backup manifest

2024-02-29 Thread Michael Paquier
On Mon, Feb 19, 2024 at 12:06:19PM +0530, Amul Sul wrote:
> Agreed, now they will have an error as _could not read file "": Is
> a directory_. But, IIUC, that's what usually happens with the dev version, and
> the extension needs to be updated for compatibility with the newer version for
> the same reason.

I was reading through the remaining pieces of the patch set, and I wonder:
are you sure that there is a need for 0002 at all?  The only reason why
get_dir_controlfile() is introduced is to be able to get the contents
of a control file with a full path to it, and not a data folder.  Now,
if I look closely, with 0002~0004 applied, the only two callers of
get_controlfile() are pg_combinebackup.c and pg_verifybackup.c.  Both
of them have access to the backup directories, which point to the
root of the data folder.  pg_combinebackup can continue to use
backup_dirs[i].  pg_verifybackup has access to the backup directory
in the context data, if I'm reading this right, so you could just use
that in verify_system_identifier().
--
Michael


signature.asc
Description: PGP signature


Re: DOCS: Avoid using abbreviation "aka"

2024-02-29 Thread Amit Kapila
On Fri, Mar 1, 2024 at 4:25 AM Michael Paquier  wrote:
>
> On Thu, Feb 29, 2024 at 02:42:08PM +0530, Amit Kapila wrote:
> > +1. LGTM as well.
>
> This has been introduced by ddd5f4f54a02, so if you wish to fix it
> yourself, please feel free.  If you'd prefer that I take care of it,
> I'm OK to do so as well.
>

I wanted to wait for two or three days to see if any other fixes in
docs, typos, or cosmetic stuff are reported in this functionality then
I can combine and push them. However, there is no harm in pushing them
separately, so if you want to go ahead please feel free to do so.

-- 
With Regards,
Amit Kapila.




Re: Make COPY format extendable: Extract COPY TO format implementations

2024-02-29 Thread Michael Paquier
On Thu, Feb 22, 2024 at 06:39:48PM +0900, Sutou Kouhei wrote:
> If so, adding the change independently on HEAD makes
> sense. But I don't know why that improves
> performance... Inlining?

I guess so.  It does not make much of a difference, though.  The thing
is that the dispatch caused by the custom callbacks called for each
row is noticeable in any profiles I'm taking (not that much in the
worst-case scenarios, still a few percents), meaning that this impacts
the performance for all the in-core formats (text, csv, binary) as
long as we refactor text/csv/binary to use the routines of copyapi.h.
I don't really see a way forward, except if we don't dispatch the
in-core formats to not impact the default cases.  That makes the code
a bit less elegant, but equally efficient for the existing formats.
--
Michael


signature.asc
Description: PGP signature


Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Jelte Fennema-Nio
On Fri, 1 Mar 2024 at 06:08, Michael Paquier  wrote:
> Mostly OK to me.  Just note that this comment is incorrect because
> pg_enc2gettext_tbl[] includes elements in the range
> [PG_SJIS,_PG_LAST_ENCODING_[  ;)

fixed
From 0d66d374697eff2b276b667ce24cf4599432dbd5 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Tue, 27 Feb 2024 12:26:10 +0100
Subject: [PATCH v9] Simplify pg_enc2gettext_tbl

Use designated initialization of pg_enc2gettext_tbl to simplify the
implementation of raw_pg_bind_textdomain_codeset. Now iteration over the
array is not needed anymore. Instead the desired element can simply be
fetched by its index.
---
 src/backend/utils/mb/mbutils.c | 24 --
 src/common/encnames.c  | 86 +-
 src/include/mb/pg_wchar.h  | 12 ++---
 3 files changed, 57 insertions(+), 65 deletions(-)

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index c13f947a827..5818e752280 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -1188,24 +1188,20 @@ static bool
 raw_pg_bind_textdomain_codeset(const char *domainname, int encoding)
 {
 	bool		elog_ok = (CurrentMemoryContext != NULL);
-	int			i;
 
-	for (i = 0; pg_enc2gettext_tbl[i].name != NULL; i++)
+	if (!PG_VALID_ENCODING(encoding) || pg_enc2gettext_tbl[encoding] == NULL)
 	{
-		if (pg_enc2gettext_tbl[i].encoding == encoding)
-		{
-			if (bind_textdomain_codeset(domainname,
-		pg_enc2gettext_tbl[i].name) != NULL)
-return true;
+		return false;
+	}
 
-			if (elog_ok)
-elog(LOG, "bind_textdomain_codeset failed");
-			else
-write_stderr("bind_textdomain_codeset failed");
+	if (bind_textdomain_codeset(domainname,
+pg_enc2gettext_tbl[encoding]) != NULL)
+		return true;
 
-			break;
-		}
-	}
+	if (elog_ok)
+		elog(LOG, "bind_textdomain_codeset failed");
+	else
+		write_stderr("bind_textdomain_codeset failed");
 
 	return false;
 }
diff --git a/src/common/encnames.c b/src/common/encnames.c
index dba6bd2c9ee..910cc2c7e5c 100644
--- a/src/common/encnames.c
+++ b/src/common/encnames.c
@@ -357,50 +357,50 @@ const pg_enc2name pg_enc2name_tbl[] =
  * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
  * --
  */
-const pg_enc2gettext pg_enc2gettext_tbl[] =
+const char *pg_enc2gettext_tbl[] =
 {
-	{PG_SQL_ASCII, "US-ASCII"},
-	{PG_UTF8, "UTF-8"},
-	{PG_LATIN1, "LATIN1"},
-	{PG_LATIN2, "LATIN2"},
-	{PG_LATIN3, "LATIN3"},
-	{PG_LATIN4, "LATIN4"},
-	{PG_ISO_8859_5, "ISO-8859-5"},
-	{PG_ISO_8859_6, "ISO_8859-6"},
-	{PG_ISO_8859_7, "ISO-8859-7"},
-	{PG_ISO_8859_8, "ISO-8859-8"},
-	{PG_LATIN5, "LATIN5"},
-	{PG_LATIN6, "LATIN6"},
-	{PG_LATIN7, "LATIN7"},
-	{PG_LATIN8, "LATIN8"},
-	{PG_LATIN9, "LATIN-9"},
-	{PG_LATIN10, "LATIN10"},
-	{PG_KOI8R, "KOI8-R"},
-	{PG_KOI8U, "KOI8-U"},
-	{PG_WIN1250, "CP1250"},
-	{PG_WIN1251, "CP1251"},
-	{PG_WIN1252, "CP1252"},
-	{PG_WIN1253, "CP1253"},
-	{PG_WIN1254, "CP1254"},
-	{PG_WIN1255, "CP1255"},
-	{PG_WIN1256, "CP1256"},
-	{PG_WIN1257, "CP1257"},
-	{PG_WIN1258, "CP1258"},
-	{PG_WIN866, "CP866"},
-	{PG_WIN874, "CP874"},
-	{PG_EUC_CN, "EUC-CN"},
-	{PG_EUC_JP, "EUC-JP"},
-	{PG_EUC_KR, "EUC-KR"},
-	{PG_EUC_TW, "EUC-TW"},
-	{PG_EUC_JIS_2004, "EUC-JP"},
-	{PG_SJIS, "SHIFT-JIS"},
-	{PG_BIG5, "BIG5"},
-	{PG_GBK, "GBK"},
-	{PG_UHC, "UHC"},
-	{PG_GB18030, "GB18030"},
-	{PG_JOHAB, "JOHAB"},
-	{PG_SHIFT_JIS_2004, "SHIFT_JISX0213"},
-	{0, NULL}
+	[PG_SQL_ASCII] = "US-ASCII",
+	[PG_UTF8] = "UTF-8",
+	[PG_MULE_INTERNAL] = NULL,
+	[PG_LATIN1] = "LATIN1",
+	[PG_LATIN2] = "LATIN2",
+	[PG_LATIN3] = "LATIN3",
+	[PG_LATIN4] = "LATIN4",
+	[PG_ISO_8859_5] = "ISO-8859-5",
+	[PG_ISO_8859_6] = "ISO_8859-6",
+	[PG_ISO_8859_7] = "ISO-8859-7",
+	[PG_ISO_8859_8] = "ISO-8859-8",
+	[PG_LATIN5] = "LATIN5",
+	[PG_LATIN6] = "LATIN6",
+	[PG_LATIN7] = "LATIN7",
+	[PG_LATIN8] = "LATIN8",
+	[PG_LATIN9] = "LATIN-9",
+	[PG_LATIN10] = "LATIN10",
+	[PG_KOI8R] = "KOI8-R",
+	[PG_KOI8U] = "KOI8-U",
+	[PG_WIN1250] = "CP1250",
+	[PG_WIN1251] = "CP1251",
+	[PG_WIN1252] = "CP1252",
+	[PG_WIN1253] = "CP1253",
+	[PG_WIN1254] = "CP1254",
+	[PG_WIN1255] = "CP1255",
+	[PG_WIN1256] = "CP1256",
+	[PG_WIN1257] = "CP1257",
+	[PG_WIN1258] = "CP1258",
+	[PG_WIN866] = "CP866",
+	[PG_WIN874] = "CP874",
+	[PG_EUC_CN] = "EUC-CN",
+	[PG_EUC_JP] = "EUC-JP",
+	[PG_EUC_KR] = "EUC-KR",
+	[PG_EUC_TW] = "EUC-TW",
+	[PG_EUC_JIS_2004] = "EUC-JP",
+	[PG_SJIS] = "SHIFT-JIS",
+	[PG_BIG5] = "BIG5",
+	[PG_GBK] = "GBK",
+	[PG_UHC] = "UHC",
+	[PG_GB18030] = "GB18030",
+	[PG_JOHAB] = "JOHAB",
+	[PG_SHIFT_JIS_2004] = "SHIFT_JISX0213",
 };
 
 
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index fd91aefbcb7..cc804cc0f5b 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -225,7 +225,9 @@ typedef unsigned int pg_wchar;
  * PostgreSQL encoding identifiers
  *
  * WARNING: If you add some encoding don't forget to update
- *			the pg_enc2name_tbl[] array (in src/common/encnames.c) and
+ *			the pg_enc2name_tbl[] array (in 

Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Jelte Fennema-Nio
On Fri, 1 Mar 2024 at 06:15, Michael Paquier  wrote:
>
> On Fri, Mar 01, 2024 at 05:43:25AM +0100, Jelte Fennema-Nio wrote:
> > I think we should set the AM OID explicitly. Because an important
> > thing to consider is: What behaviour makes sense when later
> > default_table_access_method is changed?
> > I think if someone sets it explicitly on the partitioned table, they
> > would want the AM of the partitioned table to stay the same when
> > default_table_access_method is changed. Which requires storing the AM
> > OID afaict.
>
> Oops, I think I misread that.  You just mean to always set relam when
> using an AM in the SET ACCESS METHOD clause.  Apologies for the noise.

Correct, I intended to say that "SET ACCESS METHOD heap" on a
partitioned table should store heap's OID. Because while storing 0
might be simpler, it will result in (imho) wrong behaviour when later
the default_table_access_method is changed. I.e. it's not simply a small
detail in what the catalog looks like; there's an actual behavioural
change.




Re: Make query cancellation keys longer

2024-02-29 Thread Jelte Fennema-Nio
This is a preliminary review. I'll look at this more closely soon.

On Thu, 29 Feb 2024 at 22:26, Heikki Linnakangas  wrote:
> If the client requests the "_pq_.extended_query_cancel" protocol
> feature, the server will generate a longer 256-bit cancellation key.

Huge +1 for this general idea. This is a problem I ran into with
PgBouncer, and had to make some concessions when fitting the
information I wanted into the available bits. Also, from a security
perspective, I don't think the current number of bits has stood the
test of time.

+ ADD_STARTUP_OPTION("_pq_.extended_query_cancel", "");

Since this parameter doesn't actually take a value (just an empty
string), I think this behaviour makes more sense as a minor protocol
version bump instead of a parameter.

+ if (strcmp(conn->workBuffer.data, "_pq_.extended_query_cancel") == 0)
+ {
+ /* that's ok */
+ continue;
+ }

Please see this thread [1], which in the first few patches makes
pqGetNegotiateProtocolVersion3 actually usable for extending the
protocol. I started that because every protocol change that's
proposed on the list makes similar changes to
pqGetNegotiateProtocolVersion3, and I think we shouldn't make these
changes ad-hoc, hacked into the current code, but actually do them once
in a way that makes sense for all protocol changes.

> Documentation is still missing

I think at least protocol message type documentation would be very
helpful in reviewing, because that is really a core part of this
change. Based on the current code I think it should have a few
changes:

+ int cancel_key_len = 5 + msgLength - (conn->inCursor - conn->inStart);
+
+ conn->be_cancel_key = malloc(cancel_key_len);
+ if (conn->be_cancel_key == NULL)

This is using the message length to determine the length of the cancel
key in BackendKeyData. That is not something we generally do in the
protocol. It's even documented: "Notice that although each message
includes a byte count at the beginning, the message format is defined
so that the message end can be found without reference to the byte
count." So I think the patch should be changed to include the length
of the cancel key explicitly in the message.
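
A hypothetical sketch of what the receiving side could then look like (the
2-byte length field and the be_cancel_key_len field are assumptions for
illustration, not taken from the patch):

int     cancel_key_len;

if (pqGetInt(&cancel_key_len, 2, conn))     /* explicit length field */
    return EOF;
conn->be_cancel_key = malloc(cancel_key_len);
if (conn->be_cancel_key == NULL)
    return EOF;
if (pqGetnchar(conn->be_cancel_key, cancel_key_len, conn))
    return EOF;
conn->be_cancel_key_len = cancel_key_len;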

[1]: 
https://www.postgresql.org/message-id/flat/CAGECzQSr2%3DJPJHNN06E_jTF2%2B0E60K%3DhotyBw5wY%3Dq9Wvmt7DQ%40mail.gmail.com#359e4222eb161da37124be1a384f8d92




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Michael Paquier
On Fri, Mar 01, 2024 at 05:43:25AM +0100, Jelte Fennema-Nio wrote:
> I think we should set the AM OID explicitly. Because an important
> thing to consider is: What behaviour makes sense when later
> default_table_access_method is changed?
> I think if someone sets it explicitly on the partitioned table, they
> would want the AM of the partitioned table to stay the same when
> default_table_access_method is changed. Which requires storing the AM
> OID afaict.

Oops, I think I misread that.  You just mean to always set relam when
using an AM in the SET ACCESS METHOD clause.  Apologies for the noise.
--
Michael


signature.asc
Description: PGP signature


Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Michael Paquier
On Fri, Mar 01, 2024 at 05:34:05AM +0100, Jelte Fennema-Nio wrote:

> diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
> index fd91aefbcb7..32e25a1a6ea 100644
> --- a/src/include/mb/pg_wchar.h
> +++ b/src/include/mb/pg_wchar.h
> @@ -225,7 +225,8 @@ typedef unsigned int pg_wchar;
>   * PostgreSQL encoding identifiers
>   *
>   * WARNING: If you add some encoding don't forget to update
> - *   the pg_enc2name_tbl[] array (in src/common/encnames.c) 
> and
> + *   the pg_enc2name_tbl[] array (in src/common/encnames.c),
> + *   the pg_enc2gettext_tbl[] array (in 
> src/common/encnames.c) and
>   *   the pg_wchar_table[] array (in src/common/wchar.c) and 
> to check
>   *   PG_ENCODING_BE_LAST macro.

Mostly OK to me.  Just note that this comment is incorrect because
pg_enc2gettext_tbl[] includes elements in the range
[PG_SJIS,_PG_LAST_ENCODING_[  ;)
--
Michael


signature.asc
Description: PGP signature


Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Michael Paquier
On Fri, Mar 01, 2024 at 05:43:25AM +0100, Jelte Fennema-Nio wrote:
> I think we should set the AM OID explicitly. Because an important
> thing to consider is: What behaviour makes sense when later
> default_table_access_method is changed?

Per the latest discussion of the thread, we've kind of reached a
consensus that we should keep the current historical behavior on
default, where relam remains at 0, causing new partitions to grab the
GUC as AM.  If we create a partitioned table attached to a partitioned
table, it should be 0 as well.  If the partitioned table has a non-0
relam, a new partitioned table created on it will inherit the same
non-0 value. 

> I think if someone sets it explicitly on the partitioned table, they
> would want the AM of the partitioned table to stay the same when
> default_table_access_method is changed. Which requires storing the AM
> OID afaict.

If we allow relam to be non-0 for a partitioned table, it is equally
important to give users a way to reset it at will.  My point was a bit
more subtle than that.  For example, this sequence is clear to me:
SET default_table_access_method = 'foo';
ALTER TABLE part SET ACCESS METHOD DEFAULT;

The user wants to rely on the GUC, so relam should be 0, new
partitions created on it will use the GUC.

Now, what should this sequence mean?  See:
SET default_table_access_method = 'foo';
ALTER TABLE part SET ACCESS METHOD foo;

Should the relam be 0 because the user requested a match with the GUC,
or use the OID of the AM?  There has to be some difference with
tablespaces, because relations with physical storage (tables,
matviews) can use a reltablespace of 0, but AMs have to be set for
tables and matviews.

Fun topic, especially once coupled with the internals of tablecmds.c
that uses InvalidOid for the new access AM as a special value to work
as a no-op.
--
Michael


signature.asc
Description: PGP signature


Re: Add new error_action COPY ON_ERROR "log"

2024-02-29 Thread Michael Paquier
On Wed, Feb 28, 2024 at 12:10:00PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 26, 2024 at 5:47 PM torikoshia  wrote:
>> +   if (cstate->opts.on_error != COPY_ON_ERROR_STOP)
>> +   {
>> +   ereport(NOTICE,
>>
>> I think cstate->opts.on_error is not COPY_ON_ERROR_STOP here, since if
>> it is COPY_ON_ERROR_STOP, InputFunctionCallSafe() should already have
>> errored out.
>>
>> Should it be something like "Assert(cstate->opts.on_error !=
>> COPY_ON_ERROR_STOP)"?
> 
> Nice catch. When COPY_ON_ERROR_STOP is specified, we use ereport's
> soft error mechanism. An assertion seems a good choice to validate the
> state is what we expect. Done that way.

Hmm.  I am not really on board with this patch, that would generate
one NOTICE message each time a row is incompatible in the soft error
mode.  If you have a couple of billion rows to bulk-load into the
backend and even 0.01% of them are corrupted, you could finish with
more than 100k log entries, and all systems should be careful about
the log quantity generated, especially if we use the syslogger which
could become easily a bottleneck.

The existing ON_ERROR controls what to do on error.  I think that we'd
better control the amount of information reported with a completely
separate option, an option even different from where to redirect
errors (if required, which would be either the logs, the client, a
pipe, a combination of these, or even all of them).
--
Michael


signature.asc
Description: PGP signature


Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Jelte Fennema-Nio
On Fri, 1 Mar 2024 at 02:57, Michael Paquier  wrote:
> When it comes to partitioned tables, there is a still a tricky case:
> what should we do when a user specifies a non-default value in the SET
> ACCESS METHOD clause and it matches default_table_access_method?
> Should the relam be 0 or should we force relam to be the OID of the
> given value given by the query?  Implementation-wise, forcing the
> value to 0 is simpler, but I can get why it could be confusing as
> well, because the state of the catalogs does not reflect what was
> provided in the query.  At the same time, the user has explicitly set
> the access method to be the same as the default, so perhaps 0 makes
> sense anyway in this case.

I think we should set the AM OID explicitly. Because an important
thing to consider is: What behaviour makes sense when later
default_table_access_method is changed?
I think if someone sets it explicitly on the partitioned table, they
would want the AM of the partitioned table to stay the same when
default_table_access_method is changed. Which requires storing the AM
OID afaict.




Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Jelte Fennema-Nio
On Fri, 1 Mar 2024 at 05:12, Michael Paquier  wrote:
> Shouldn't PG_MULE_INTERNAL point to NULL in pg_enc2gettext_tbl[]?
> That just seems safer to me, and more consistent because its values
> satisfies PG_VALID_ENCODING().

Safety-wise it doesn't matter, because gaps in a designated
initializer array will be initialized with 0/NULL. But I agree it's
more consistent, so see attached.
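
As a standalone illustration of that C rule (not part of the patch): elements
omitted from a designated initializer are implicitly zero-initialized.

#include <stdio.h>

static const char *tbl[] = {
    [0] = "zero",
    [3] = "three",          /* indexes 1 and 2 are implicitly NULL */
};

int
main(void)
{
    printf("%s\n", tbl[1] ? tbl[1] : "(null)");     /* prints (null) */
    return 0;
}
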
From 1a45e03cdaf0c0cf962a33cea8cb25a7c2783d9d Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Tue, 27 Feb 2024 12:26:10 +0100
Subject: [PATCH v8] Simplify pg_enc2gettext_tbl

Use designated initialization of pg_enc2gettext_tbl to simplify the
implementation of raw_pg_bind_textdomain_codeset. Now iteration over the
array is not needed anymore. Instead the desired element can simply be
fetched by its index.
---
 src/backend/utils/mb/mbutils.c | 24 --
 src/common/encnames.c  | 86 +-
 src/include/mb/pg_wchar.h  | 11 ++---
 3 files changed, 56 insertions(+), 65 deletions(-)

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index c13f947a827..5818e752280 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -1188,24 +1188,20 @@ static bool
 raw_pg_bind_textdomain_codeset(const char *domainname, int encoding)
 {
 	bool		elog_ok = (CurrentMemoryContext != NULL);
-	int			i;
 
-	for (i = 0; pg_enc2gettext_tbl[i].name != NULL; i++)
+	if (!PG_VALID_ENCODING(encoding) || pg_enc2gettext_tbl[encoding] == NULL)
 	{
-		if (pg_enc2gettext_tbl[i].encoding == encoding)
-		{
-			if (bind_textdomain_codeset(domainname,
-		pg_enc2gettext_tbl[i].name) != NULL)
-return true;
+		return false;
+	}
 
-			if (elog_ok)
-elog(LOG, "bind_textdomain_codeset failed");
-			else
-write_stderr("bind_textdomain_codeset failed");
+	if (bind_textdomain_codeset(domainname,
+pg_enc2gettext_tbl[encoding]) != NULL)
+		return true;
 
-			break;
-		}
-	}
+	if (elog_ok)
+		elog(LOG, "bind_textdomain_codeset failed");
+	else
+		write_stderr("bind_textdomain_codeset failed");
 
 	return false;
 }
diff --git a/src/common/encnames.c b/src/common/encnames.c
index dba6bd2c9ee..910cc2c7e5c 100644
--- a/src/common/encnames.c
+++ b/src/common/encnames.c
@@ -357,50 +357,50 @@ const pg_enc2name pg_enc2name_tbl[] =
  * This covers all encodings except MULE_INTERNAL, which is alien to gettext.
  * --
  */
-const pg_enc2gettext pg_enc2gettext_tbl[] =
+const char *pg_enc2gettext_tbl[] =
 {
-	{PG_SQL_ASCII, "US-ASCII"},
-	{PG_UTF8, "UTF-8"},
-	{PG_LATIN1, "LATIN1"},
-	{PG_LATIN2, "LATIN2"},
-	{PG_LATIN3, "LATIN3"},
-	{PG_LATIN4, "LATIN4"},
-	{PG_ISO_8859_5, "ISO-8859-5"},
-	{PG_ISO_8859_6, "ISO_8859-6"},
-	{PG_ISO_8859_7, "ISO-8859-7"},
-	{PG_ISO_8859_8, "ISO-8859-8"},
-	{PG_LATIN5, "LATIN5"},
-	{PG_LATIN6, "LATIN6"},
-	{PG_LATIN7, "LATIN7"},
-	{PG_LATIN8, "LATIN8"},
-	{PG_LATIN9, "LATIN-9"},
-	{PG_LATIN10, "LATIN10"},
-	{PG_KOI8R, "KOI8-R"},
-	{PG_KOI8U, "KOI8-U"},
-	{PG_WIN1250, "CP1250"},
-	{PG_WIN1251, "CP1251"},
-	{PG_WIN1252, "CP1252"},
-	{PG_WIN1253, "CP1253"},
-	{PG_WIN1254, "CP1254"},
-	{PG_WIN1255, "CP1255"},
-	{PG_WIN1256, "CP1256"},
-	{PG_WIN1257, "CP1257"},
-	{PG_WIN1258, "CP1258"},
-	{PG_WIN866, "CP866"},
-	{PG_WIN874, "CP874"},
-	{PG_EUC_CN, "EUC-CN"},
-	{PG_EUC_JP, "EUC-JP"},
-	{PG_EUC_KR, "EUC-KR"},
-	{PG_EUC_TW, "EUC-TW"},
-	{PG_EUC_JIS_2004, "EUC-JP"},
-	{PG_SJIS, "SHIFT-JIS"},
-	{PG_BIG5, "BIG5"},
-	{PG_GBK, "GBK"},
-	{PG_UHC, "UHC"},
-	{PG_GB18030, "GB18030"},
-	{PG_JOHAB, "JOHAB"},
-	{PG_SHIFT_JIS_2004, "SHIFT_JISX0213"},
-	{0, NULL}
+	[PG_SQL_ASCII] = "US-ASCII",
+	[PG_UTF8] = "UTF-8",
+	[PG_MULE_INTERNAL] = NULL,
+	[PG_LATIN1] = "LATIN1",
+	[PG_LATIN2] = "LATIN2",
+	[PG_LATIN3] = "LATIN3",
+	[PG_LATIN4] = "LATIN4",
+	[PG_ISO_8859_5] = "ISO-8859-5",
+	[PG_ISO_8859_6] = "ISO_8859-6",
+	[PG_ISO_8859_7] = "ISO-8859-7",
+	[PG_ISO_8859_8] = "ISO-8859-8",
+	[PG_LATIN5] = "LATIN5",
+	[PG_LATIN6] = "LATIN6",
+	[PG_LATIN7] = "LATIN7",
+	[PG_LATIN8] = "LATIN8",
+	[PG_LATIN9] = "LATIN-9",
+	[PG_LATIN10] = "LATIN10",
+	[PG_KOI8R] = "KOI8-R",
+	[PG_KOI8U] = "KOI8-U",
+	[PG_WIN1250] = "CP1250",
+	[PG_WIN1251] = "CP1251",
+	[PG_WIN1252] = "CP1252",
+	[PG_WIN1253] = "CP1253",
+	[PG_WIN1254] = "CP1254",
+	[PG_WIN1255] = "CP1255",
+	[PG_WIN1256] = "CP1256",
+	[PG_WIN1257] = "CP1257",
+	[PG_WIN1258] = "CP1258",
+	[PG_WIN866] = "CP866",
+	[PG_WIN874] = "CP874",
+	[PG_EUC_CN] = "EUC-CN",
+	[PG_EUC_JP] = "EUC-JP",
+	[PG_EUC_KR] = "EUC-KR",
+	[PG_EUC_TW] = "EUC-TW",
+	[PG_EUC_JIS_2004] = "EUC-JP",
+	[PG_SJIS] = "SHIFT-JIS",
+	[PG_BIG5] = "BIG5",
+	[PG_GBK] = "GBK",
+	[PG_UHC] = "UHC",
+	[PG_GB18030] = "GB18030",
+	[PG_JOHAB] = "JOHAB",
+	[PG_SHIFT_JIS_2004] = "SHIFT_JISX0213",
 };
 
 
diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index fd91aefbcb7..32e25a1a6ea 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -225,7 +225,8 @@ typedef unsigned int pg_wchar;
  * PostgreSQL encoding identifiers
 

Re: Synchronizing slots from primary to standby

2024-02-29 Thread Amit Kapila
On Fri, Mar 1, 2024 at 8:58 AM shveta malik  wrote:
>
> On Wed, Feb 28, 2024 at 4:51 PM Amit Kapila  wrote:
> >
> > On Wed, Feb 28, 2024 at 3:26 PM shveta malik  wrote:
> > >
> > >
> > > Here is the patch which addresses the above comments. Also optimized
> > > the test a little bit. Now we use pg_sync_replication_slots() function
> > > instead of worker to test the operator-redirection using search-patch.
> > > This has been done to simplify the test case and reduce the added
> > > time.
> > >
> >
> > I have slightly adjusted the comments in the attached, otherwise, LGTM.
>
> This patch was pushed (commit: b3f6b14) and it resulted in BF failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-02-29%2012%3A49%3A27
>

Yeah, we forgot to allow proper authentication on Windows for
non-superusers used in the test. We need to use: "auth_extra => [
'--create-role', 'repl_role' ])" in the test. See attached. I'll do
some more testing with this and then push it.

-- 
With Regards,
Amit Kapila.


fix_user_auth_slot_sync_1.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-02-29 Thread Peter Smith
Here are some review comments for v102-0001.

==
doc/src/sgml/config.sgml

1.
+   
+Lists the streaming replication standby server slot names that logical
+WAL sender processes will wait for. Logical WAL sender processes will
+send decoded changes to plugins only after the specified replication
+slots confirm receiving WAL. This guarantees that logical replication
+slots with failover enabled do not consume changes until those changes
+are received and flushed to corresponding physical standbys. If a
+logical replication connection is meant to switch to a physical standby
+after the standby is promoted, the physical replication slot for the
+standby should be listed here. Note that logical replication will not
+proceed if the slots specified in the standby_slot_names do
not exist or
+are invalidated.
+   

Should this also mention the effect this GUC has on those 2 SQL
functions? E.g. Commit message says:

Additionally, The SQL functions pg_logical_slot_get_changes and
pg_replication_slot_advance are modified to wait for the replication
slots mentioned in 'standby_slot_names' to catch up before returning.

==
src/backend/replication/slot.c

2. validate_standby_slots

+ else if (!ReplicationSlotCtl)
+ {
+ /*
+ * We cannot validate the replication slot if the replication slots'
+ * data has not been initialized. This is ok as we will validate the
+ * specified slot when waiting for them to catch up. See
+ * StandbySlotsHaveCaughtup for details.
+ */
+ }
+ else
+ {
+ /*
+ * If the replication slots' data have been initialized, verify if the
+ * specified slots exist and are logical slots.
+ */
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

IMO that 2nd comment does not need to say "If the replication slots'
data have been initialized," because that is implicit from the
if/else.

~~~

3. GetStandbySlotList

+List *
+GetStandbySlotList(void)
+{
+ if (RecoveryInProgress())
+ return NIL;
+ else
+ return standby_slot_names_list;
+}

The 'else' is not needed. IMO it is better without it, but YMMV.
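
i.e., a sketch of the same function without the else:

List *
GetStandbySlotList(void)
{
    if (RecoveryInProgress())
        return NIL;

    return standby_slot_names_list;
}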

~~~

4. StandbySlotsHaveCaughtup

+/*
+ * Return true if the slots specified in standby_slot_names have caught up to
+ * the given WAL location, false otherwise.
+ *
+ * The elevel parameter determines the error level used for logging messages
+ * related to slots that do not exist, are invalidated, or are inactive.
+ */
+bool
+StandbySlotsHaveCaughtup(XLogRecPtr wait_for_lsn, int elevel)

/determines/specifies/

~

5.
+ /*
+ * Don't need to wait for the standby to catch up if there is no value in
+ * standby_slot_names.
+ */
+ if (!standby_slot_names_list)
+ return true;
+
+ /*
+ * If we are on a standby server, we do not need to wait for the standby to
+ * catch up since we do not support syncing slots to cascading standbys.
+ */
+ if (RecoveryInProgress())
+ return true;
+
+ /*
+ * Return true if all the standbys have already caught up to the passed in
+ * WAL localtion.
+ */
+ if (!XLogRecPtrIsInvalid(standby_slot_oldest_flush_lsn) &&
+ standby_slot_oldest_flush_lsn >= wait_for_lsn)
+ return true;


5a.
I felt all these comments should be worded in a consistent way like
"Don't need to wait ..."

e.g.
1. Don't need to wait for the standbys to catch up if there is no
value in 'standby_slot_names'.
2. Don't need to wait for the standbys to catch up if we are on a
standby server, since we do not support syncing slots to cascading
standbys.
3. Don't need to wait for the standbys to catch up if they are already
beyond the specified WAL location.

~

5b.
typo
/WAL localtion/WAL location/

~~~

6.
+ if (!slot)
+ {
+ /*
+ * It may happen that the slot specified in standby_slot_names GUC
+ * value is dropped, so let's skip over it.
+ */
+ ereport(elevel,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("replication slot \"%s\" specified in parameter %s does not exist",
+name, "standby_slot_names"));
+ continue;
+ }

Is "is dropped" strictly the only reason? IIUC another reason is the
slot maybe just did not even exist in the first place but it was not
detected before now because inititial validation was skipped.

~~~

7.
+ /*
+ * Return false if not all the standbys have caught up to the specified WAL
+ * location.
+ */
+ if (caught_up_slot_num != list_length(standby_slot_names_list))
+ return false;

Somehow it seems complicated to have a counter of the slots as you
process them, and then compare that counter to the list_length to determine
if one of them was skipped.

Probably simpler just to set a 'skipped' flag whenever you do 'continue'...

==
src/backend/replication/walsender.c

8.
+/*
+ * Returns true if there are not enough WALs to be read, or if not all standbys
+ * have caught up to the flushed position when failover_slot is true;
+ * otherwise, returns false.
+ *
+ * Set prioritize_stop to true to skip waiting for WALs if the shutdown signal
+ * is received.
+ *
+ * Set failover_slot to true if the 

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:37:55 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Anyway, our current policy here is to avoid record-rereads beyond
> source switches. However, fixing this seems to require that source
> switches cause record rereads unless some additional information is
> available to know if replay LSN needs to back up.

It seems to me that the error messages are related to commit 0668719801.

XLogPageRead:
> * Check the page header immediately, so that we can retry immediately if
> * it's not valid. This may seem unnecessary, because ReadPageInternal()
> * validates the page header anyway, and would propagate the failure up to
> * ReadRecord(), which would retry. However, there's a corner case with
> * continuation records, if a record is split across two pages such that
> * we would need to read the two pages from different sources. For
...
>   if (StandbyMode &&
>   !XLogReaderValidatePageHeader(xlogreader, targetPagePtr, 
> readBuf))
>   {
>   /*
>* Emit this error right now then retry this page immediately. 
> Use
>* errmsg_internal() because the message was already translated.
>*/
>   if (xlogreader->errormsg_buf[0])
>   ereport(emode_for_corrupt_record(emode, 
> xlogreader->EndRecPtr),
>   (errmsg_internal("%s", 
> xlogreader->errormsg_buf)));

This code intends to prevent a page header error from causing a record
reread, when a record is required to be read from multiple sources. We
could restrict this to only fire at segment boundaries. At segment
boundaries, we won't let LSN back up by using XLP_FIRST_IS_CONTRECORD.

Having thought up to this point, I now believe that we should
completely prevent LSN from going back in any case. One drawback is
that the fix cannot be back-patched.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
I'm not convinced it's freeing all the memory. It's been many months
since we discussed this last, but IIRC we cannot just tell DSA to free
all its segments, right? Is there currently anything preventing us
from destroying the whole DSA area at once?

+ /* The last level node has pointers to values */
+ if (shift == 0)
+ {
+   dsa_free(tree->dsa, ptr);
+   return;
+ }

IIUC, this doesn't actually free leaves, it only frees the last-level
node. And, this function is unaware of whether children could be
embedded values. I'm thinking we need to get rid of the above
pre-check and instead, each node kind to have something like (e.g.
node4):

RT_PTR_ALLOC child = n4->children[i];

if (shift > 0)
  RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
else if (!RT_CHILDPTR_IS_VALUE(child))
  dsa_free(tree->dsa, child);

...or am I missing something?




Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 04:01:47AM +0100, Jelte Fennema-Nio wrote:
> On Thu, 29 Feb 2024 at 01:57, Michael Paquier  wrote:
>> I have doubts about the changes in raw_pg_bind_textdomain_codeset(),
>> as the encoding could come from the value in the pg_database tuples
>> themselves.  The current coding is slightly safer from the perspective
>> of bogus input values as we would loop over pg_enc2gettext_tbl looking
>> for a match.  0003 changes that so as we could point to incorrect
>> memory areas rather than fail safely for the NULL check.
> 
> That's fair. Attached is a patch that adds a PG_VALID_ENCODING check
> to raw_pg_bind_textdomain_codeset to solve this regression.

-   for (i = 0; pg_enc2gettext_tbl[i].name != NULL; i++)
+   if (!PG_VALID_ENCODING(encoding) || pg_enc2gettext_tbl[encoding] == NULL)
    {

Shouldn't PG_MULE_INTERNAL point to NULL in pg_enc2gettext_tbl[]?
That just seems safer to me, and more consistent because its value
satisfies PG_VALID_ENCODING().
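
For illustration, a sketch with only a few encodings shown:

static const char *const pg_enc2gettext_tbl[] = {
    [PG_UTF8] = "UTF-8",
    [PG_LATIN1] = "LATIN1",
    /* ... one entry per encoding ... */
    [PG_MULE_INTERNAL] = NULL,  /* no gettext codeset; NULL check fails */
};
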
--
Michael




Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 12:41:38PM +0100, Peter Eisentraut wrote:
> On 27.02.24 08:57, Alvaro Herrera wrote:
>> On 2024-Feb-27, Michael Paquier wrote:
>>> These would cause compilation failures.  Saying that, this is a very
>>> nice cleanup, so I've fixed these and applied the patch after checking
>>> that the one-one replacements were correct.
>> 
>> Oh, I thought we were going to get rid of ObjectClass altogether -- I
>> mean, have getObjectClass() return ObjectAddress->classId, and then
>> define the OCLASS values for each catalog OID [... tries to ...]  But
>> this(*) doesn't work for two reasons:
> 
> I have long wondered what the point of ObjectClass is.  I find the extra
> layer of redirection, which is used only in small parts of the code, and the
> similarity to ObjectType confusing.  I happened to have a draft patch for
> its removal lying around, so I'll show it here, rebased over what has
> already been done in this thread.

The elimination of getObjectClass() seems like a good end goal IMO, so
the direction of the patch is interesting.  Would object_type_map and
ObjectProperty follow the same idea of relying on the catalogs OID
instead of the OBJECT_*?

Note that there are still two dependencies to getObjectClass() in
event_trigger.c and dependency.c.
--
Michael




Re: Statistics Import and Export

2024-02-29 Thread Corey Huinker
>
>
> That's certainly a fair point and my initial reaction (which could
> certainly be wrong) is that it's unlikely to be an issue, but also, if you
> feel you could make it work with an array and passing all the attribute
> info in with one call, which I suspect would be possible but just a bit
> more complex to build, then sure, go for it. If it ends up being overly
> unwieldy then perhaps the per-attribute call would be better and we could
> perhaps acquire the lock before the function calls..?  Doing a check to see
> if we have already locked it would be cheaper than trying to acquire a new
> lock, I'm fairly sure.
>

Well, the do_analyze() code was already ok with acquiring the lock once for
non-inherited stats and again for inherited stats, so the locks were
already not the end of the world. However, that's at most 2x the
locking required, and this would be natts * x, quite a bit more. Having the
procedures check for a pre-existing lock seems like a good compromise.
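
A sketch of what such a check could look like (the lock mode and exact lmgr
calls here are assumptions):

LOCKTAG     tag;

SET_LOCKTAG_RELATION(tag, MyDatabaseId, relid);

/* acquire the relation lock only if this backend does not already hold it */
if (!LockHeldByMe(&tag, ShareUpdateExclusiveLock))
    LockRelationOid(relid, ShareUpdateExclusiveLock);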


> Also per our prior discussion, this makes sense to include in the post-data
> section, imv, and also because then we have the indexes we may wish to load
> stats for, but further that also means it'll be in the parallelizable part
> of the process, making me a bit less concerned overall about the individual
> timing.
>

The ability to parallelize is pretty persuasive. But is that per-statement
parallelization or do we get transaction blocks? i.e. if we ended up
importing stats like this:

BEGIN;
LOCK TABLE schema.relation IN SHARE UPDATE EXCLUSIVE MODE;
LOCK TABLE pg_catalog.pg_statistic IN ROW EXCLUSIVE MODE;
SELECT pg_import_rel_stats('schema.relation', ntuples, npages);
SELECT pg_import_pg_statistic('schema.relation', 'id', ...);
SELECT pg_import_pg_statistic('schema.relation', 'name', ...);
SELECT pg_import_pg_statistic('schema.relation', 'description', ...);
...
COMMIT;


RE: Synchronizing slots from primary to standby

2024-02-29 Thread Zhijie Hou (Fujitsu)
On Friday, March 1, 2024 10:17 AM Zhijie Hou (Fujitsu)  
wrote:
> 
> 
> Attach the V102 patch set which addressed Amit and Shveta's comments.
> Thanks Shveta for helping addressing the comments off-list.

The cfbot reported a compile warning; here is the new version patch, which
fixes it. Also removed some outdated comments in this version.

Best Regards,
Hou zj


v102_2-0001-Allow-logical-walsenders-to-wait-for-the-physi.patch


v102_2-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 12:04:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > After reading this, I came up with a possibility that walreceiver
> > recovers more quickly than the calling interval to
> > WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> > to the function WaitForWAL...(), and then somehow recovers the
> > connection before the next call, the function doesn't notice the
> > disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> > is correct, the correct fix might be for us to return XLREAD_FAIL when
> > reconnection happens after the last call to the WaitForWAL...()
> > function.
> 
> That was my mistake; the function performs reconnection by itself.

Anyway, our current policy here is to avoid record-rereads beyond
source switches. However, fixing this seems to require that source
switches cause record rereads unless some additional information is
available to know if replay LSN needs to back up.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Synchronizing slots from primary to standby

2024-02-29 Thread shveta malik
On Wed, Feb 28, 2024 at 4:51 PM Amit Kapila  wrote:
>
> On Wed, Feb 28, 2024 at 3:26 PM shveta malik  wrote:
> >
> >
> > Here is the patch which addresses the above comments. Also optimized
> > the test a little bit. Now we use pg_sync_replication_slots() function
> > instead of worker to test the operator-redirection using search-patch.
> > This has been done to simplify the test case and reduce the added
> > time.
> >
>
> I have slightly adjusted the comments in the attached, otherwise, LGTM.

This patch was pushed (commit: b3f6b14) and it resulted in a BF failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo=2024-02-29%2012%3A49%3A27

The concerned log on standby1:
2024-02-29 14:23:16.738 UTC [3908:4]
040_standby_failover_slots_sync.pl LOG:  statement: SELECT
pg_sync_replication_slots();
The system cannot find the file specified.
2024-02-29 14:23:16.971 UTC [3908:5]
040_standby_failover_slots_sync.pl ERROR:  could not connect to the
primary server: connection to server at "127.0.0.1", port 65352
failed: FATAL:  SSPI authentication failed for user "repl_role"
2024-02-29 14:23:16.971 UTC [3908:6]
040_standby_failover_slots_sync.pl STATEMENT:  SELECT
pg_sync_replication_slots();

It seems authentication is failing for the newly added role. We also see
method=sspi used in the publisher log. We are analysing it further and
will share the findings.


thanks
Shveta




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 01 Mar 2024 10:29:12 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> After reading this, I came up with a possibility that walreceiver
> recovers more quickly than the calling interval to
> WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
> to the function WaitForWAL...(), and then somehow recovers the
> connection before the next call, the function doesn't notice the
> disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
> is correct, the correct fix might be for us to return XLREAD_FAIL when
> reconnection happens after the last call to the WaitForWAL...()
> function.

That was my mistake; the function performs reconnection by itself.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Synchronizing slots from primary to standby

2024-02-29 Thread Zhijie Hou (Fujitsu)
On Thursday, February 29, 2024 5:54 PM Amit Kapila  
wrote:
> 
> On Thu, Feb 29, 2024 at 8:29 AM Peter Smith 
> wrote:
> >
> > On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > > On Tuesday, February 27, 2024 3:18 PM Peter Smith
>  wrote:
> > ...
> > > > 20.
> > > > +#
> > > > +# | > standby1 (primary_slot_name = sb1_slot) # | > standby2
> > > > +(primary_slot_name = sb2_slot) # primary - | # | > subscriber1
> > > > +(failover = true) # | > subscriber2 (failover = false)
> > > >
> > > > In the diagram, the "--->" means a mixture of physical standbys and
> logical
> > > > pub/sub replication. Maybe it can be a bit clearer?
> > > >
> > > > SUGGESTION
> > > >
> > > > # primary (publisher)
> > > > #
> > > > # (physical standbys)
> > > > # | > standby1 (primary_slot_name = sb1_slot)
> > > > # | > standby2 (primary_slot_name = sb2_slot)
> > > > #
> > > > # (logical replication)
> > > > # | > subscriber1 (failover = true, slot_name = lsub1_slot)
> > > > # | > subscriber2 (failover = false, slot_name = lsub2_slot)
> > > >
> > >
> > > I think one can distinguish it based on the 'standby' and 'subscriber' as
> > > well, because 'standby' normally refers to a physical standby while the
> > > other refers to logical replication.
> > >
> 
> I think Peter's suggestion will make the setup clear.

Changed.

> 
> >
> > Ok, but shouldn't it at least include info about the logical slot
> > names associated with the subscribers (slot_name = lsub1_slot,
> > slot_name = lsub2_slot) like suggested above?
> >
> > ==
> >
> > Here are some more review comments for v100-0001
> >
> > ==
> > doc/src/sgml/config.sgml
> >
> > 1.
> > +   
> > +Lists the streaming replication standby server slot names that
> logical
> > +WAL sender processes will wait for. Logical WAL sender processes
> will
> > +send decoded changes to plugins only after the specified
> replication
> > +slots confirm receiving WAL. This guarantees that logical 
> > replication
> > +slots with failover enabled do not consume changes until those
> changes
> > +are received and flushed to corresponding physical standbys. If a
> > +logical replication connection is meant to switch to a physical
> standby
> > +after the standby is promoted, the physical replication slot for 
> > the
> > +standby should be listed here. Note that logical replication will 
> > not
> > +proceed if the slots specified in the standby_slot_names do
> > not exist or
> > +are invalidated.
> > +   
> >
> > Is that note ("Note that logical replication will not proceed if the
> > slots specified in the standby_slot_names do not exist or are
> > invalidated") meant only for subscriptions marked for 'failover' or
> > any subscription? Maybe wording can be modified to help clarify it?
> >
> 
> I think it is implicit that here we are talking about failover slots.
> I think clarifying again the same could be repetitive considering the
> previous sentence: "This guarantees that logical replication slots
> with failover enabled do not consume .." have mentioned it.
> 
> > ==
> > src/backend/replication/slot.c
> >
> > 2.
> > +/*
> > + * A helper function to validate slots specified in GUC standby_slot_names.
> > + */
> > +static bool
> > +validate_standby_slots(char **newval)
> > +{
> > + char*rawname;
> > + List*elemlist;
> > + bool ok;
> > +
> > + /* Need a modifiable copy of string */
> > + rawname = pstrdup(*newval);
> > +
> > + /* Verify syntax and parse string into a list of identifiers */
> > + ok = SplitIdentifierString(rawname, ',', &elemlist);
> > +
> > + if (!ok)
> > + {
> > + GUC_check_errdetail("List syntax is invalid.");
> > + }
> > +
> > + /*
> > + * If the replication slots' data have been initialized, verify if the
> > + * specified slots exist and are logical slots.
> > + */
> > + else if (ReplicationSlotCtl)
> > + {
> > + foreach_ptr(char, name, elemlist)
> > + {
> > + ReplicationSlot *slot;
> > +
> > + slot = SearchNamedReplicationSlot(name, true);
> > +
> > + if (!slot)
> > + {
> > + GUC_check_errdetail("replication slot \"%s\" does not exist",
> > + name);
> > + ok = false;
> > + break;
> > + }
> > +
> > + if (!SlotIsPhysical(slot))
> > + {
> > + GUC_check_errdetail("\"%s\" is not a physical replication slot",
> > + name);
> > + ok = false;
> > + break;
> > + }
> > + }
> > + }
> > +
> > + pfree(rawname);
> > + list_free(elemlist);
> > + return ok;
> > +}
> >
> > 2a.
> > I didn't mention this previously because I thought this function was
> > not going to change anymore, but since Bertrand suggested some changes
> > [1], I will say IMO the { } are fine here for the single statement,
> > but I think it will be better to rearrange this code to be like below.
> > Having a 2nd NOP 'else' gives a much better place where you can put
> > your ReplicationSlotCtl comment.
> >
> > if (!ok)
> > {
> >   
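
The message is cut off in the archive at this point; from the context, the
rearrangement being suggested is presumably along these lines (a sketch):

if (!ok)
{
    GUC_check_errdetail("List syntax is invalid.");
}
else if (!ReplicationSlotCtl)
{
    /*
     * NOP: replication slot data is not initialized yet (e.g. during
     * startup), so the slots cannot be verified here.
     */
}
else
{
    foreach_ptr(char, name, elemlist)
    {
        /* ... existing slot existence and type checks ... */
    }
}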

RE: Synchronizing slots from primary to standby

2024-02-29 Thread Zhijie Hou (Fujitsu)
On Thursday, February 29, 2024 7:36 PM shveta malik  
wrote:
> 
> On Thu, Feb 29, 2024 at 7:04 AM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > Here is the v101 patch set which addressed above comments.
> >
> 
> Thanks for the patch. Few comments:
> 
> 1)  Shall we mention in doc that shutdown will wait for standbys in
> standby_slot_names to confirm receiving WAL:
> 
> Suggestion for logicaldecoding.sgml:
> 
> When standby_slot_names is utilized, the primary
> server will not completely shut down until the corresponding standbys,
> associated with the physical replication slots specified in
> standby_slot_names, have confirmed receiving the
> WAL up to the latest flushed position on the primary server.
> 
> slot.c
> 2)
> /*
>  * If a logical slot name is provided in standby_slot_names, report
>  * a message and skip it. Although logical slots are disallowed in
>  * the GUC check_hook(validate_standby_slots), it is still
>  * possible for a user to drop an existing physical slot and
>  * recreate a logical slot with the same name.
>  */
> 
> This is not completely true: we can still specify a logical slot during
> instance start and it will be accepted.
> 
> Suggestion:
> /*
>  * If a logical slot name is provided in standby_slot_names, report
>  * a message and skip it. It is possible for user to specify a
>  * logical slot name in standby_slot_names just before the server
>  * startup. The GUC check_hook(validate_standby_slots) can not
>  * validate such a slot during startup as the ReplicationSlotCtl
>  * shared memory is not initialized by that time. It is also
>  * possible for user to drop an existing physical slot and
>  * recreate a logical slot with the same name.
>  */
> 
> 3. Wait for physical standby to confirm receiving the given lsn
> 
> standby -->standbys
> 
> 
> 4.
> In StandbyConfirmedFlush(), is it better to have below errdetail in all
> problematic cases:
> Logical replication is waiting on the standby associated with \"%s\"
> 
> We have it only for inactive pid case but we are waiting in all cases.

Thanks for the comments, I have addressed them.

Best Regards,
Hou zj


RE: Synchronizing slots from primary to standby

2024-02-29 Thread Zhijie Hou (Fujitsu)
On Thursday, February 29, 2024 7:17 PM Amit Kapila  
wrote:
> 
> On Thu, Feb 29, 2024 at 3:23 PM Amit Kapila 
> wrote:
> >
> > Few additional comments on the latest patch:
> > =
> > 1.
> >  static XLogRecPtr
> >  WalSndWaitForWal(XLogRecPtr loc)
> > {
> > ...
> > + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr &&
> > + (!replication_active || StandbyConfirmedFlush(loc, WARNING))) {
> > + /*
> > + * Fast path to avoid acquiring the spinlock in case we already know
> > + * we have enough WAL available and all the standby servers have
> > + * confirmed receipt of WAL up to RecentFlushPtr. This is
> > + particularly
> > + * interesting if we're far behind.
> > + */
> >   return RecentFlushPtr;
> > + }
> > ...
> > ...
> > + * Wait for WALs to be flushed to disk only if the postmaster has not
> > + * asked us to stop.
> > + */
> > + if (loc > RecentFlushPtr && !got_STOPPING) wait_event =
> > + WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> > +
> > + /*
> > + * Check if the standby slots have caught up to the flushed position.
> > + * It is good to wait up to RecentFlushPtr and then let it send the
> > + * changes to logical subscribers one by one which are already
> > + covered
> > + * in RecentFlushPtr without needing to wait on every change for
> > + * standby confirmation. Note that after receiving the shutdown
> > + signal,
> > + * an ERROR is reported if any slots are dropped, invalidated, or
> > + * inactive. This measure is taken to prevent the walsender from
> > + * waiting indefinitely.
> > + */
> > + else if (replication_active &&
> > + !StandbyConfirmedFlush(RecentFlushPtr, WARNING)) { wait_event =
> > + WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> > + wait_for_standby = true;
> > + }
> > + else
> > + {
> > + /* Already caught up and doesn't need to wait for standby_slots. */
> >   break;
> > + }
> > ...
> > }
> >
> > Can we try to move these checks into a separate function that returns
> > a boolean and has an out parameter as wait_event?
> >
> > 2. How about naming StandbyConfirmedFlush() as
> StandbySlotsAreCaughtup?
> >
> 
> Some more comments:
> ==
> 1.
> + foreach_ptr(char, name, elemlist)
> + {
> + ReplicationSlot *slot;
> +
> + slot = SearchNamedReplicationSlot(name, true);
> +
> + if (!slot)
> + {
> + GUC_check_errdetail("replication slot \"%s\" does not exist", name);
> + ok = false; break; }
> +
> + if (!SlotIsPhysical(slot))
> + {
> + GUC_check_errdetail("\"%s\" is not a physical replication slot",
> + name); ok = false; break; } }
> 
> Won't the second check need protection via ReplicationSlotControlLock?

Yes, added.
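
Roughly along these lines (a sketch; see the attached patch for the exact
code):

LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

foreach_ptr(char, name, elemlist)
{
    ReplicationSlot *slot;

    /* pass need_lock = false, since we already hold the LWLock */
    slot = SearchNamedReplicationSlot(name, false);

    if (!slot || !SlotIsPhysical(slot))
    {
        ok = false;
        break;
    }
}

LWLockRelease(ReplicationSlotControlLock);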

> 
> 2. In WaitForStandbyConfirmation(), we are anyway calling
> StandbyConfirmedFlush() before the actual sleep in the loop, so does calling 
> it
> at the beginning of the function will serve any purpose? If so, it is better 
> to add
> some comments explaining the same.

It is used as a fast path to avoid the condition-variable machinery. I think
we can directly put the failover check and the list check at the beginning
instead of calling that function.

> 
> 3. Also do we need to perform the validation checks done in
> StandbyConfirmedFlush() repeatedly when it is invoked in a loop? We can
> probably separate those checks and perform them just once.

I have removed the slot.failover check from the StandbyConfirmedFlush
function, so that we can do it only when necessary. I didn't change the check
for standby_slot_names_list because the list can be changed in the loop
when reloading the config.

> 
> 4.
> +   *
> +   * XXX: If needed, this can be attempted in future.
> 
> Remove this part of the comment.

Removed.

Attach the V102 patch set which addressed Amit and Shveta's comments.
Thanks Shveta for helping addressing the comments off-list.

Best Regards,
Hou zj


v102-0002-Document-the-steps-to-check-if-the-standby-is-r.patch


v102-0001-Allow-logical-walsenders-to-wait-for-the-physic.patch


Re: Propagate sanity checks of ProcessUtility() to standard_ProcessUtility()?

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 04:10:26PM +0800, jian he wrote:
> why not just shovel these to standard_ProcessUtility,
> so ProcessUtility will look consistent (in format) with
>  * ExecutorStart()
>  * ExecutorRun()
>  * ExecutorFinish()
>  * ExecutorEnd()

That's one of the points of the change: checking that only in
standard_ProcessUtility() may not be sufficient for utility hooks that
don't call standard_ProcessUtility(), so you'd still want one in
ProcessUtility().
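
For reference, the shape would be roughly the following (a sketch; the
Asserts mirror what standard_ProcessUtility() checks today):

void
ProcessUtility(PlannedStmt *pstmt, const char *queryString,
               bool readOnlyTree, ProcessUtilityContext context,
               ParamListInfo params, QueryEnvironment *queryEnv,
               DestReceiver *dest, QueryCompletion *qc)
{
    Assert(IsA(pstmt, PlannedStmt));
    Assert(pstmt->commandType == CMD_UTILITY);
    Assert(queryString != NULL);    /* required as of 8.4 */
    Assert(qc == NULL || qc->commandTag == CMDTAG_UNKNOWN);

    if (ProcessUtility_hook)
        (*ProcessUtility_hook) (pstmt, queryString, readOnlyTree,
                                context, params, queryEnv, dest, qc);
    else
        standard_ProcessUtility(pstmt, queryString, readOnlyTree,
                                context, params, queryEnv, dest, qc);
}
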
--
Michael




Re: Injection points: some tools to wait and wake

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 06:19:58PM +0500, Andrey M. Borodin wrote:
> Works fine for me. Thanks!

+ "timed out waiting for the backend type to wait for the injection 
point name";

The log should provide some context about the caller failing, meaning
that the backend type and the injection point name should be mentioned
in these logs to help in debugging issues.
--
Michael




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 08:51:31AM -0600, Justin Pryzby wrote:
> On Wed, Feb 28, 2024 at 05:08:49PM +0900, Michael Paquier wrote:
>> I have implemented that so as we keep the default, historical
>> behavior: if pg_class.relam is 0 for a partitioned table, use the AM
>> defined by default_table_access_method.  The patch only adds a path to
>> switch to a different AM than the GUC when creating a new partition if
>> and only if a partitioned table has been manipulated with ALTER TABLE
>> SET ACCESS METHOD to update its AM to something else than the GUC.
>> Similarly to tablespaces, CREATE TABLE USING is *not* supported for
>> partitioned tables, same behavior as previously.
> 
> This patch allows resetting relam=0 by running ALTER TABLE SET AM to the
> same value as the GUC.  Maybe it'd be better to have an explicit SET
> DEFAULT (as in b9424d01 and 4f622503).

Outside the scope of this patch's thread, this looks like a good idea
even for tables/matviews.  And the semantics are pretty easy: if DEFAULT
is specified, just set the access method to NULL in the parser and let
tablecmds.c do the AM OID lookup in the prep phase when it is NULL.
See 0001 attached.  This one looks pretty good taken as an independent
piece.

When it comes to partitioned tables, there is still a tricky case:
what should we do when a user specifies a non-default value in the SET
ACCESS METHOD clause and it matches default_table_access_method?
Should the relam be 0 or should we force relam to be the OID of the
given value given by the query?  Implementation-wise, forcing the
value to 0 is simpler, but I can get why it could be confusing as
well, because the state of the catalogs does not reflect what was
provided in the query.  At the same time, the user has explicitly set
the access method to be the same as the default, so perhaps 0 makes
sense anyway in this case.

0002 does that, as that's simpler.  I'm not sure if there is a case
for forcing a value in relam if the query has the same value as the
default.  Thoughts?
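
Forcing the value to 0 could be as simple as this, in ATPrepSetAccessMethod()
(hypothetical sketch, not what 0002 literally does):

if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
    amoid == get_table_am_oid(default_table_access_method, false))
    amoid = InvalidOid;     /* keep following default_table_access_method */
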
--
Michael
From d8f1094415c34eaaae46c8d05b8abc83693b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 1 Mar 2024 10:22:22 +0900
Subject: [PATCH v2 1/2] Add DEFAULT option to ALTER TABLE SET ACCESS METHOD

---
 src/backend/commands/tablecmds.c|  3 ++-
 src/backend/parser/gram.y   | 10 --
 src/test/regress/expected/create_am.out | 21 +
 src/test/regress/sql/create_am.sql  | 11 +++
 doc/src/sgml/ref/alter_table.sgml   |  6 --
 5 files changed, 46 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f798794556..3b1c2590fd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -15212,7 +15212,8 @@ ATPrepSetAccessMethod(AlteredTableInfo *tab, Relation rel, const char *amname)
 	Oid			amoid;
 
 	/* Check that the table access method exists */
-	amoid = get_table_am_oid(amname, false);
+	amoid = get_table_am_oid(amname ? amname : default_table_access_method,
+			 false);
 
 	if (rel->rd_rel->relam == amoid)
 		return;
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 130f7fc7c3..c6e2f679fd 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -338,6 +338,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type <list>	alter_identity_column_option_list
 %type <defelt>	alter_identity_column_option
 %type <node>	set_statistics_value
+%type <str>		set_access_method_name
 
 %type 	createdb_opt_list createdb_opt_items copy_opt_list
 transaction_mode_list
@@ -2859,8 +2860,8 @@ alter_table_cmd:
 	n->newowner = $3;
 	$$ = (Node *) n;
 }
-			/* ALTER TABLE <name> SET ACCESS METHOD <amname> */
-			| SET ACCESS METHOD name
+			/* ALTER TABLE <name> SET ACCESS METHOD { <amname> | DEFAULT } */
+			| SET ACCESS METHOD set_access_method_name
 {
 	AlterTableCmd *n = makeNode(AlterTableCmd);
 
@@ -3076,6 +3077,11 @@ set_statistics_value:
 			| DEFAULT		{ $$ = NULL; }
 		;
 
+set_access_method_name:
+			ColId			{ $$ = $1; }
+			| DEFAULT		{ $$ = NULL; }
+		;
+
 PartitionBoundSpec:
 			/* a HASH partition */
 			FOR VALUES WITH '(' hash_partbound ')'
diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
index b50293d514..e843d39ee7 100644
--- a/src/test/regress/expected/create_am.out
+++ b/src/test/regress/expected/create_am.out
@@ -283,6 +283,27 @@ SELECT COUNT(a), COUNT(1) FILTER(WHERE a=1) FROM heaptable;
  9 | 1
 (1 row)
 
+-- DEFAULT access method
+BEGIN;
+SET LOCAL default_table_access_method TO heap2;
+ALTER TABLE heaptable SET ACCESS METHOD DEFAULT;
+SELECT amname FROM pg_class c, pg_am am
+  WHERE c.relam = am.oid AND c.oid = 'heaptable'::regclass;
+ amname 
+--------
+ heap2
+(1 row)
+
+SET LOCAL default_table_access_method TO heap;
+ALTER TABLE heaptable SET ACCESS METHOD DEFAULT;
+SELECT amname FROM pg_class c, pg_am am
+  WHERE c.relam = 

Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Kyotaro Horiguchi
At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier  wrote 
in 
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi 
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> > 
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
> 
> Ugh.  If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me.  I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required.  It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
> 
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> > 
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
> 
> What happens if the continuation record spans multiple segment
> file boundaries in this case?  We would go back to the beginning of
> the segment where the record spanning multiple segments began,
> right?  (I may recall this part of contrecords incorrectly, feel free
> to correct me if necessary.)

After reading this, I came up with a possibility that walreceiver
recovers more quickly than the calling interval to
WaitForWALtoBecomeAvailable(). If walreceiver disconnects after a call
to the function WaitForWAL...(), and then somehow recovers the
connection before the next call, the function doesn't notice the
disconnection and returns XLREAD_SUCCESS wrongly. If this assumption
is correct, the correct fix might be for us to return XLREAD_FAIL when
reconnection happens after the last call to the WaitForWAL...()
function.

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
> > 
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that the record spans on multiple pages is not fully
> > received and flushed.
> 
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue happened, but if my assumption above is correct, the test
doesn't manifest the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code.  The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree to the point.

> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN.  Now the
> level of response also depends on the invasiness of the change, and we
> need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-29 Thread Melanie Plageman
On Thu, Feb 29, 2024 at 6:44 PM Tomas Vondra
 wrote:
>
> On 2/29/24 23:44, Tomas Vondra wrote:
> >
> > ...
> >
> >>>
> >>> I do have some partial results, comparing the patches. I only ran one of
> >>> the more affected workloads (cyclic) on the xeon, attached is a PDF
> >>> comparing master and the 0001-0014 patches. The percentages are timing
> >>> vs. the preceding patch (green - faster, red - slower).
> >>
> >> Just confirming: the results are for uncached?
> >>
> >
> > Yes, cyclic data set, uncached case. I picked this because it seemed
> > like one of the most affected cases. Do you want me to test some other
> > cases too?
> >
>
> BTW I decided to look at the data from a slightly different angle and
> compare the behavior with increasing effective_io_concurrency. Attached
> are charts for three "uncached" cases:
>
>  * uniform, work_mem=4MB, workers_per_gather=0
>  * linear-fuzz, work_mem=4MB, workers_per_gather=0
>  * uniform, work_mem=4MB, workers_per_gather=4
>
> Each page has charts for master and patched build (with all patches). I
> think there's a pretty obvious difference in how increasing e_i_c
> affects the two builds:

Wow! These visualizations make it exceptionally clear. I want to go to
the Vondra school of data visualizations for performance results!

> 1) On master there's clear difference between eic=0 and eic=1 cases, but
> on the patched build there's literally no difference - for example the
> "uniform" distribution is clearly not great for prefetching, but eic=0
> regresses to eic=1 poor behavior).

Yes, so eic=0 and eic=1 are identical with the streaming read API.
That is, eic 0 does not disable prefetching. Thomas is going to update
the streaming read API to avoid issuing an fadvise for the last block
in a range before issuing a read -- which would mean no prefetching
with eic 0 and eic 1. Not doing prefetching with eic 1 actually seems
like the right behavior -- which would be different than what master
is doing, right?

Hopefully this fixes the clear difference between master and the
patched version at eic 0.

> 2) For some reason, the prefetching with eic>1 perform much better with
> the patches, except for with very low selectivity values (close to 0%).
> Not sure why this is happening - either the overhead is much lower
> (which would matter on these "adversarial" data distribution, but how
> could that be when fadvise is not free), or it ends up not doing any
> prefetching (but then what about (1)?).

For the uniform case with four parallel workers, eic == 0 being worse than
master makes sense for the above reason. But I'm not totally sure why
eic == 1 would be worse with the patch than with master. Both are
doing a (somewhat useless) prefetch.

With very low selectivity, you are less likely to get readahead
(right?) and similarly less likely to be able to build up > 8kB IOs --
which is one of the main value propositions of the streaming read
code. I imagine that this larger read benefit is part of why the
performance is better at higher selectivities with the patch. This
might be a silly experiment, but we could try decreasing
MAX_BUFFERS_PER_TRANSFER on the patched version and see if the
performance gains go away.

> 3) I'm not sure about the linear-fuzz case; the only explanation I have is
> that we're able to skip almost all of the prefetches (and read-ahead likely
> works pretty well here).

I started looking at the data generated by linear-fuzz to understand
exactly what effect the fuzz was having but haven't had time to really
understand the characteristics of this dataset. In the original
results, I thought uncached linear-fuzz and linear had similar results
(performance improvement from master). What do you expect with linear
vs linear-fuzz?

- Melanie




Re: Pre-proposal: unicode normalized text

2024-02-29 Thread Jeff Davis
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.

Attached is an implementation of a per-database option STRICT_UNICODE
which enforces the use of assigned code points only.

Not everyone would want to use it. There are lots of applications that
accept free-form text, and that may include recently-assigned code
points not yet recognized by Postgres.

But it would offer protection/stability for some databases. It makes it
possible to have a hard guarantee that Unicode normalization is
stable[1]. And it may also mitigate the risk of collation changes --
using unassigned code points carries a high risk that the collation
order changes as soon as the collation provider recognizes the
assignment. (Though assigned code points can change, too, so limiting
yourself to assigned code points is only a mitigation.)

I worry slightly that users will think at first that they want only
assigned code points, and then later figure out that the application
has increased in scope and now takes all kinds of free-form text. In
that case, the user can "ALTER DATABASE ... STRICT_UNICODE FALSE", and
follow up with some "CHECK (unicode_assigned(...))" constraints on the
particular fields that they'd like to protect.

There's some weirdness that the set of assigned code points as Postgres
sees it may not match what a collation provider sees due to differing
Unicode versions. That's not great -- perhaps we could check that code
points are considered assigned by *both* Postgres and ICU. I don't know
if there's a way to tell if libc considers a code point to be assigned.
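
A sketch of such a combined check (u_isdefined() is ICU's real API; the
Postgres-side helper name here is made up):

#include <unicode/uchar.h>

static bool
codepoint_assigned_everywhere(pg_wchar cp)
{
    /* Postgres-side check, per the Unicode version we ship tables for */
    if (!pg_unicode_assigned(cp))       /* hypothetical helper */
        return false;

#ifdef USE_ICU
    /* ICU-side check, per the Unicode version ICU was built with */
    if (!u_isdefined((UChar32) cp))
        return false;
#endif

    return true;
}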

Regards,
Jeff Davis

[1]
https://www.unicode.org/policies/stability_policy.html#Normalization

From 54a15ee4ac5d5f437f4d536d724e1fa9e535fd50 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 29 Feb 2024 13:13:58 -0800
Subject: [PATCH v1] CREATE DATABASE ... STRICT_UNICODE.

Introduce new per-database option STRICT_UNICODE, which causes
Postgres to reject any textual value containing unassigned code
points. (Surrogate halves were already rejected because they are
invalid for UTF-8.)

"Unassigned" means unassigned as of the version of Unicode that
Postgres is based on; that is, the version returned by the SQL
function unicode_version().

By rejecting unassigned code points, it helps stabilize the database
against semantic changes across Postgres versions resulting from
assignment of previously-unassigned code points. For instance, Unicode
normalization is only stable across Unicode versions when using
assigned code points.

New databases may use STRICT_UNICODE if the template also uses
STRICT_UNICODE, or if the template is template0. An existing database
may be altered to disable STRICT_UNICODE (and therefore allow
unassigned code points), but may not be altered to enable
STRICT_UNICODE (because existing values may contain unassigned code
points).

Discussion: https://postgr.es/m/f30b58657ceb71d5be032decf4058d454cc1df74.camel%40j-davis.com
---
 doc/src/sgml/ref/alter_database.sgml  | 33 ++
 doc/src/sgml/ref/create_database.sgml | 23 ++
 doc/src/sgml/ref/createdb.sgml| 23 ++
 doc/src/sgml/ref/initdb.sgml  | 23 ++
 src/backend/commands/dbcommands.c | 64 ---
 src/backend/utils/adt/oracle_compat.c | 16 +++
 src/backend/utils/adt/pg_locale.c |  3 ++
 src/backend/utils/adt/varlena.c   | 35 +++
 src/backend/utils/init/postinit.c |  2 +
 src/bin/initdb/initdb.c   | 21 +
 src/bin/pg_dump/pg_dump.c | 12 +
 src/bin/psql/describe.c   | 11 +
 src/bin/scripts/createdb.c| 15 +++
 src/include/catalog/pg_database.dat   |  1 +
 src/include/catalog/pg_database.h |  3 ++
 src/include/utils/pg_locale.h |  3 ++
 16 files changed, 281 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/alter_database.sgml b/doc/src/sgml/ref/alter_database.sgml
index 2479c41e8d..07e42dbdd4 100644
--- a/doc/src/sgml/ref/alter_database.sgml
+++ b/doc/src/sgml/ref/alter_database.sgml
@@ -25,6 +25,7 @@ ALTER DATABASE name [ [ WITH ] where option can be:
 
+STRICT_UNICODE strict_unicode
 ALLOW_CONNECTIONS allowconn
 CONNECTION LIMIT connlimit
 IS_TEMPLATE istemplate
@@ -112,6 +113,38 @@ ALTER DATABASE name RESET ALL
   
  
 
+  
+   strict_unicode
+   
+
+ If true, specifies that the initial databases will
+ reject Unicode code points that are unassigned as of the version of
+ Unicode returned by unicode_version() (See ). Only valid if the encoding is
+ UTF8.
+
+
+ This setting may be changed from true to
+ false to enable storing textual values containing
+ unassigned Unicode code 

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-29 Thread Melanie Plageman
On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra
 wrote:
>
>
>
> On 2/29/24 22:19, Melanie Plageman wrote:
> > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra
> >  wrote:
> >>
> >>
> >>
> >> On 2/29/24 00:40, Melanie Plageman wrote:
> >>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra
> >>>  wrote:
> 
> 
> 
>  On 2/28/24 21:06, Melanie Plageman wrote:
> > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra
> >  wrote:
> >>
> >> On 2/28/24 15:56, Tomas Vondra wrote:
>  ...
> >>>
> >>> Sure, I can do that. It'll take a couple hours to get the results, 
> >>> I'll
> >>> share them when I have them.
> >>>
> >>
> >> Here are the results with only patches 0001 - 0012 applied (i.e. 
> >> without
> >> the patch introducing the streaming read API, and the patch switching
> >> the bitmap heap scan to use it).
> >>
> >> The changes in performance don't disappear entirely, but the scale is
> >> certainly much smaller - both in the complete results for all runs, and
> >> for the "optimal" runs that would actually pick bitmapscan.
> >
> > Hmm. I'm trying to think how my refactor could have had this impact.
> > It seems like all the most notable regressions are with 4 parallel
> > workers. What do the numeric column labels mean across the top
> > (2,4,8,16...) -- are they related to "matches"? And if so, what does
> > that mean?
> >
> 
>  That's the number of distinct values matched by the query, which should
>  be an approximation of the number of matching rows. The number of
>  distinct values in the data set differs by data set, but for 1M rows
>  it's roughly like this:
> 
>  uniform: 10k
>  linear: 10k
>  cyclic: 100
> 
>  So for example matches=128 means ~1% of rows for uniform/linear, and
>  100% for cyclic data sets.
> >>>
> >>> Ah, thank you for the explanation. I also looked at your script after
> >>> having sent this email and saw that it is clear in your script what
> >>> "matches" is.
> >>>
>  As for the possible cause, I think it's clear most of the difference
>  comes from the last patch that actually switches bitmap heap scan to the
>  streaming read API. That's mostly expected/understandable, although we
>  probably need to look into the regressions or cases with e_i_c=0.
> >>>
> >>> Right, I'm mostly surprised about the regressions for patches 0001-0012.
> >>>
> >>> Re eic 0: Thomas Munro and I chatted off-list, and you bring up a
> >>> great point about eic 0. In old bitmapheapscan code eic 0 basically
> >>> disabled prefetching but with the streaming read API, it will still
> >>> issue fadvises when eic is 0. That is an easy one line fix. Thomas
> >>> prefers to fix it by always avoiding an fadvise for the last buffer in
> >>> a range before issuing a read (since we are about to read it anyway,
> >>> best not fadvise it too). This will fix eic 0 and also cut one system
> >>> call from each invocation of the streaming read machinery.
> >>>
>  To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for
>  individual patches. I can try doing that tomorrow. It'll have to be a
>  limited set of tests, to reduce the time, but might tell us whether it's
>  due to a single patch or multiple patches.
> >>>
> >>> Yes, tomorrow I planned to start trying to repro some of the "red"
> >>> cases myself. Any one of the commits could cause a slight regression
> >>> but a 3.5x regression is quite surprising, so I might focus on trying
> >>> to repro that locally and then narrow down which patch causes it.
> >>>
> >>> For the non-cached regressions, perhaps the commit to use the correct
> >>> recheck flag (0004) when prefetching could be the culprit. And for the
> >>> cached regressions, my money is on the commit which changes the whole
> >>> control flow of BitmapHeapNext() and the next_block() and next_tuple()
> >>> functions (0010).
> >>>
> >>
> >> I do have some partial results, comparing the patches. I only ran one of
> >> the more affected workloads (cyclic) on the xeon, attached is a PDF
> >> comparing master and the 0001-0014 patches. The percentages are timing
> >> vs. the preceding patch (green - faster, red - slower).
> >
> > Just confirming: the results are for uncached?
> >
>
> Yes, cyclic data set, uncached case. I picked this because it seemed
> like one of the most affected cases. Do you want me to test some other
> cases too?

So, I actually may have found the source of at least part of the
regression with 0010. I was able to reproduce the regression with
patch 0010 applied for the uncached case with 4 workers and eic 8 and
1 rows for the cyclic dataset. I see it for all numbers of
matches. The regression went away (for this specific example) when I
moved the BitmapAdjustPrefetchIterator call back up to before the call
to table_scan_bitmap_next_block() like this:

diff --git 
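
The diff is cut off in the archive here; the reordering I describe is
roughly this (a sketch against master's shape of BitmapHeapNext(), not the
actual diff):

while ((tbmres = tbm_iterate(tbmiterator)) != NULL)
{
    /* adjust the prefetch iterator *before* fetching the block... */
    BitmapAdjustPrefetchIterator(node, tbmres);

    /* ...rather than after, as the 0010 refactor had it */
    if (!table_scan_bitmap_next_block(scan, tbmres))
        continue;

    /* ... prefetch and return tuples ... */
}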

Showing applied extended statistics in explain Part 2

2024-02-29 Thread Tatsuro Yamada
Hi,

The original patch, written by Tomas, improves the usability of extended
statistics, so I rebased it on 362de947 and would like to restart developing it.

The previous thread [1] raised some issues to solve. I'll try to address them
as best I can, but I think this feature is worthwhile even with some
limitations. Please find the attached file.

[1] 
https://www.postgresql.org/message-id/flat/8081617b-d80f-ae2b-b79f-ea7e926f9fcf%40enterprisedb.com

Regards,
Tatsuro Yamada
NTT Open Source Software Center



0001-show-stats-in-explain-rebased-on-362de947.patch


Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-02-29 Thread Jacob Champion
On Thu, Feb 29, 2024 at 1:08 PM Daniel Gustafsson  wrote:
> +   /* TODO */
> +   CHECK_SETOPT(actx, CURLOPT_WRITEDATA, stderr);
> I might be missing something, but what this is intended for in
> setup_curl_handles()?

Ah, that's cruft left over from early debugging, just so that I could
see what was going on. I'll remove it.

> --- /dev/null
> +++ b/src/interfaces/libpq/fe-auth-oauth-iddawc.c
> As discussed off-list I think we should leave iddawc support for later and
> focus on getting one library properly supported to start with.  If you agree,
> let's drop this from the patchset to make it easier to digest.  We should make
> sure we keep pluggability such that another library can be supported though,
> much like the libpq TLS support.

Agreed. The number of changes being folded into the next set is
already pretty big so I think this will wait until next+1.

--Jacob




Avoiding inadvertent debugging mode for pgbench

2024-02-29 Thread Greg Sabino Mullane
Attached please find a patch to adjust the behavior of the pgbench program
and make it behave like the other programs that connect to a database
(namely, psql and pg_dump). Specifically, add support for using -d and
--dbname to specify the name of the database. This means that -d can no
longer be used to turn on debugging mode, and the long option --debug must
be used instead.

This removes a long-standing footgun, in which people assume that the -d
option behaves the same as in other programs. Indeed, because it takes no
arguments, and because the first non-option argument is the database name,
it still appears to work. However, people then wonder why pgbench is so
darn verbose all the time! :)

This is a breaking change, but fixing it this way seems to have the least
total impact, as the number of people using the debug mode of pgbench is
likely quite small. Further, those already using the long option are
unaffected, and those using the short one simply need to replace '-d' with
'--debug', arguably making their scripts a little more self-documenting in
the process.
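
The heart of the change is a small swap in pgbench.c's option handling,
roughly like this (a sketch, not the patch itself; variable names assumed):

/* long_options[] gains a long-only entry for debug, for example: */
{"debug", no_argument, NULL, 10},   /* the value 10 is hypothetical */

/* and in the getopt_long() loop: */
case 'd':
    dbName = pg_strdup(optarg);     /* previously: debug++ */
    break;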

Cheers,
Greg


pgbench.dash.d.or.not.dash.d.v1.patch
Description: Binary data


Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-02-29 Thread Jacob Champion
On Wed, Feb 28, 2024 at 9:40 AM Daniel Gustafsson  wrote:
> +#define ALLOC(size) malloc(size)
> I wonder if we should use pg_malloc_extended(size, MCXT_ALLOC_NO_OOM) instead
> to self document the code.  We clearly don't want feature-parity with server-
> side palloc here.  I know we use malloc in similar ALLOC macros so it's not
> unique in that regard, but maybe?

I have a vague recollection that linking fe_memutils into libpq
tripped the exit() checks, but I can try again and see.

> +#ifdef FRONTEND
> +   destroyPQExpBuffer(lex->errormsg);
> +#else
> +   pfree(lex->errormsg->data);
> +   pfree(lex->errormsg);
> +#endif
> Wouldn't it be nicer if we abstracted this into a destroyStrVal function to a)
> avoid the ifdefs and b) make it more like the rest of the new API?  While it's
> only used in two places (close to each other) it's a shame to let the
> underlying API bleed through the abstraction.

Good idea. I'll fold this from your patch into the next set (and do
the same for the ones I've marked +1 below).
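
Something like this, I imagine (a sketch, with StrValType standing in for
whatever frontend/backend-common typedef we pick):

static void
destroyStrVal(StrValType *str)
{
#ifdef FRONTEND
    destroyPQExpBuffer(str);
#else
    pfree(str->data);
    pfree(str);
#endif
}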

> +   CURLM  *curlm;  /* top-level multi handle for cURL operations */
> Nitpick, but curl is not capitalized cURL anymore (for some value of "anymore"
> since it changed in 2016 [0]).  I do wonder if we should consistently write
> "libcurl" as well since we don't use curl but libcurl.

Huh, I missed that memo. Thanks -- that makes it much easier to type!

> +   PQExpBufferData work_data;  /* scratch buffer for general use 
> (remember
> +  to clear out prior contents first!) */
> This seems like asking for subtle bugs due to uncleared buffers bleeding into
> another operation (especially since we are writing this data across the wire).
> How about having an array the size of OAuthStep of unallocated buffers where
> each step uses its own?  Storing the content of each step could also be useful
> for debugging.  Looking at the state machine here it's not an obvious change
> but also not impossible.

I like that idea; I'll give it a look.

> + * TODO: This disables DNS resolution timeouts unless libcurl has been
> + * compiled against alternative resolution support. We should check that.
> curl_version_info() can be used to check for c-ares support.

+1

> + * so you don't have to write out the error handling every time. They assume
> + * that they're embedded in a function returning bool, however.
> It feels a bit iffy to encode the return type in the macro; we can use the same
> trick that DISABLE_SIGPIPE employs where a failaction is passed in.

+1

> +  if (!strcmp(name, field->name))
> Project style is to test for (strcmp(x,y) == 0) rather than (!strcmp()) to
> improve readability.

+1

> +  libpq_append_conn_error(conn, "out of memory");
> While not introduced in this patch, it's not an ideal pattern to report "out 
> of
> memory" errors via a function which may allocate memory.

Does trying (and failing) to allocate more memory cause any harm? Best
case, we still have enough room in the errorMessage to fit the whole
error; worst case, we mark the errorMessage broken and then
PQerrorMessage() can handle it correctly.

> +  appendPQExpBufferStr(&conn->errorMessage,
> +   libpq_gettext("server's error message contained an embedded
> NULL"));
> We should maybe add ", discarding" or something similar after this string to
> indicate that there was an actual error which has been thrown away, the error
> wasn't that the server passed an embedded NULL.

+1

> +#ifdef USE_OAUTH
> +   else if (strcmp(mechanism_buf.data, OAUTHBEARER_NAME) == 0 &&
> +   !selected_mechanism)
> I wonder if we instead should move the guards inside the statement and error
> out with "not built with OAuth support" or something similar like how we do
> with TLS and other optional components?

This one seems like a step backwards. IIUC, the client can currently
handle a situation where the server returns multiple mechanisms
(though the server doesn't support that yet), and I'd really like to
make use of that property without making users upgrade libpq.

That said, it'd be good to have a more specific error message in the
case where we don't have a match...

> +   errdetail("Comma expected, but found character %s.",
> + sanitize_char(*p))));
> The %s formatter should be wrapped like '%s' to indicate that the message part
> is the character in question (and we can then reuse the translation since the
> error message already exist for SCRAM).

+1

> +   temp = curl_slist_append(temp, "authorization_code");
> +   if (!temp)
> +   oom = true;
> +
> +   temp = curl_slist_append(temp, "implicit");
> While not a bug per se, it reads a bit odd to call another operation that can
> allocate memory when the oom flag has been set.  I think we can move some
> things around a little to make it clearer.

I'm not a huge fan of nested happy paths/pyramids of doom, but in this
case it's small enough that I'm not 

Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost
Greetings,

On Thu, Feb 29, 2024 at 17:48 Corey Huinker  wrote:

> Having looked through this thread and discussed a bit with Corey
>> off-line, the approach that Tom laid out up-thread seems like it would
>> make the most sense overall- that is, eliminate the JSON bits and the
>> SPI and instead export the stats data by running queries from the new
>> version of pg_dump/server (in the FDW case) against the old server
>> with the intelligence of how to transform the data into the format
>> needed for the current pg_dump/server to accept, through function calls
>> where the function calls generally map up to the rows/information being
>> updated- a call to update the information in pg_class for each relation
>> and then a call for each attribute to update the information in
>> pg_statistic.
>>
>
> Thanks for the excellent summary of our conversation, though I do add that
> we discussed a problem with per-attribute functions: each function would be
> acquiring locks on both the relation (so it doesn't go away) and
> pg_statistic, and that lock thrashing would add up. Whether that overhead
> is judged significant or not is up for discussion. If it is significant, it
> makes sense to package up all the attributes into one call, passing in an
> array of some new pg_statistic-esque special type... the very issue that
> sent me down the JSON path.
>
> I certainly see the flexibility in having per-attribute functions, but
> am concerned about non-binary-upgrade situations where the attnums won't
> line up, and if we're passing them by name then the function has to dig around
> looking for the right matching attnum, and that's overhead too. In the
> whole-table approach, we just iterate over the attributes that exist, and
> find the matching parameter row.
>

That's certainly a fair point and my initial reaction (which could
certainly be wrong) is that it's unlikely to be an issue, but also, if you
feel you could make it work with an array and passing all the attribute
info in with one call, which I suspect would be possible but just a bit
more complex to build, then sure, go for it. If it ends up being overly
unwieldy then perhaps the per-attribute call would be better and we could
perhaps acquire the lock before the function calls..?  Doing a check to see
if we have already locked it would be cheaper than trying to acquire a new
lock, I'm fairly sure.

Also per our prior discussion, this makes sense to include in the post-data
section, imv, and also because then we have the indexes we may wish to load
stats for, but further that also means it'll be in the parallelizable part
of the process, making me a bit less concerned overall about the individual
timing.

Thanks!

Stephen

>


Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi 
> wrote:
>> In the first place, it's important to note that we do not guarantee
>> that an async standby can always switch its replication connection to
>> the old primary or another sibling standby. This is due to the
>> variations in replication lag among standbys. pg_rewind is required to
>> adjust such discrepancies.
> 
> Sure, I know. But in this case the async standby received and flushed
> absolutely the same amount of WAL as the promoted one.

Ugh.  If it can happen transparently to the user without the user
knowing directly about it, that does not sound good to me.  I did not
look very closely at monitoring tools available out there, but if both
standbys flushed the same WAL locations a rewind should not be
required.  It is not something that monitoring tools would be able to
detect because they just look at LSNs.

>> In the repro script, the replication connection of the second standby
>> is switched from the old primary to the first standby after its
>> promotion. After the switching, replication is expected to continue
>> from the beginning of the last replayed segment.
> 
> Well, maybe, but apparently the standby is busy trying to decode a record
> that spans multiple pages, and it is just infinitely waiting for the next
> page to arrive. Also, the restart "fixes" the problem, because indeed it is
> reading the file from the beginning.

What happens if the continuation record spans multiple segment
file boundaries in this case?  We would go back to the beginning of
the segment where the record spanning multiple segments began,
right?  (I may recall this part of contrecords incorrectly, feel free
to correct me if necessary.)

>> But with the script,
>> the second standby copies the intentionally broken file, which differs
>> from the data that should be received via streaming.
> 
> As I already said, this is a simple way to emulate the primary crash while
> standbys receiving WAL.
> It could easily happen that the record spans on multiple pages is not fully
> received and flushed.

I think that's OK to do so at test level to force a test in the
backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
proved that this can be designed to be cheap and stable across the
buildfarm fleet.

For anything like that, the result had better have solid test
coverage, where perhaps we'd better refactor some of the routines of
039_end_of_wal.pl into a module to use them here, rather than
duplicate the code.  The other test has a few assumptions with the
calculation of page boundaries, and I'd rather not duplicate that
across the tree.

Seeing that Alexander is a maintainer of Patroni, which is very
probably used by his employer across a large set of PostgreSQL
instances, if he says that he's seen that in the field, that's good
enough for me to respond to the problem, especially if reconnecting a
standby to a promoted node where both flushed the same LSN.  Now the
level of response also depends on the invasiveness of the change, and we
need a very careful evaluation here.
--
Michael


signature.asc
Description: PGP signature


Re: DOCS: Avoid using abbreviation "aka"

2024-02-29 Thread Michael Paquier
On Thu, Feb 29, 2024 at 02:42:08PM +0530, Amit Kapila wrote:
> +1. LGTM as well.

This has been introduced by ddd5f4f54a02, so if you wish to fix it
yourself, please feel free.  If you'd prefer that I take care of it,
I'm OK to do so as well.
--
Michael


signature.asc
Description: PGP signature


Re: Statistics Import and Export

2024-02-29 Thread Corey Huinker
>
>
>
> Having looked through this thread and discussed a bit with Corey
> off-line, the approach that Tom laid out up-thread seems like it would
> make the most sense overall- that is, eliminate the JSON bits and the
> SPI and instead export the stats data by running queries from the new
> version of pg_dump/server (in the FDW case) against the old server
> with the intelligence of how to transform the data into the format
> needed for the current pg_dump/server to accept, through function calls
> where the function calls generally map up to the rows/information being
> updated- a call to update the information in pg_class for each relation
> and then a call for each attribute to update the information in
> pg_statistic.
>

Thanks for the excellent summary of our conversation, though I do add that
we discussed a problem with per-attribute functions: each function would be
acquiring locks on both the relation (so it doesn't go away) and
pg_statistic, and that lock thrashing would add up. Whether that overhead
is judged significant or not is up for discussion. If it is significant, it
makes sense to package up all the attributes into one call, passing in an
array of some new pg_statistic-esque special type... the very issue that
sent me down the JSON path.

I certainly see the flexibility in having per-attribute functions, but am
concerned about non-binary-upgrade situations where the attnums won't line
up, and if we're passing them by name then the function has to dig around
looking for the right matching attnum, and that's overhead too. In the
whole-table approach, we just iterate over the attributes that exist, and
find the matching parameter row.


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-29 Thread Tomas Vondra



On 2/29/24 22:19, Melanie Plageman wrote:
> On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra
>  wrote:
>>
>>
>>
>> On 2/29/24 00:40, Melanie Plageman wrote:
>>> On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra
>>>  wrote:



 On 2/28/24 21:06, Melanie Plageman wrote:
> On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra
>  wrote:
>>
>> On 2/28/24 15:56, Tomas Vondra wrote:
 ...
>>>
>>> Sure, I can do that. It'll take a couple hours to get the results, I'll
>>> share them when I have them.
>>>
>>
>> Here are the results with only patches 0001 - 0012 applied (i.e. without
>> the patch introducing the streaming read API, and the patch switching
>> the bitmap heap scan to use it).
>>
>> The changes in performance don't disappear entirely, but the scale is
>> certainly much smaller - both in the complete results for all runs, and
>> for the "optimal" runs that would actually pick bitmapscan.
>
> Hmm. I'm trying to think how my refactor could have had this impact.
> It seems like all the most notable regressions are with 4 parallel
> workers. What do the numeric column labels mean across the top
> (2,4,8,16...) -- are they related to "matches"? And if so, what does
> that mean?
>

 That's the number of distinct values matched by the query, which should
 be an approximation of the number of matching rows. The number of
 distinct values in the data set differs by data set, but for 1M rows
 it's roughly like this:

 uniform: 10k
 linear: 10k
 cyclic: 100

 So for example matches=128 means ~1% of rows for uniform/linear, and
 100% for cyclic data sets.
>>>
>>> Ah, thank you for the explanation. I also looked at your script after
>>> having sent this email and saw that it is clear in your script what
>>> "matches" is.
>>>
 As for the possible cause, I think it's clear most of the difference
 comes from the last patch that actually switches bitmap heap scan to the
 streaming read API. That's mostly expected/understandable, although we
 probably need to look into the regressions or cases with e_i_c=0.
>>>
>>> Right, I'm mostly surprised about the regressions for patches 0001-0012.
>>>
>>> Re eic 0: Thomas Munro and I chatted off-list, and you bring up a
>>> great point about eic 0. In old bitmapheapscan code eic 0 basically
>>> disabled prefetching but with the streaming read API, it will still
>>> issue fadvises when eic is 0. That is an easy one line fix. Thomas
>>> prefers to fix it by always avoiding an fadvise for the last buffer in
>>> a range before issuing a read (since we are about to read it anyway,
>>> best not fadvise it too). This will fix eic 0 and also cut one system
>>> call from each invocation of the streaming read machinery.
>>>
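For reference, the "skip the advice for the block we are about to read" idea described above could look roughly like the sketch below; none of these names come from the actual streaming read patch, and the raw posix_fadvise() call is only a stand-in for the real I/O path:

#include <fcntl.h>

#define BLCKSZ 8192

/*
 * Issue read-ahead advice for blocks in [first, last), but never for
 * 'last' itself, since the caller is about to read that block
 * synchronously anyway.  With effective_io_concurrency = 0 the range
 * collapses to the single block being read, so no advice is issued.
 */
static void
prefetch_range_except_last(int fd, unsigned first, unsigned last)
{
	for (unsigned blk = first; blk < last; blk++)
		(void) posix_fadvise(fd, (off_t) blk * BLCKSZ, BLCKSZ,
							 POSIX_FADV_WILLNEED);
	/* block 'last' is read immediately by the caller, no advice for it */
}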
 To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for
 individual patches. I can try doing that tomorrow. It'll have to be a
 limited set of tests, to reduce the time, but might tell us whether it's
 due to a single patch or multiple patches.
>>>
>>> Yes, tomorrow I planned to start trying to repro some of the "red"
>>> cases myself. Any one of the commits could cause a slight regression
>>> but a 3.5x regression is quite surprising, so I might focus on trying
>>> to repro that locally and then narrow down which patch causes it.
>>>
>>> For the non-cached regressions, perhaps the commit to use the correct
>>> recheck flag (0004) when prefetching could be the culprit. And for the
>>> cached regressions, my money is on the commit which changes the whole
>>> control flow of BitmapHeapNext() and the next_block() and next_tuple()
>>> functions (0010).
>>>
>>
>> I do have some partial results, comparing the patches. I only ran one of
>> the more affected workloads (cyclic) on the xeon, attached is a PDF
>> comparing master and the 0001-0014 patches. The percentages are timing
>> vs. the preceding patch (green - faster, red - slower).
> 
> Just confirming: the results are for uncached?
> 

Yes, cyclic data set, uncached case. I picked this because it seemed
like one of the most affected cases. Do you want me to test some other
cases too?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Psql meta-command conninfo+

2024-02-29 Thread Nathan Bossart
On Thu, Feb 29, 2024 at 10:02:21PM +, Maiquel Grassi wrote:
> Sorry for the delay. I will make the adjustments as requested soon.

Looking forward to it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: SQL:2011 application time

2024-02-29 Thread Paul Jungwirth

On 2/13/24 21:00, jian he wrote:

> Hi
> more minor issues.
> 
> + FindFKComparisonOperators(
> + fkconstraint, tab, i, fkattnum,
> + _check_ok, _pfeqop_item,
> + pktypoid[i], fktypoid[i], opclasses[i],
> + is_temporal, false,
> + [i], [i], [i]);
> + }
> + if (is_temporal) {
> + pkattnum[numpks] = pkperiodattnum;
> + pktypoid[numpks] = pkperiodtypoid;
> + fkattnum[numpks] = fkperiodattnum;
> + fktypoid[numpks] = fkperiodtypoid;
> 
> - pfeqop = get_opfamily_member(opfamily, opcintype, fktyped,
> - eqstrategy);
> - if (OidIsValid(pfeqop))
> - {
> - pfeqop_right = fktyped;
> - ffeqop = get_opfamily_member(opfamily, fktyped, fktyped,
> - eqstrategy);
> - }
> - else
> - {
> - /* keep compiler quiet */
> - pfeqop_right = InvalidOid;
> - ffeqop = InvalidOid;
> - }
> + FindFKComparisonOperators(
> + fkconstraint, tab, numpks, fkattnum,
> + _check_ok, _pfeqop_item,
> + pkperiodtypoid, fkperiodtypoid, opclasses[numpks],
> + is_temporal, true,
> + [numpks], [numpks], [numpks]);
> + numfks += 1;
> + numpks += 1;
> + }
> 
> opening curly brace should be the next line,


Fixed in v25 (submitted in my other email).


> also do you think it's
> good idea to add following in the `if (is_temporal)` branch
> `
> Assert(OidIsValid(fkperiodtypoid) && OidIsValid(pkperiodtypoid));
> Assert(OidIsValid(pkperiodattnum > 0 && fkperiodattnum > 0));
> `
> 
> ` if (is_temporal)` branch, you can set the FindFKComparisonOperators
> 10th argument (is_temporal)
> to true, since you are already in the ` if (is_temporal)` branch.
> 
> maybe we need some extra comments on
> `
> + numfks += 1;
> + numpks += 1;
> `
> since it might not be that evident?


That branch doesn't exist anymore. Same with the increments.


> Do you think it's a good idea to list arguments line by line (with
> good indentation) is good format? like:
> FindFKComparisonOperators(fkconstraint,
> tab,
> i,
> fkattnum,
> _check_ok,
> _pfeqop_item,
> pktypoid[i],
> fktypoid[i],
> opclasses[i],
> false,
> false,
> [i],
> [i],
> [i]);


There are places we do that, but most code I've seen tries to fill the line. I haven't followed that 
strictly here, but I'm trying to get better at doing what pg_indent wants.


Yours,

--
Paul  ~{:-)
p...@illuminatedcomputing.com




RE: Psql meta-command conninfo+

2024-02-29 Thread Maiquel Grassi
On Sat, Feb 17, 2024 at 02:53:43PM +, Maiquel Grassi wrote:
>> The "pg_stat_ssl" view is available from >= PG 9.5, and the "pg_stat_gssapi" 
>> view is
>> available from >= PG 12. The "compression" column was removed from the
>> "pg_stat_ssl" from >= PG 14. All of these cases introduce greater complexity 
>> in
>> maintaining the SQL. The central idea from the beginning has always been to 
>> show
>> the user all the information from \conninfo and extend it in \conninfo+.

>IMHO we should use the views whenever possible (for the reason stated
>above [0]).  I think it's okay if we need to fall back to a different
>approach for older versions.  But presumably we'll discontinue psql support
>for these old server versions at some point, at which point we can simply
>delete the dead code that doesn't use the views.

>> The absence
>> of the "compression" column in version 14 and above makes dealing with this 
>> even
>> more complicated, and not showing it seems to contradict \conninfo.

>I would be okay with using PQsslAttribute() for all versions for this one
>since any remaining support for this feature is on its way out.  Once psql
>no longer supports any versions that allow SSL compression, we could
>probably remove it from \conninfo[+] completely or hard-code it to "off".

>> SSL support has been available since version 7.1 (see documentation); if 
>> there was
>> support before that, I can't say. In this regard, it may seem strange, but 
>> there are still
>> many legacy systems running older versions of PostgreSQL. Just yesterday, I 
>> assisted
>> a client who is still using PG 8.2. In these cases, using the "pg_stat_ssl" 
>> and
>> "pg_stat_gssapi" views would not be possible because they don't exist on the 
>> server.
>> I believe that psql should cover as many cases as possible when it comes to 
>> compatibility
>> with older versions (even those no longer supported). In this case, 
>> concerning SSL and
>> GSS, I think libpq is better prepared to handle this.

>At the moment, the psql support cutoff appears to be v9.2 (see commit
>cf0cab8), which has been out of support for over 6 years.  If \conninfo+
>happens to work for older versions, then great, but I don't think we should
>expend too much energy in this area.

[0] https://postgr.es/m/20240216155449.GA1236054%40nathanxps13

//

Hi Nathan!

Sorry for the delay. I will make the adjustments as requested soon.

Regards,
Maiquel Grassi.


Commitfest Manager for March

2024-02-29 Thread Daniel Gustafsson
We are now hours away from starting the last commitfest for v17 and AFAICS
there have been no volunteers for the position of Commitfest manager (cfm) yet.
As per usual it's likely beneficial if the CFM of the last CF before freeze is
someone with a seasoned eye to what can make it and what can't, but the
important part is that we get someone with the time and energy to invest.

Anyone interested?

--
Daniel Gustafsson





Make query cancellation keys longer

2024-02-29 Thread Heikki Linnakangas
Currently, the cancel request key is a 32-bit token, which isn't very much 
entropy. If you want to cancel another session's query, you can 
brute-force it. In most environments, an unauthorized cancellation of a 
query isn't very serious, but it nevertheless would be nice to have more 
protection from it. The attached patch makes it longer. It is an 
optional protocol feature, so it's fully backwards-compatible with 
clients that don't support longer keys.


If the client requests the "_pq_.extended_query_cancel" protocol 
feature, the server will generate a longer 256-bit cancellation key. 
However, the new longer key length is no longer hardcoded in the 
protocol. The client is expected to deal with variable length keys, up 
to some reasonable upper limit (TODO: document the maximum). This 
flexibility allows e.g. a connection pooler to add more information to 
the cancel key, which could be useful. If the client doesn't request the 
protocol feature, the server generates a 32-bit key like before.


One complication with this was that because we no longer know how long 
the key should be, 4 bytes or something longer, until the backend has 
performed the protocol negotiation, we cannot generate the key in the 
postmaster before forking the process anymore. The first patch here 
changes things so that the cancellation key is generated later, in the 
backend, and the backend advertises the key in the PMSignalState array. 
This is similar to how this has always worked in EXEC_BACKEND mode with 
the ShmemBackendArray, but instead of having a separate array, I added 
fields to the PMSignalState slots. This removes a bunch of 
EXEC_BACKEND-specific code, which is nice.
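As a rough picture, each per-child slot would then carry something like the following; the struct and field names here are guesses for illustration, not the patch's actual definitions:

#include "postgres.h"
#include "port/atomics.h"

#define MAX_CANCEL_KEY_LENGTH	32	/* 256 bits; assumed upper limit */

typedef struct ChildSlotData
{
	pg_atomic_uint32 state;		/* slot state, now an atomic */
	pid_t		pid;			/* child process id */
	int			cancel_key_len; /* 4 for old-protocol clients */
	uint8		cancel_key[MAX_CANCEL_KEY_LENGTH];
} ChildSlotData;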


Any thoughts on this? Documentation is still missing, and there's one 
TODO on adding a portable time-constant memcmp() function; I'll add 
those if there's agreement on this otherwise.
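For what it's worth, a portable time-constant comparison is usually written along these lines; this is only a sketch, and the helper name is invented here:

#include <stdbool.h>
#include <stddef.h>

/*
 * Compare two equal-length buffers in time that depends only on len,
 * not on where the first differing byte is: OR all byte differences
 * together instead of returning early.
 */
static bool
timingsafe_memequal(const void *a, const void *b, size_t len)
{
	const unsigned char *ua = a;
	const unsigned char *ub = b;
	unsigned char diff = 0;

	for (size_t i = 0; i < len; i++)
		diff |= ua[i] ^ ub[i];

	return diff == 0;
}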


--
Heikki Linnakangas
Neon (https://neon.tech)From f871e915fee08b3cc27e732646f3a62ec1c3024b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Thu, 29 Feb 2024 23:20:52 +0200
Subject: [PATCH 1/2] Move cancel key generation to after forking the backend

Move responsibility of generating the cancel key to the backend
process. The cancel key is now generated after forking, and the
backend advertises it in the PMSignalState array. When a cancel
request arrives, the backend handling it scans the PMSignalState array
to find the target pid and cancel key. This is similar to how this
previously worked in the EXEC_BACKEND case with the ShmemBackendArray,
but instead of having the separate array, we store the cancel keys in
PMSignalState array, and use it also in the !EXEC_BACKEND case.

While we're at it, switch to using atomics in pmsignal.c for the state
field. That feels easier to reason about than volatile
pointers. (TODO: also switch to pg_atomic_flag for PMSignalFlags
array. Weirdly we don't have "pg_atomic_test_clear_flag" nor
"pg_atomic_set_flag". Those would be the natural opertions for
PMSignalFlags. Maybe we should add those, but I didn't want to go that
deep down the rabbit hole in this patch.)

This is needed by the next patch, so that we can generate the cancel
key later, after negotiating the protocol version with the client.
---
 src/backend/postmaster/postmaster.c | 185 +---
 src/backend/storage/ipc/ipci.c  |  11 --
 src/backend/storage/ipc/pmsignal.c  | 124 ++-
 src/backend/storage/lmgr/proc.c |  16 ++-
 src/include/postmaster/postmaster.h |   3 -
 src/include/storage/pmsignal.h  |   3 +-
 src/tools/pgindent/typedefs.list|   1 +
 7 files changed, 116 insertions(+), 227 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index da0c627107e..408f8fb9b6d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -177,7 +177,6 @@
 typedef struct bkend
 {
 	pid_t		pid;			/* process id of backend */
-	int32		cancel_key;		/* cancel key for cancels for this backend */
 	int			child_slot;		/* PMChildSlot for this backend, if any */
 	int			bkend_type;		/* child process flavor, see above */
 	bool		dead_end;		/* is it going to send an error and quit? */
@@ -187,10 +186,6 @@ typedef struct bkend
 
 static dlist_head BackendList = DLIST_STATIC_INIT(BackendList);
 
-#ifdef EXEC_BACKEND
-static Backend *ShmemBackendArray;
-#endif
-
 BackgroundWorker *MyBgworkerEntry = NULL;
 
 
@@ -430,7 +425,6 @@ static void SendNegotiateProtocolVersion(List *unrecognized_protocol_options);
 static void processCancelRequest(Port *port, void *pkt);
 static void report_fork_failure_to_client(Port *port, int errnum);
 static CAC_state canAcceptConnections(int backend_type);
-static bool RandomCancelKey(int32 *cancel_key);
 static void signal_child(pid_t pid, int signal);
 static void sigquit_child(pid_t pid);
 static bool SignalSomeChildren(int signal, int target);
@@ -517,7 +511,6 @@ typedef struct
 #endif
 	void	   *UsedShmemSegAddr;
 	

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-02-29 Thread Melanie Plageman
On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra
 wrote:
>
>
>
> On 2/29/24 00:40, Melanie Plageman wrote:
> > On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra
> >  wrote:
> >>
> >>
> >>
> >> On 2/28/24 21:06, Melanie Plageman wrote:
> >>> On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra
> >>>  wrote:
> 
>  On 2/28/24 15:56, Tomas Vondra wrote:
> >> ...
> >
> > Sure, I can do that. It'll take a couple hours to get the results, I'll
> > share them when I have them.
> >
> 
>  Here are the results with only patches 0001 - 0012 applied (i.e. without
>  the patch introducing the streaming read API, and the patch switching
>  the bitmap heap scan to use it).
> 
>  The changes in performance don't disappear entirely, but the scale is
>  certainly much smaller - both in the complete results for all runs, and
>  for the "optimal" runs that would actually pick bitmapscan.
> >>>
> >>> Hmm. I'm trying to think how my refactor could have had this impact.
> >>> It seems like all the most notable regressions are with 4 parallel
> >>> workers. What do the numeric column labels mean across the top
> >>> (2,4,8,16...) -- are they related to "matches"? And if so, what does
> >>> that mean?
> >>>
> >>
> >> That's the number of distinct values matched by the query, which should
> >> be an approximation of the number of matching rows. The number of
> >> distinct values in the data set differs by data set, but for 1M rows
> >> it's roughly like this:
> >>
> >> uniform: 10k
> >> linear: 10k
> >> cyclic: 100
> >>
> >> So for example matches=128 means ~1% of rows for uniform/linear, and
> >> 100% for cyclic data sets.
> >
> > Ah, thank you for the explanation. I also looked at your script after
> > having sent this email and saw that it is clear in your script what
> > "matches" is.
> >
> >> As for the possible cause, I think it's clear most of the difference
> >> comes from the last patch that actually switches bitmap heap scan to the
> >> streaming read API. That's mostly expected/understandable, although we
> >> probably need to look into the regressions or cases with e_i_c=0.
> >
> > Right, I'm mostly surprised about the regressions for patches 0001-0012.
> >
> > Re eic 0: Thomas Munro and I chatted off-list, and you bring up a
> > great point about eic 0. In old bitmapheapscan code eic 0 basically
> > disabled prefetching but with the streaming read API, it will still
> > issue fadvises when eic is 0. That is an easy one line fix. Thomas
> > prefers to fix it by always avoiding an fadvise for the last buffer in
> > a range before issuing a read (since we are about to read it anyway,
> > best not fadvise it too). This will fix eic 0 and also cut one system
> > call from each invocation of the streaming read machinery.
> >
> >> To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for
> >> individual patches. I can try doing that tomorrow. It'll have to be a
> >> limited set of tests, to reduce the time, but might tell us whether it's
> >> due to a single patch or multiple patches.
> >
> > Yes, tomorrow I planned to start trying to repro some of the "red"
> > cases myself. Any one of the commits could cause a slight regression
> > but a 3.5x regression is quite surprising, so I might focus on trying
> > to repro that locally and then narrow down which patch causes it.
> >
> > For the non-cached regressions, perhaps the commit to use the correct
> > recheck flag (0004) when prefetching could be the culprit. And for the
> > cached regressions, my money is on the commit which changes the whole
> > control flow of BitmapHeapNext() and the next_block() and next_tuple()
> > functions (0010).
> >
>
> I do have some partial results, comparing the patches. I only ran one of
> the more affected workloads (cyclic) on the xeon, attached is a PDF
> comparing master and the 0001-0014 patches. The percentages are timing
> vs. the preceding patch (green - faster, red - slower).

Just confirming: the results are for uncached?

- Melanie




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-02-29 Thread Daniel Gustafsson
> On 27 Feb 2024, at 20:20, Jacob Champion  
> wrote:
> 
> On Fri, Feb 23, 2024 at 5:01 PM Jacob Champion
>  wrote:
>> The
>> patchset is now carrying a lot of squash-cruft, and I plan to flatten
>> it in the next version.
> 
> This is done in v17, which is also now based on the two patches pulled
> out by Daniel in [1]. Besides the squashes, which make up most of the
> range-diff, I've fixed a call to strncasecmp() which is not available
> on Windows.

Two quick questions:

+   /* TODO */
+   CHECK_SETOPT(actx, CURLOPT_WRITEDATA, stderr);
I might be missing something, but what this is intended for in
setup_curl_handles()?


--- /dev/null
+++ b/src/interfaces/libpq/fe-auth-oauth-iddawc.c
As discussed off-list I think we should leave iddawc support for later and
focus on getting one library properly supported to start with.  If you agree,
let's drop this from the patchset to make it easier to digest.  We should make
sure we keep pluggability such that another library can be supported though,
much like the libpq TLS support.

--
Daniel Gustafsson





Re: Atomic ops for unlogged LSN

2024-02-29 Thread Nathan Bossart
Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Statistics Import and Export

2024-02-29 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> On Thu, Feb 15, 2024 at 4:09 AM Corey Huinker 
> wrote:
> > Posting v5 updates of pg_import_rel_stats() and pg_import_ext_stats(),
> > which address many of the concerns listed earlier.
> >
> > Leaving the export/import scripts off for the time being, as they haven't
> > changed and the next likely change is to fold them into pg_dump.

> v6 posted below.
> 
> Changes:
> 
> - Additional documentation about the overall process.
> - Rewording of SGML docs.
> - removed a fair number of columns from the transformation queries.
> - enabled require_match_oids in extended statistics, but I'm having my
> doubts about the value of that.
> - moved stats extraction functions to an fe_utils file stats_export.c that
> will be used by both pg_export_stats and pg_dump.
> - pg_export_stats now generates SQL statements rather than a tsv, and has
> boolean flags to set the validate and require_match_oids parameters in the
> calls to pg_import_(rel|ext)_stats.
> - pg_import_stats is gone, as importing can now be done with psql.

Having looked through this thread and discussed a bit with Corey
off-line, the approach that Tom laid out up-thread seems like it would
make the most sense overall- that is, eliminate the JSON bits and the
SPI and instead export the stats data by running queries from the new
version of pg_dump/server (in the FDW case) against the old server
with the intelligence of how to transform the data into the format
needed for the current pg_dump/server to accept, through function calls
where the function calls generally map up to the rows/information being
updated- a call to update the information in pg_class for each relation
and then a call for each attribute to update the information in
pg_statistic.

Part of this process would include mapping from OIDs/attrnum's to names
on the source side and then from those names to the appropriate
OIDs/attrnum's on the destination side.
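The name-to-attnum half of that mapping is essentially what lsyscache already provides; below is a sketch of the destination-side lookup, where the helper name and error wording are invented:

#include "postgres.h"
#include "utils/lsyscache.h"

/*
 * Resolve a statistics entry by attribute name on the destination,
 * rather than trusting that the source's attnum still lines up.
 */
static AttrNumber
lookup_dest_attnum(Oid relid, const char *attname)
{
	AttrNumber	attnum = get_attnum(relid, attname);

	if (attnum == InvalidAttrNumber)
		ereport(ERROR,
				(errmsg("column \"%s\" does not exist", attname)));

	return attnum;
}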

As this code would be used by both pg_dump and the postgres_fdw, it
seems logical that it would go into the common library.  Further, it
would make sense to have this code be able to handle multiple major
versions for the foreign side, such as how postgres_fdw and pg_dump
already do.

In terms of working to ensure that newer versions support loading from
older dumps (that is, that v18 would be able to load a dump file created
by a v17 pg_dump against a v17 server in the face of changes having been
made to the statistics system in v18), we could have the functions take
a version parameter (to handle cases where the data structure is the
same but the contents have to be handled differently), use overloaded
functions, or have version-specific names for the functions.  I'm also
generally supportive of the idea that we, perhaps initially, only
support dumping/loading stats with pg_dump when in binary-upgrade mode,
which removes our need to be concerned with this (perhaps that would be
a good v1 of this feature?) as the version of pg_dump needs to match
that of pg_upgrade and the destination server for various other reasons.
Including a switch to exclude stats on restore might also be an
acceptable answer, or even simply excluding them by default when going
between major versions except in binary-upgrade mode.

Along those same lines when it comes to a 'v1', I'd say that we may wish
to consider excluding extended statistics, which I am fairly confident
Corey's heard a number of times previously already but thought I would
add my own support for that.  To the extent that we do want to make
extended stats work down the road, we should probably have some
pre-patches to flush out the missing _in/_recv functions for those types
which don't have them today- and that would include modifying the _out
of those types to use names instead of OIDs/attrnums.  In thinking about
this, I was reviewing specifically pg_dependencies.  To the extent that
there are people who depend on the current output, I would think that
they'd actually appreciate this change.

I don't generally feel like we need to be checking that the OIDs between
the old server and the new server match- I appreciate that that should
be the case in a binary-upgrade situation but it still feels unnecessary
and complicated and clutters up the output and the function calls.

Overall, I definitely think this is a good project to work on as it's an
often, rightfully, complained about issue when it comes to pg_upgrade
and the amount of downtime required for it before the upgraded system
can be reasonably used again.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Refactor SASL exchange in preparation for OAuth Bearer

2024-02-29 Thread Jacob Champion
On Wed, Feb 28, 2024 at 2:54 PM Daniel Gustafsson  wrote:
> I rank that as one of my better typos actually. Fixed though.

LGTM!

Thanks,
--Jacob




Re: MERGE ... RETURNING

2024-02-29 Thread Jeff Davis
On Thu, 2024-02-29 at 19:24 +, Dean Rasheed wrote:
> Attached is a rebased version on top of 5f2e179bd3 (support for MERGE
> into views), with a few additional tests to confirm that MERGE ...
> RETURNING works for views as well as tables.

Thank you for the rebase. I just missed your message (race condition),
so I replied to v15:

https://www.postgresql.org/message-id/e03a87eb4e728c5e475b360b5845979f78d49020.camel%40j-davis.com

> I see that this patch was discussed at the PostgreSQL Developers
> Meeting. Did anything new come out of that discussion?

I don't think we made any conclusions at the meeting, but I expressed
that we need input from one of them on this patch.

Regards,
Jeff Davis





Re: MERGE ... RETURNING

2024-02-29 Thread Jeff Davis


Can we get some input on whether the current MERGE ... RETURNING patch
is the right approach from a language standpoint?

We've gone through a lot of iterations -- thank you Dean, for
implementing so many variations.

To summarize, most of the problem has been in retrieving the action
(INSERT/UPDATE/DELETE) taken or the WHEN-clause number applied to a
particular matched row. The reason this is important is because the row
returned is the old row for a DELETE action, and the new row for an
INSERT or UPDATE action. Without a way to distinguish the particular
action, the RETURNING clause returns a mixture of old and new rows,
which would be hard to use sensibly.

Granted, DELETE in a MERGE may be a less common case. But given that we
also have INSERT ... ON CONFLICT, MERGE commands are more likely to be
the complicated cases where distinguishing the action or clause number
is important.

But linguistically it's not clear where the action or clause number
should come from. The clauses don't have assigned numbers, and even if
they did, linguistically it's not clear how to refer to the clause
number in a language like SQL. Would it be a special identifier, a
function, a special function, or be a column in a special table
reference? Or, do we just have one RETURNING-clause per WHEN-clause,
and let the user use a literal of their choice in the RETURNING clause?

The current implementation uses a special function MERGING (a
grammatical construct without an OID that parses into a new MergingFunc
expr), which takes keywords ACTION or CLAUSE_NUMBER in the argument
positions. That's not totally unprecedented in SQL -- the XML and JSON
functions are kind of similar. But it's different in the sense that
MERGING is also context-sensitive: grammatically, it fits pretty much
anywhere a function fits, but then gets rejected at parse analysis time
(or perhaps even execution time?) if it's not called from the right
place.

Is that a reasonable thing to do?

Another related topic came up, which is that the RETURNING clause (for
UPDATE as well as MERGE) should probably accept some kind of alias like
NEW/OLD or BEFORE/AFTER to address the version of the row that you
want. That doesn't eliminate the need for the MERGING function, but
it's good to think about how that might fit in with whatever we do
here.

Regards,
Jeff Davis





Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Feb-29, Tom Lane wrote:
>> I agree that perminfoindex seems to have suffered from add-at-the-end
>> syndrome, and if we do touch the field order you made an improvement
>> there.  (BTW, who thought they needn't bother with a comment for
>> perminfoindex?)

> There is a comment for it, or at least a61b1f74823c added one, though
> not immediately adjacent.  I do see that it's now further away than it
> was.  Perhaps we could add /* index of RTEPermissionInfo entry, or 0 */
> to the line.

That'd be enough for me.

regards, tom lane




RE: Remove AIX Support (was: Re: Relation bulk write facility)

2024-02-29 Thread Phil Florent
Hi,
Historically many public hospitals I work for had IBM Power hardware.
The SMT8 (8 threads per core) capabilities of Power CPUs are useful to lower Oracle 
licence & support cost. We migrated to PostgreSQL and it runs very well on 
Power, especially since the (relatively) recent parallel execution features of 
the RDBMS match the CPU capabilities very well.
We chose to run PostgreSQL on Debian/Power (Little Endian) since ppc64le is an 
official Debian port. No AIX then. The only problem is that we still need to 
access Oracle databases, and it can be useful to read them directly with 
oracle_fdw, but this tool needs an instant client, which is not open source of 
course. Oracle provides a binary, but they don't provide patches for 
Debian/Power Little Endian (strange situation...). Just to say that of course 
we chose Linux for PostgreSQL, but sometimes things are not so easy... We could 
have chosen AIX, and we still have a doubt about interoperability.
Best regards,
Phil

De : Andres Freund 
Envoyé : jeudi 29 février 2024 10:35
À : Michael Banck 
Cc : Noah Misch ; Thomas Munro ; 
Heikki Linnakangas ; Peter Smith ; 
Robert Haas ; vignesh C ; 
pgsql-hackers ; Melanie Plageman 

Objet : Re: Remove AIX Support (was: Re: Relation bulk write facility)

Hi,

On 2024-02-29 10:24:24 +0100, Michael Banck wrote:
> On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:
> > On 2024-02-29 09:13:04 +0100, Michael Banck wrote:
> > > The commit message says there is not a lot of user demand and that might
> > > be right, but contrary to other fringe OSes that got removed like HPPA
> > > or Irix, I believe Postgres on AIX is still used in production and if
> > > so, probably in a mission-critical manner at some old-school
> > > institutions (in fact, one of our customers does just that) and not as a
> > > thought-experiment. It is probably well-known among Postgres hackers
> > > that AIX support is problematic/a burden, but the current users might
> > > not be aware of this.
> >
> > Then these users should have paid somebody to actually do maintenance work 
> > on
> > the AIX support, so it doesn't regularly stand in the way of implementing
> > various things.
>
> Right, absolutely.
>
> But: did we ever tell them to do that? I don't think it's reasonable for
> them to expect to follow -hackers and jump in when somebody grumbles
> about AIX being a burden somewhere deep down a thread...

Well, the thing is that it's commonly going to be deep down some threads that
portability problems cause pain.  This is far from the only time. Just a few
threads:

https://postgr.es/m/CA+TgmoauCAv+p4Z57PqgVgNxsApxKs3Yh9mDLdUDB8fep-s=1...@mail.gmail.com
https://postgr.es/m/CA+hUKGK=DOC+hE-62FKfZy=ybt5ulkrg3zczd-jfykm-ipn...@mail.gmail.com
https://postgr.es/m/20230124165814.2njc7gnvubn2a...@awork3.anarazel.de
https://postgr.es/m/2385119.1696354...@sss.pgh.pa.us
https://postgr.es/m/20221005200710.luvw5evhwf6cl...@awork3.anarazel.de
https://postgr.es/m/20220820204401.vrf5kejih6jofvqb%40awork3.anarazel.de
https://postgr.es/m/E1oWpzF-002EG4-AG%40gemulon.postgresql.org

This is far from all.

The only platform rivalling AIX on the pain-caused metric is windows. And at
least that can be tested via CI (or locally).  We've been relying on the gcc
buildfarm to be able to maintain AIX at all, and that's not a resource that
scales to many users.

Greetings,

Andres Freund




Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-29 Thread Alvaro Herrera
On 2024-Feb-29, Tom Lane wrote:

> I agree that perminfoindex seems to have suffered from add-at-the-end
> syndrome, and if we do touch the field order you made an improvement
> there.  (BTW, who thought they needn't bother with a comment for
> perminfoindex?)

There is a comment for it, or at least a61b1f74823c added one, though
not immediately adjacent.  I do see that it's now further away than it
was.  Perhaps we could add /* index of RTEPermissionInfo entry, or 0 */
to the line.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"The ability of users to misuse tools is, of course, legendary" (David Steele)
https://postgr.es/m/11b38a96-6ded-4668-b772-40f992132...@pgmasters.net




Re: Comments on Custom RMGRs

2024-02-29 Thread Jeff Davis
On Thu, 2024-02-29 at 21:47 +0700, Danil Anisimow wrote:
> Answering your questions might take some time as I want to write a
> sample patch for pg_stat_statements and make some tests.
> What do you think about putting the patch into the commitfest, as it is
> closing in a few hours?

Added to March CF.

I don't have an immediate use case in mind for this, so please drive
that part of the discussion. I can't promise this for 17, but if the
patch is simple enough and a quick consensus develops, then it's
possible.

Regards,
Jeff Davis





Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-29 Thread Tom Lane
Peter Eisentraut  writes:
> In nodes/parsenodes.h, it says both
>  This *must* be false for RTEs other than RTE_RELATION entries.

Well, that's true in the parser ...

> and also puts it under
>  Fields valid in all RTEs:
> which are both wrong on opposite ends of the spectrum.
> I think it would make more sense to group inh under "Fields valid for a 
> plain relation RTE" and then explain the exception for subqueries, like 
> it is done for several other fields.

Dunno.  The adjacent "lateral" field is also used for only selected
RTE kinds.

I'd be inclined to leave it where it is and just improve the
commentary.  That could read like

 * inh is true for relation references that should be expanded to include
 * inheritance children, if the rel has any.  In the parser this
 * will only be true for RTE_RELATION entries.  The planner also uses
 * this field to mark RTE_SUBQUERY entries that contain UNION ALL
 * queries that it has flattened into pulled-up subqueries
 * (creating a structure much like the effects of inheritance).

If you do insist on moving it, please put it next to relkind so it
packs better.

I agree that perminfoindex seems to have suffered from add-at-the-end
syndrome, and if we do touch the field order you made an improvement
there.  (BTW, who thought they needn't bother with a comment for
perminfoindex?)

regards, tom lane




Re: Atomic ops for unlogged LSN

2024-02-29 Thread Bharath Rupireddy
On Thu, Feb 29, 2024 at 10:04 PM Nathan Bossart
 wrote:
>
> Here is a new version of the patch that uses the new atomic read/write
> functions with full barriers that were added in commit bd5132d.  Thoughts?

Thanks for getting the other patch in. The attached v6 patch LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-02-29 Thread Давыдов Виталий

Dear All,

Consider, please, my patch for async commit for twophase transactions. It can 
be applicable when catchup performance is not enough with the publication 
parameter twophase = on.

The key changes are:
 * Use XLogSetAsyncXactLSN instead of XLogFlush, as is done for usual 
   transactions.
 * In case of async commit only, save the 2PC state in the pg_twophase file 
   (but do not fsync it) in addition to saving it in the WAL. The file is used 
   as an alternative to storing the 2PC state in memory.
 * On recovery, reject pg_twophase files with future xids.

Probably, 2PC async commit should be enabled by a GUC (not implemented in the 
patch).

With best regards,
Vitaly


 
From cbaaa7270d771f9ccd6def08f0f02ce61dc15ff6 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov 
Date: Thu, 29 Feb 2024 18:58:13 +0300
Subject: [PATCH] Async commit support for twophase transactions

---
 src/backend/access/transam/twophase.c | 171 +-
 1 file changed, 138 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 234c8d08eb..352266be14 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -109,6 +109,8 @@
 #include "utils/memutils.h"
 #include "utils/timestamp.h"
 
+#define POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT
+
 /*
  * Directory where Two-phase commit files reside within PGDATA
  */
@@ -169,6 +171,7 @@ typedef struct GlobalTransactionData
 	BackendId	locking_backend;	/* backend currently working on the xact */
 	bool		valid;			/* true if PGPROC entry is in proc array */
 	bool		ondisk;			/* true if prepare state file is on disk */
+	bool		infile;			/* true if prepared state saved in file (but not fsync-ed) */
 	bool		inredo;			/* true if entry was added via xlog_redo */
 	char		gid[GIDSIZE];	/* The GID assigned to the prepared xact */
 }			GlobalTransactionData;
@@ -227,12 +230,14 @@ static void RemoveGXact(GlobalTransaction gxact);
 static void XlogReadTwoPhaseData(XLogRecPtr lsn, char **buf, int *len);
 static char *ProcessTwoPhaseBuffer(TransactionId xid,
    XLogRecPtr prepare_start_lsn,
-   bool fromdisk, bool setParent, bool setNextXid);
+   bool fromdisk, bool setParent, bool setNextXid,
+   const char *filename);
 static void MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid,
 const char *gid, TimestampTz prepared_at, Oid owner,
 Oid databaseid);
 static void RemoveTwoPhaseFile(TransactionId xid, bool giveWarning);
-static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len);
+static void RemoveTwoPhaseFileByName(const char *filename, bool giveWarning);
+static void RecreateTwoPhaseFile(TransactionId xid, void *content, int len, bool dosync);
 
 /*
  * Initialization of shared memory
@@ -427,6 +432,7 @@ MarkAsPreparing(TransactionId xid, const char *gid,
 
 	MarkAsPreparingGuts(gxact, xid, gid, prepared_at, owner, databaseid);
 
+	gxact->infile = false;
 	gxact->ondisk = false;
 
 	/* And insert it into the active array */
@@ -1204,6 +1210,37 @@ EndPrepare(GlobalTransaction gxact)
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
  errmsg("two-phase state file maximum length exceeded")));
 
+#ifdef POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT
+
+	Assert(gxact->infile == false);
+
+	if (synchronous_commit == SYNCHRONOUS_COMMIT_OFF)
+	{
+		char		   *buf;
+		size_t			len = 0;
+		size_t			offset = 0;
+
+		for (record = records.head; record != NULL; record = record->next)
+			len += record->len;
+
+		if (len > 0)
+		{
+			buf = palloc(len);
+
+			for (record = records.head; record != NULL; record = record->next)
+			{
+memcpy(buf + offset, record->data, record->len);
+offset += record->len;
+			}
+
+			RecreateTwoPhaseFile(gxact->xid, buf, len, false);
+			pfree(buf);
+			gxact->infile = true;
+		}
+	}
+
+#endif
+
 	/*
 	 * Now writing 2PC state data to WAL. We let the WAL's CRC protection
 	 * cover us, so no need to calculate a separate CRC.
@@ -1239,8 +1276,24 @@ EndPrepare(GlobalTransaction gxact)
    gxact->prepare_end_lsn);
 	}
 
+#if !defined(POSTGRESQL_TWOPHASE_SUPPORT_ASYNC_COMMIT)
+
 	XLogFlush(gxact->prepare_end_lsn);
 
+#else
+
+	if (synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
+	{
+		/* Flush XLOG to disk */
+		XLogFlush(gxact->prepare_end_lsn);
+	}
+	else
+	{
+		XLogSetAsyncXactLSN(gxact->prepare_end_lsn);
+	}
+
+#endif
+
 	/* If we crash now, we have prepared: WAL replay will fix things */
 
 	/* Store record's start location to read that later on Commit */
@@ -1546,12 +1599,11 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	 * in WAL files if the LSN is after the last checkpoint record, or moved
 	 * to disk if for some reason they have lived for a long time.
 	 */
-	if (gxact->ondisk)
+	if (gxact->infile || gxact->ondisk)
 		buf = ReadTwoPhaseFile(xid, false);
 	else
		XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);
 
-
 	/*
 	 * Disassemble the header area
 	 */
@@ 

Re: Fix for edge case in date_bin() function

2024-02-29 Thread Tom Lane
Moaaz Assali  writes:
> However, I don't see the issue with the INT64 -> UINT64 mapping. The
> current implementation results in integer overflows (errors instead after
> the recent patch) even for valid timestamps where the result of date_bin()
> is also another valid timestamp.

> On the other hand, the INT64 -> UINT64 mapping solves this issue and allows
> the input of any valid source and origin timestamps as long as the stride
> chosen doesn't output invalid timestamps that cannot be represented by
> Timestamp(tz) type anyways. Since all INT64 values can be mapped 1-to-1 in
> UINT64, I don't see where the problem is.

What I don't like about it is that it's complicated (and you didn't
make any effort whatsoever to make the code intelligible or self-
documenting), and that complication has zero real-world benefit.
The only way to hit an overflow in this subtraction is with dates
well beyond 20 AD.  If you are actually dealing with such dates
(maybe you're an astronomer or a geologist), then timestamp[tz] isn't
the data type for you, because you probably need orders of magnitude
wider range than it's got.

Now I'll freely admit that the pg_xxx_yyy_overflow() functions are
notationally klugy, but they're well documented and they're something
that people would need to understand anyway for a lot of other places
in Postgres.  So I think there's less cognitive load for readers of
the code in the let's-throw-an-error approach than in writing one-off
magic code that in the end can avoid only one of the three possible
overflow cases in this function.
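Concretely, the let's-throw-an-error approach reads roughly like this; a sketch built on the helpers in src/include/common/int.h, not a quote of the committed fix:

#include "postgres.h"
#include "common/int.h"
#include "datatype/timestamp.h"

/*
 * Subtract origin from ts, erroring out instead of silently wrapping
 * around on int64 overflow.
 */
static int64
timestamp_delta_checked(Timestamp ts, Timestamp origin)
{
	int64		tm_diff;

	if (unlikely(pg_sub_s64_overflow(ts, origin, &tm_diff)))
		ereport(ERROR,
				(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
				 errmsg("interval out of range")));

	return tm_diff;
}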

regards, tom lane




Re: Atomic ops for unlogged LSN

2024-02-29 Thread John Morris
Looks good to me.

  *   John

From: Nathan Bossart 
Date: Thursday, February 29, 2024 at 8:34 AM
To: Andres Freund 
Cc: Stephen Frost , John Morris 
, Bharath Rupireddy 
, Michael Paquier 
, Robert Haas , 
pgsql-hack...@postgresql.org 
Subject: Re: Atomic ops for unlogged LSN
Here is a new version of the patch that uses the new atomic read/write
functions with full barriers that were added in commit bd5132d.  Thoughts?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


Re: Atomic ops for unlogged LSN

2024-02-29 Thread Stephen Frost
Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> Here is a new version of the patch that uses the new atomic read/write
> functions with full barriers that were added in commit bd5132d.  Thoughts?

Saw that commit go in- glad to see it.  Thanks for updating this patch
too.  The changes look good to me.

Thanks again,

Stephen


signature.asc
Description: PGP signature


Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Alexander Kukushkin
Hi Kyotaro,

On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi 
wrote:

> In the first place, it's important to note that we do not guarantee
> that an async standby can always switch its replication connection to
> the old primary or another sibling standby. This is due to the
> variations in replication lag among standbys. pg_rewind is required to
> adjust such discrepancies.
>

Sure, I know. But in this case the async standby received and flushed
absolutely the same amount of WAL as the promoted one.


>
> I might be overlooking something, but I don't understand how this
> occurs without purposefully tweaking WAL files. The repro script
> pushes an incomplete WAL file to the archive as a non-partial
> segment. This shouldn't happen in the real world.
>

It easily happens if the primary crashed and standbys didn't receive
another page with continuation record.

> In the repro script, the replication connection of the second standby
> is switched from the old primary to the first standby after its
> promotion. After the switching, replication is expected to continue
> from the beginning of the last replayed segment.


Well, maybe, but apparently the standby is busy trying to decode a record
that spans multiple pages, and it is just infinitely waiting for the next
page to arrive. Also, the restart "fixes" the problem, because indeed it is
reading the file from the beginning.


> But with the script,
> the second standby copies the intentionally broken file, which differs
> from the data that should be received via streaming.


As I already said, this is a simple way to emulate the primary crash while
standbys are receiving WAL.
It could easily happen that a record that spans multiple pages is not fully
received and flushed.

-- 
Regards,
--
Alexander Kukushkin


Re: Supporting MERGE on updatable views

2024-02-29 Thread Dean Rasheed
On Thu, 29 Feb 2024 at 09:50, Alvaro Herrera  wrote:
>
> By all means let's get the feature out there.  It's not a frequently
> requested thing but it does seem to come up.
>

Pushed. Thanks for reviewing.

Regards,
Dean




Re: Infinite loop in XLogPageRead() on standby

2024-02-29 Thread Alexander Kukushkin
Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier  wrote:

>
> Wow.  Have you seen that in an actual production environment?
>

Yes, we see it regularly, and it is reproducible in test environments as
well.


> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
>  "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record.  Then abuse of WAL archiving to
> force the replay of the last record.  That's kind of cool.
>

Right, at this point it is easier than causing an artificial crash on the
primary after it finished writing just one page.


> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a
> new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm.  I suspect that you may be right on a TLI change when reading a
> page.  There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader().  Perhaps you
> have an idea of patch to show your point?
>

Not yet, but hopefully I will get something done next week.


>
> Nit.  In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
>

Thanks, I'll look into it.

-- 
Regards,
--
Alexander Kukushkin


Re: Atomic ops for unlogged LSN

2024-02-29 Thread Nathan Bossart
Here is a new version of the patch that uses the new atomic read/write
functions with full barriers that were added in commit bd5132d.  Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From c40594c62ccfbf75cb0d3787cb9367d15ae37de8 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 29 Feb 2024 10:25:46 -0600
Subject: [PATCH v6 1/1] Convert unloggedLSN to an atomic variable.

Currently, this variable is an XLogRecPtr protected by a spinlock.
By converting it to an atomic variable, we can remove the spinlock,
which saves a small amount of shared memory space.  Since this code
is not performance-critical, we use atomic operations with full
barrier semantics to make it easy to reason about correctness.

Author: John Morris
Reviewed-by: Michael Paquier, Robert Haas, Andres Freund, Stephen Frost, Bharath Rupireddy
Discussion: https://postgr.es/m/BYAPR13MB26772534335255E50318C574A0409%40BYAPR13MB2677.namprd13.prod.outlook.com
Discussion: https://postgr.es/m/MN2PR13MB2688FD8B757316CB5C54C8A2A0DDA%40MN2PR13MB2688.namprd13.prod.outlook.com
---
 src/backend/access/transam/xlog.c | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1162d55bf..eb02e3b6a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -470,9 +470,8 @@ typedef struct XLogCtlData
 
 	XLogSegNo	lastRemovedSegNo;	/* latest removed/recycled XLOG segment */
 
-	/* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
-	XLogRecPtr	unloggedLSN;
-	slock_t		ulsn_lck;
+	/* Fake LSN counter, for unlogged relations. */
+	pg_atomic_uint64 unloggedLSN;
 
 	/* Time and LSN of last xlog segment switch. Protected by WALWriteLock. */
 	pg_time_t	lastSegSwitchTime;
@@ -4498,14 +4497,7 @@ DataChecksumsEnabled(void)
 XLogRecPtr
 GetFakeLSNForUnloggedRel(void)
{
-	XLogRecPtr	nextUnloggedLSN;
-
-	/* increment the unloggedLSN counter, need SpinLock */
-	SpinLockAcquire(&XLogCtl->ulsn_lck);
-	nextUnloggedLSN = XLogCtl->unloggedLSN++;
-	SpinLockRelease(&XLogCtl->ulsn_lck);
-
-	return nextUnloggedLSN;
+	return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);
 }
 
 /*
@@ -4921,7 +4913,7 @@ XLOGShmemInit(void)
 
	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
	SpinLockInit(&XLogCtl->info_lck);
-	SpinLockInit(&XLogCtl->ulsn_lck);
+	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
 }
 
 /*
@@ -5526,9 +5518,11 @@ StartupXLOG(void)
 	 * the unlogged LSN counter can be reset too.
 	 */
	if (ControlFile->state == DB_SHUTDOWNED)
-		XLogCtl->unloggedLSN = ControlFile->unloggedLSN;
+		pg_atomic_write_membarrier_u64(&XLogCtl->unloggedLSN,
+									   ControlFile->unloggedLSN);
 	else
-		XLogCtl->unloggedLSN = FirstNormalUnloggedLSN;
+		pg_atomic_write_membarrier_u64(&XLogCtl->unloggedLSN,
+									   FirstNormalUnloggedLSN);
 
 	/*
 	 * Copy any missing timeline history files between 'now' and the recovery
@@ -7110,9 +7104,7 @@ CreateCheckPoint(int flags)
 	 * unused on non-shutdown checkpoints, but seems useful to store it always
 	 * for debugging purposes.
 	 */
-	SpinLockAcquire(&XLogCtl->ulsn_lck);
-	ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
-	SpinLockRelease(&XLogCtl->ulsn_lck);
+	ControlFile->unloggedLSN = pg_atomic_read_membarrier_u64(&XLogCtl->unloggedLSN);
 
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
-- 
2.25.1



Re: locked reads for atomics

2024-02-29 Thread Nathan Bossart
Committed.  Thank you for reviewing!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: ALTER TABLE SET ACCESS METHOD on partitioned tables

2024-02-29 Thread Justin Pryzby
On Wed, Feb 28, 2024 at 05:08:49PM +0900, Michael Paquier wrote:
> On Wed, Feb 21, 2024 at 08:46:48AM +0100, Peter Eisentraut wrote:
> > Yes, I think most people agreed that that would be the preferred behavior.
> 
> Challenge accepted.  As of the patch attached.

Thanks for picking it up.  I find it pretty hard to switch back and
put the needed effort into a patch after a long period.

> I have implemented that so as we keep the default, historical
> behavior: if pg_class.relam is 0 for a partitioned table, use the AM
> defined by default_table_access_method.  The patch only adds a path to
> switch to a different AM than the GUC when creating a new partition if
> and only if a partitioned table has been manipulated with ALTER TABLE
> SET ACCESS METHOD to update its AM to something else than the GUC.
> Similarly to tablespaces, CREATE TABLE USING is *not* supported for
> partitioned tables, same behavior as previously.
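Spelled out, that resolution rule for a new partition is simply the following (a hypothetical helper, not the patch's actual code):

#include "postgres.h"
#include "access/tableam.h"
#include "commands/defrem.h"

/*
 * Pick the access method for a new partition: a parent that has been
 * ALTER TABLE ... SET ACCESS METHOD'd has a nonzero pg_class.relam and
 * wins; otherwise fall back to the default_table_access_method GUC.
 */
static Oid
resolve_partition_am(Oid parent_relam)
{
	if (OidIsValid(parent_relam))
		return parent_relam;

	return get_table_am_oid(default_table_access_method, false);
}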

This patch allows resetting relam=0 by running ALTER TABLE SET AM to the
same value as the GUC.  Maybe it'd be better to have an explicit SET
DEFAULT (as in b9424d01 and 4f622503).

-- 
Justin




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-02-29 Thread Jacob Champion
[re-adding the CC list I dropped earlier]

On Wed, Feb 28, 2024 at 1:52 PM Daniel Gustafsson  wrote:
>
> > On 28 Feb 2024, at 22:50, Andrew Dunstan  wrote:
> > Can you give some more details about what this python gadget would buy us? 
> > I note that there are a couple of CPAN modules that provide OAuth2 servers, 
> > not sure if they would be of any use.
>
> The main benefit would be to be able to provide a full testharness without
> adding any additional dependencies over what we already have (Python being
> required by meson).  That should ideally make it easy to get good coverage 
> from
> BF animals as no installation is needed.

As an additional note, the test suite ideally needs to be able to
exercise failure modes where the provider itself is malfunctioning. So
we hand-roll responses rather than deferring to an external
OAuth/OpenID implementation, which adds HTTP and JSON dependencies at
minimum, and Python includes both. See also the discussion with
Stephen upthread [1].

(I do think it'd be nice to eventually include a prepackaged OAuth
server in the test suite, to stack coverage for the happy path and
further test interoperability.)

Thanks,
--Jacob

[1] 
https://postgr.es/m/CAAWbhmh%2B6q4t3P%2BwDmS%3DJuHBpcgF-VM2cXNft8XV02yk-cHCpQ%40mail.gmail.com




Re: Comments on Custom RMGRs

2024-02-29 Thread Danil Anisimow
On Tue, Feb 27, 2024 at 2:56 AM Jeff Davis  wrote:
> Let's pick this discussion back up, then. Where should the hook go?
> Does it need to be broken into phases like resource owners? What
> guidance can we provide to extension authors to use it correctly?
>
> Simon's right that these things don't need to be 100% answered for
> every hook we add; but I agree with Andres and Robert that this could
> benefit from some more discussion about the details.
>
> The proposal calls the hook right after CheckPointPredicate() and
> before CheckPointBuffers(). Is that the right place for the use case
> you have in mind with pg_stat_statements?
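For readers following along, that placement would sit in CheckPointGuts() roughly as below; the rm_checkpoint callback does not exist today and stands in for whatever the patch adds to RmgrData, and the loop shape is a guess:

#include "postgres.h"
#include "access/xlog_internal.h"

/* called between CheckPointPredicate() and CheckPointBuffers(flags) */
static void
CheckPointCustomRmgrs(int flags)
{
	for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
	{
		RmgrData	rmgr;

		if (!RmgrIdExists(rmid))
			continue;			/* unregistered resource manager id */

		rmgr = GetRmgr(rmid);
		if (rmgr.rm_checkpoint != NULL) /* hypothetical new callback */
			rmgr.rm_checkpoint(flags);
	}
}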

Hello!

Answering your questions might take some time as I want to write a sample
patch for pg_stat_statements and make some tests.
What do you think about putting the patch into the commitfest, as it is
closing in a few hours?

--
Regards,
Daniil Anisimov
Postgres Professional: http://postgrespro.com


Re: Injection points: some tools to wait and wake

2024-02-29 Thread Andrey M. Borodin



> On 29 Feb 2024, at 15:20, Bertrand Drouvot  
> wrote:
> 
> done in v2 attached.

Works fine for me. Thanks!


Best regards, Andrey Borodin.



Re: RangeTblEntry jumble omissions

2024-02-29 Thread Peter Eisentraut

On 26.02.24 02:08, Michael Paquier wrote:
> On Fri, Feb 23, 2024 at 06:52:54PM -0500, Tom Lane wrote:
>> Julien Rouhaud  writes:
>>> On Fri, Feb 23, 2024 at 04:26:53PM +0100, Peter Eisentraut wrote:
>>>> - funcordinality
>>>> This was probably just forgotten.  It should be included because the WITH
>>>> ORDINALITY clause changes the query result.
>>>
>>> Agreed.
>>
>> Seems OK.
>
> +1.


Ok, I have added funcordinality for the RTE_FUNCTION case, and left the 
others alone.






Re: RangeTblEntry.inh vs. RTE_SUBQUERY

2024-02-29 Thread Peter Eisentraut

On 23.02.24 16:19, Tom Lane wrote:
> Dean Rasheed  writes:
>> On Fri, 23 Feb 2024 at 14:35, Peter Eisentraut  wrote:
>>> Various code comments say that the RangeTblEntry field inh may only be
>>> set for entries of kind RTE_RELATION.
>>
>> Yes, it's explained a bit more clearly/accurately in expand_inherited_rtentry():
>>
>>   * "inh" is only allowed in two cases: RELATION and SUBQUERY RTEs.
>
> Yes.  The latter has been accurate for a very long time, so I'm
> surprised that there are any places that think otherwise.  We need
> to fix them --- where did you see this exactly?


In nodes/parsenodes.h, it says both

This *must* be false for RTEs other than RTE_RELATION entries.

and also puts it under

Fields valid in all RTEs:

which are both wrong on opposite ends of the spectrum.

I think it would make more sense to group inh under "Fields valid for a 
plain relation RTE" and then explain the exception for subqueries, like 
it is done for several other fields.


See attached patch for a proposal.  (I also shuffled a few fields around
to make the order a bit more logical.)

From 631fa6a211f7d2022e78e5683ed234a00ae144bd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 29 Feb 2024 13:06:48 +0100
Subject: [PATCH] Fix description and grouping of RangeTblEntry.inh

---
 src/backend/nodes/outfuncs.c         |  5 +++--
 src/backend/nodes/queryjumblefuncs.c |  2 +-
 src/backend/nodes/readfuncs.c        |  5 +++--
 src/backend/parser/parse_relation.c  | 11 ++-
 src/include/nodes/parsenodes.h       | 15 +--
 5 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2c30bba212..1c66be3af8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -505,8 +505,9 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
WRITE_OID_FIELD(relid);
WRITE_CHAR_FIELD(relkind);
WRITE_INT_FIELD(rellockmode);
-   WRITE_NODE_FIELD(tablesample);
WRITE_UINT_FIELD(perminfoindex);
+   WRITE_BOOL_FIELD(inh);
+   WRITE_NODE_FIELD(tablesample);
break;
case RTE_SUBQUERY:
WRITE_NODE_FIELD(subquery);
@@ -516,6 +517,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
WRITE_CHAR_FIELD(relkind);
WRITE_INT_FIELD(rellockmode);
WRITE_UINT_FIELD(perminfoindex);
+   WRITE_BOOL_FIELD(inh);
break;
case RTE_JOIN:
WRITE_ENUM_FIELD(jointype, JoinType);
@@ -564,7 +566,6 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
}
 
WRITE_BOOL_FIELD(lateral);
-   WRITE_BOOL_FIELD(inh);
WRITE_BOOL_FIELD(inFromCl);
WRITE_NODE_FIELD(securityQuals);
 }
diff --git a/src/backend/nodes/queryjumblefuncs.c 
b/src/backend/nodes/queryjumblefuncs.c
index 82f725baaa..1900b8bca4 100644
--- a/src/backend/nodes/queryjumblefuncs.c
+++ b/src/backend/nodes/queryjumblefuncs.c
@@ -364,8 +364,8 @@ _jumbleRangeTblEntry(JumbleState *jstate, Node *node)
{
case RTE_RELATION:
JUMBLE_FIELD(relid);
-   JUMBLE_NODE(tablesample);
JUMBLE_FIELD(inh);
+   JUMBLE_NODE(tablesample);
break;
case RTE_SUBQUERY:
JUMBLE_NODE(subquery);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index b1e2f2b440..b80441e0c2 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -359,8 +359,9 @@ _readRangeTblEntry(void)
READ_OID_FIELD(relid);
READ_CHAR_FIELD(relkind);
READ_INT_FIELD(rellockmode);
-   READ_NODE_FIELD(tablesample);
READ_UINT_FIELD(perminfoindex);
+   READ_BOOL_FIELD(inh);
+   READ_NODE_FIELD(tablesample);
break;
case RTE_SUBQUERY:
READ_NODE_FIELD(subquery);
@@ -370,6 +371,7 @@ _readRangeTblEntry(void)
READ_CHAR_FIELD(relkind);
READ_INT_FIELD(rellockmode);
READ_UINT_FIELD(perminfoindex);
+   READ_BOOL_FIELD(inh);
break;
case RTE_JOIN:
READ_ENUM_FIELD(jointype, JoinType);
@@ -428,7 +430,6 @@ _readRangeTblEntry(void)
}
 
READ_BOOL_FIELD(lateral);
-   READ_BOOL_FIELD(inh);
READ_BOOL_FIELD(inFromCl);
READ_NODE_FIELD(securityQuals);
 
diff --git a/src/backend/parser/parse_relation.c 
b/src/backend/parser/parse_relation.c
index 

Re: brininsert optimization opportunity

2024-02-29 Thread Alvaro Herrera
On 2024-Feb-13, Tomas Vondra wrote:

> One thing that is not very clear to me is that I don't think there's a
> very good way to determine which places need the cleanup call. Because
> it depends on (a) what kind of index is used and (b) what happens in the
> code called earlier (which may easily do arbitrary stuff). Which means
> we have to call the cleanup whenever the code *might* have done inserts
> into the index. Maybe it's not such an issue in practice, though.

I think it's not an issue, or rather that we should not try to guess.
Instead make it a simple rule: if aminsert is called, then
aminsertcleanup must be called afterwards, period.

> On 1/8/24 16:51, Alvaro Herrera wrote:

> > So I think we should do 0001 and perhaps some further tweaks to the
> > original brininsert optimization commit: I think the aminsertcleanup
> > callback should receive the indexRelation as first argument; and also I
> > think it's not index_insert_cleanup() job to worry about whether
> > ii_AmCache is NULL or not, but instead the callback should be invoked
> > always, and then it's aminsertcleanup job to do nothing if ii_AmCache is
> > NULL. [...]

> I don't quite see why we should invoke the callback with ii_AmCache=NULL
> though. If there's nothing cached, why bother? It just means all cleanup
> callbacks have to do this NULL check on their own.

Guessing that aminsertcleanup is not needed when ii_AmCache is NULL
seems like a leaky abstraction.  I propose to have only the AM know
whether the cleanup call is important or not, without
index_insert_cleanup assuming that it's related to ii_AmCache.  Somebody
could decide to have something completely different during insert
cleanup, which is not in ii_AmCache.
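
To make that concrete, a sketch of the proposed shape (using the
indexRelation-first signature suggested above; illustrative only, not the
committed code):

/* Always invoke the callback; the AM decides whether anything needs doing. */
void
index_insert_cleanup(Relation indexRelation, IndexInfo *indexInfo)
{
	if (indexRelation->rd_indam->aminsertcleanup)
		indexRelation->rd_indam->aminsertcleanup(indexRelation, indexInfo);
}

/* The AM-side callback then owns the "nothing was cached" check. */
void
brininsertcleanup(Relation index, IndexInfo *indexInfo)
{
	BrinInsertState *bistate = (BrinInsertState *) indexInfo->ii_AmCache;

	if (bistate == NULL)
		return;				/* aminsert was never called for this index */

	/* ... release the cached revmap access and descriptor here ... */
}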

> I wonder if there's a nice way to check this in assert-enabled builds?
> Could we tweak nbtree (or index AM in general) to check that all places
> that called aminsert also called aminsertcleanup?
> 
> For example, I was thinking we might add a flag to IndexInfo (separate
> from the ii_AmCache), tracking if aminsert() was called, and Then later
> check the aminsertcleanup() got called too. The problem however is
> there's no obviously convenient place for this check, because IndexInfo
> is not freed explicitly ...

I agree it would be nice to have a way to verify, but it doesn't seem
100% essential.  After all, it's not very common to add new calls to
aminsert.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada  wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in a non-assert build? I wonder why
my compiler didn't yell at me... Have you tried a tidstore-debug build
without asserts?

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+"tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other
context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+? tree->leaf_ctx
+: tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when
necessary (those names should look more like each other, btw)? I think
that means anything not covered by this case:

+#ifndef RT_VARLEN_VALUE_SIZE
+ if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+ tree->leaf_ctx = SlabContextCreate(ctx,
+RT_STR(RT_PREFIX) "radix_tree leaf contex",
+RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+sizeof(RT_VALUE_TYPE));
+#endif /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I
don't recall why we are? We've never had a fixed-length type test case
on 64-bit, so it wasn't because it won through benchmarking. It seems to be
a hold-over from the days of "multi-value leaves". Is it to avoid the
possibility of space wastage with non-power-of-two size types?
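
Concretely, the suggestion amounts to something like this (a sketch reusing
the names from the quoted hunks):

#ifndef RT_VARLEN_VALUE_SIZE
	if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
		tree->leaf_ctx = SlabContextCreate(ctx,
										   RT_STR(RT_PREFIX) "radix_tree leaf context",
										   RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
										   sizeof(RT_VALUE_TYPE));
	else
		tree->leaf_ctx = tree->context;	/* alias the main context */
#else
	tree->leaf_ctx = tree->context;		/* varlen values: no slab */
#endif

	/* ...so allocation sites never need to branch: */
	leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);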

For this stanza that remains unchanged:

for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
  MemoryContextDelete(tree->node_slabs[i]);
}

if (tree->leaf_ctx)
{
  MemoryContextDelete(tree->leaf_ctx);
}

...is there a reason we can't just delete tree->ctx, and let that
recursively delete child contexts?

Secondly, I thought about my recent work to skip checking if we first
need to create a root node, and that has a harmless (for vacuum at
least) but slightly untidy behavior: When RT_SET is first called, and
the key is bigger than 255, new nodes will go on top of the root node.
These have chunk '0'. If all subsequent keys are big enough, the
original root node will stay empty. If all keys are deleted, there will
be a chain of empty nodes remaining. Again, I believe this is
harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it
seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?




Re: Improve readability by using designated initializers when possible

2024-02-29 Thread Peter Eisentraut

On 27.02.24 08:57, Alvaro Herrera wrote:
> On 2024-Feb-27, Michael Paquier wrote:
>> These would cause compilation failures.  Saying that, this is a very
>> nice cleanup, so I've fixed these and applied the patch after checking
>> that the one-one replacements were correct.
>
> Oh, I thought we were going to get rid of ObjectClass altogether -- I
> mean, have getObjectClass() return ObjectAddress->classId, and then
> define the OCLASS values for each catalog OID [... tries to ...]  But
> this(*) doesn't work for two reasons:


I have long wondered what the point of ObjectClass is.  I find the extra 
layer of redirection, which is used only in small parts of the code, and 
the similarity to ObjectType confusing.  I happened to have a draft 
patch for its removal lying around, so I'll show it here, rebased over 
what has already been done in this thread.



> 1. some switches processing the OCLASS enum don't have "default:" cases.
> This is so that the compiler tells us when we fail to add support for
> some new object class (and it's been helpful).  If we were to


I think you can also handle that with some assertions and proper test 
coverage.  It's not even clear how strong this benefit is.  For example, 
in AlterObjectNamespace_oid(), you could still put a new OCLASS into the 
"ignore object types that don't have schema-qualified names" case, and 
it might or might not be wrong.  Also, there are already various OCLASS 
switches that do have a default case, so it's not even clear what the 
preferred coding style should be.



> 2. all users of getObjectClass would have to include the catalog header
> specific to every catalog it wants to handle; so tablecmds.c and
> dependency.c would have to include almost all catalog includes, for
> example.


This doesn't seem to be a problem in practice; see patch.
From c3acb638301b7a00d18baaf464c29f742734629a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 29 Feb 2024 12:06:15 +0100
Subject: [PATCH] Remove ObjectClass

---
 src/backend/catalog/dependency.c| 223 +++-
 src/backend/catalog/objectaddress.c | 311 
 src/backend/commands/alter.c|  73 ++-
 src/backend/commands/tablecmds.c|  60 ++
 src/include/catalog/dependency.h|  51 -
 5 files changed, 188 insertions(+), 530 deletions(-)

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 192dedb3cb..d6a5d1d185 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1351,9 +1351,9 @@ deleteOneObject(const ObjectAddress *object, Relation 
*depRel, int flags)
 static void
 doDeletion(const ObjectAddress *object, int flags)
 {
-   switch (getObjectClass(object))
+   switch (object->classId)
{
-   case OCLASS_CLASS:
+   case RelationRelationId:
{
charrelKind = 
get_rel_relkind(object->objectId);
 
@@ -1384,104 +1384,80 @@ doDeletion(const ObjectAddress *object, int flags)
break;
}
 
-   case OCLASS_PROC:
+   case ProcedureRelationId:
RemoveFunctionById(object->objectId);
break;
 
-   case OCLASS_TYPE:
+   case TypeRelationId:
RemoveTypeById(object->objectId);
break;
 
-   case OCLASS_CONSTRAINT:
+   case ConstraintRelationId:
RemoveConstraintById(object->objectId);
break;
 
-   case OCLASS_DEFAULT:
+   case AttrDefaultRelationId:
RemoveAttrDefaultById(object->objectId);
break;
 
-   case OCLASS_LARGEOBJECT:
+   case LargeObjectRelationId:
LargeObjectDrop(object->objectId);
break;
 
-   case OCLASS_OPERATOR:
+   case OperatorRelationId:
RemoveOperatorById(object->objectId);
break;
 
-   case OCLASS_REWRITE:
+   case RewriteRelationId:
RemoveRewriteRuleById(object->objectId);
break;
 
-   case OCLASS_TRIGGER:
+   case TriggerRelationId:
RemoveTriggerById(object->objectId);
break;
 
-   case OCLASS_STATISTIC_EXT:
+   case StatisticExtRelationId:
RemoveStatisticsById(object->objectId);
break;
 
-   case OCLASS_TSCONFIG:
+   case TSConfigRelationId:
RemoveTSConfigurationById(object->objectId);
break;
 
-   case OCLASS_EXTENSION:
+   case ExtensionRelationId:

Re: Synchronizing slots from primary to standby

2024-02-29 Thread shveta malik
On Thu, Feb 29, 2024 at 7:04 AM Zhijie Hou (Fujitsu)
 wrote:
>
> Here is the v101 patch set which addressed above comments.
>

Thanks for the patch. Few comments:

1)  Shall we mention in the docs that shutdown will wait for standbys in
standby_slot_names to confirm receiving WAL:

Suggestion for logicaldecoding.sgml:

When standby_slot_names is utilized, the primary
server will not completely shut down until the corresponding standbys,
associated with the physical replication slots specified in
standby_slot_names, have confirmed receiving the
WAL up to the latest flushed position on the primary server.

slot.c
2)
/*
 * If a logical slot name is provided in standby_slot_names, report
 * a message and skip it. Although logical slots are disallowed in
 * the GUC check_hook(validate_standby_slots), it is still
 * possible for a user to drop an existing physical slot and
 * recreate a logical slot with the same name.
 */

This is not completely true; we can still specify a logical slot at
instance start, and it will be accepted.

Suggestion:
/*
 * If a logical slot name is provided in standby_slot_names, report
 * a message and skip it. It is possible for user to specify a
 * logical slot name in standby_slot_names just before the server
 * startup. The GUC check_hook(validate_standby_slots) can not
 * validate such a slot during startup as the ReplicationSlotCtl
 * shared memory is not initialized by that time. It is also
 * possible for user to drop an existing physical slot and
 * recreate a logical slot with the same name.
 */

3. Wait for physical standby to confirm receiving the given lsn

standby --> standbys


4.
In StandbyConfirmedFlush(), is it better to have the below errdetail in
all problematic cases:
Logical replication is waiting on the standby associated with \"%s\

We have it only for the inactive-pid case, but we are waiting in all cases.


thanks
Shveta




Re: Synchronizing slots from primary to standby

2024-02-29 Thread Amit Kapila
On Thu, Feb 29, 2024 at 3:23 PM Amit Kapila  wrote:
>
> Few additional comments on the latest patch:
> =
> 1.
>  static XLogRecPtr
>  WalSndWaitForWal(XLogRecPtr loc)
> {
> ...
> + if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr &&
> + (!replication_active || StandbyConfirmedFlush(loc, WARNING)))
> + {
> + /*
> + * Fast path to avoid acquiring the spinlock in case we already know
> + * we have enough WAL available and all the standby servers have
> + * confirmed receipt of WAL up to RecentFlushPtr. This is particularly
> + * interesting if we're far behind.
> + */
>   return RecentFlushPtr;
> + }
> ...
> ...
> + * Wait for WALs to be flushed to disk only if the postmaster has not
> + * asked us to stop.
> + */
> + if (loc > RecentFlushPtr && !got_STOPPING)
> + wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
> +
> + /*
> + * Check if the standby slots have caught up to the flushed position.
> + * It is good to wait up to RecentFlushPtr and then let it send the
> + * changes to logical subscribers one by one which are already covered
> + * in RecentFlushPtr without needing to wait on every change for
> + * standby confirmation. Note that after receiving the shutdown signal,
> + * an ERROR is reported if any slots are dropped, invalidated, or
> + * inactive. This measure is taken to prevent the walsender from
> + * waiting indefinitely.
> + */
> + else if (replication_active &&
> + !StandbyConfirmedFlush(RecentFlushPtr, WARNING))
> + {
> + wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
> + wait_for_standby = true;
> + }
> + else
> + {
> + /* Already caught up and doesn't need to wait for standby_slots. */
>   break;
> + }
> ...
> }
>
> Can we try to move these checks into a separate function that returns
> a boolean and has an out parameter as wait_event?
>
> 2. How about naming StandbyConfirmedFlush() as StandbySlotsAreCaughtup?
>

Some more comments:
==
1.
+ foreach_ptr(char, name, elemlist)
+ {
+ ReplicationSlot *slot;
+
+ slot = SearchNamedReplicationSlot(name, true);
+
+ if (!slot)
+ {
+ GUC_check_errdetail("replication slot \"%s\" does not exist",
+ name);
+ ok = false;
+ break;
+ }
+
+ if (!SlotIsPhysical(slot))
+ {
+ GUC_check_errdetail("\"%s\" is not a physical replication slot",
+ name);
+ ok = false;
+ break;
+ }
+ }

Won't the second check need protection via ReplicationSlotControlLock?
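
Presumably something along these lines (a sketch of the idea only):

	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	foreach_ptr(char, name, elemlist)
	{
		/* need_lock = false: we already hold ReplicationSlotControlLock */
		ReplicationSlot *slot = SearchNamedReplicationSlot(name, false);

		/* ... existence and SlotIsPhysical() checks as above ... */
	}
	LWLockRelease(ReplicationSlotControlLock);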

2. In WaitForStandbyConfirmation(), we are anyway calling
StandbyConfirmedFlush() before the actual sleep in the loop, so does
calling it at the beginning of the function serve any purpose? If
so, it is better to add some comments explaining the same.

3. Also do we need to perform the validation checks done in
StandbyConfirmedFlush() repeatedly when it is invoked in a loop? We
can probably separate those checks and perform them just once.

4.
+   *
+   * XXX: If needed, this can be attempted in future.

Remove this part of the comment.
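
For point 1 in the quoted mail above, a sketch of what such a function could
look like (names are illustrative, and it already uses the rename suggested
in point 2):

static bool
NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
				 uint32 *wait_event)
{
	/* Do we still need the WAL itself to be flushed? */
	if (target_lsn > flushed_lsn && !got_STOPPING)
	{
		*wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
		return true;
	}

	/* Have the standbys in standby_slot_names confirmed up to flushed_lsn? */
	if (replication_active &&
		!StandbySlotsAreCaughtup(flushed_lsn, WARNING))
	{
		*wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
		return true;
	}

	*wait_event = 0;
	return false;				/* already caught up; no need to wait */
}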

-- 
With Regards,
Amit Kapila.




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-02-29 Thread Alvaro Herrera
On 2024-Feb-29, Kyotaro Horiguchi wrote:

> At Tue, 27 Feb 2024 18:33:18 +0100, Alvaro Herrera  
> wrote in 
> > Here's the complete set, with these two names using the singular.
> 
> The commit by the second patch added several GUC descriptions:
> 
> > Sets the size of the dedicated buffer pool used for the commit timestamp 
> > cache.
> 
> Some of them, commit_timestamp_buffers, transaction_buffers,
> subtransaction_buffers use 0 to mean auto-tuning based on
> shared-buffer size. I think it's worth adding an extra_desc such as "0
> to automatically determine this value based on the shared buffer
> size".

How about this?

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La victoria es para quien se atreve a estar solo"
From d0d7216eb4e2e2e9e71aa849cf90c218bbe2b164 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 29 Feb 2024 11:45:31 +0100
Subject: [PATCH] extra_desc

---
 src/backend/utils/misc/guc_tables.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 93ded31ed9..543a87c659 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2287,7 +2287,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"commit_timestamp_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the commit timestamp cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
		&commit_timestamp_buffers,
@@ -2342,7 +2342,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"subtransaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the sub-transaction cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
		&subtransaction_buffers,
@@ -2353,7 +2353,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"transaction_buffers", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the size of the dedicated buffer pool used for the transaction status cache."),
-			NULL,
+			gettext_noop("Specify 0 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_BLOCKS
 		},
		&transaction_buffers,
@@ -2868,7 +2868,7 @@ struct config_int ConfigureNamesInt[] =
 	{
 		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
 			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
-			NULL,
+			gettext_noop("Specify -1 to have this value determined as a fraction of shared_buffers."),
 			GUC_UNIT_XBLOCKS
 		},
		&XLOGbuffers,
-- 
2.39.2
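
For reference, the new text would surface in the extra_desc column of
pg_settings, e.g.:

SELECT name, extra_desc
FROM pg_settings
WHERE name IN ('commit_timestamp_buffers', 'subtransaction_buffers',
               'transaction_buffers', 'wal_buffers');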



Re: Support a wildcard in backtrace_functions

2024-02-29 Thread Alvaro Herrera
On 2024-Feb-28, Jelte Fennema-Nio wrote:

> diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
> index 699d9d0a241..553e4785520 100644
> --- a/src/backend/utils/error/elog.c
> +++ b/src/backend/utils/error/elog.c
> @@ -843,6 +843,8 @@ matches_backtrace_functions(const char *funcname)
>   if (*p == '\0') /* end of 
> backtrace_function_list */
>   break;
>  
> + if (strcmp("*", p) == 0)
> + return true;
>   if (strcmp(funcname, p) == 0)
>   return true;
>   p += strlen(p) + 1;

Hmm, so if I write "foo,*" this will work, but it will check all function
names first and only match on the second entry.  But if I write "foo*" the GUC value will
be accepted but match nothing (as will "*foo" or "foo*bar").  I don't
like either of these behaviors.  I think we should tighten this up: an
asterisk should be allowed only if it appears alone in the string
(short-circuiting check_backtrace_functions before strspn); and let's
leave the strspn() call alone.
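
A sketch of that tightened check (illustrative; the usual GUC check-hook
shape, short-circuiting before the existing validation):

static bool
check_backtrace_functions(char **newval, void **extra, GucSource source)
{
	/* A lone "*" is accepted as-is and matches every function. */
	if (strcmp(*newval, "*") == 0)
		return true;

	/* Anywhere else, an asterisk is rejected outright. */
	if (strchr(*newval, '*') != NULL)
	{
		GUC_check_errdetail("\"*\" is only allowed by itself.");
		return false;
	}

	/* ... existing strspn()-based validation of the name list ... */
	return true;
}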

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
Bob [Floyd] used to say that he was planning to get a Ph.D. by the "green
stamp method," namely by saving envelopes addressed to him as 'Dr. Floyd'.
After collecting 500 such letters, he mused, a university somewhere in
Arizona would probably grant him a degree.  (Don Knuth)




type cache cleanup improvements

2024-02-29 Thread Teodor Sigaev

Hi!

I'd like to suggest two independent patches to improve the performance of type
cache cleanup. I found a case where type cache cleanup was the reason for poor
performance: in short, a customer creates 100 thousand temporary tables in one
transaction.
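
Roughly this shape of workload, for reference (a sketch; the attached
custom_types_and_array.sql also involves custom types and arrays):

BEGIN;
DO $$
BEGIN
    FOR i IN 1..100000 LOOP
        EXECUTE format('CREATE TEMP TABLE tmp_%s (a int)', i);
    END LOOP;
END
$$;
ROLLBACK;  -- this is the rollback whose timing is quoted below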


1 mapRelType.patch
  It just adds a local map between a relation and its type, as suggested in the
comment above TypeCacheRelCallback(). Unfortunately, using the syscache here
was impossible because this callback can be called outside a transaction, which
makes catalog lookups impossible.


2 hash_seq_init_with_hash_value.patch
 TypeCacheTypCallback() loops over the type hash to find the entry with a given
hash value. There are two problems here: 1) dynahash has no interface to search
for an entry by hash value, and 2) the hash value calculation algorithm differs
from the one used by the system cache, even though the incoming hash value
comes from the system cache. The patch addresses both issues. It adds a
hash_seq_init_with_hash_value() call, which initializes a hash sequential scan
over the single bucket that could contain an entry with the given hash value,
so that hash_seq_search() iterates only over such entries. The patch also
changes the hash algorithm to match the syscache. Along the way, it makes a
small refactoring of dynahash, introducing a common function hash_do_lookup()
that does the initial lookup in the hash.


An artificial performance test is attached; the command to run it is 'time psql
< custom_types_and_array.sql'. Here I show only the last rollback time and the
total execution time:

1) master 92d2ab7554f92b841ea71bcc72eaa8ab11aae662
Time: 33353,288 ms (00:33,353)
psql < custom_types_and_array.sql  0,82s user 0,71s system 1% cpu 1:28,36 total

2) mapRelType.patch
Time: 7455,581 ms (00:07,456)
psql < custom_types_and_array.sql  1,39s user 1,19s system 6% cpu 41,220 total

3) hash_seq_init_with_hash_value.patch
Time: 24975,886 ms (00:24,976)
psql < custom_types_and_array.sql  1,33s user 1,25s system 3% cpu 1:19,77 total

4) both
Time: 89,446 ms
psql < custom_types_and_array.sql  0,72s user 0,52s system 10% cpu 12,137 total


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/

diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index f411e33b8e7..49e27ca8ba4 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -331,6 +331,14 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc);
 static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc,
    uint32 typmod);
 
+/*
+ * Hash function compatible with one-arg system cache hash function.
+ */
+static uint32
+type_cache_syshash(const void *key, Size keysize)
+{
+	return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key));
+}
 
 /*
  * lookup_type_cache
@@ -356,8 +364,16 @@ lookup_type_cache(Oid type_id, int flags)
 
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TypeCacheEntry);
+		/*
+		 * Hash function must be compatible to make possible work
+		 * hash_seq_init_with_hash_value(). Hash value in TypeEntry is taken
+		 * from system cache and we use the same (compatible) to use it's hash
+		 * value to speedup search by hash value instead of scanning whole hash
+		 */
+		ctl.hash = type_cache_syshash;
+
 		TypeCacheHash = hash_create("Type information cache", 64,
-	&ctl, HASH_ELEM | HASH_BLOBS);
+	&ctl, HASH_ELEM | HASH_FUNCTION);
 
 		/* Also set up callbacks for SI invalidations */
 		CacheRegisterRelcacheCallback(TypeCacheRelCallback, (Datum) 0);
@@ -408,8 +424,7 @@ lookup_type_cache(Oid type_id, int flags)
 
 		/* These fields can never change, by definition */
 		typentry->type_id = type_id;
-		typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID,
-	   ObjectIdGetDatum(type_id));
+		typentry->type_id_hash = get_hash_value(TypeCacheHash, &type_id);
 
 		/* Keep this part in sync with the code below */
 		typentry->typlen = typtup->typlen;
@@ -2359,20 +2374,21 @@ TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue)
 	TypeCacheEntry *typentry;
 
 	/* TypeCacheHash must exist, else this callback wouldn't be registered */
-	hash_seq_init(&status, TypeCacheHash);
+	if (hashvalue == 0)
+		hash_seq_init(&status, TypeCacheHash);
+	else
+		hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue);
+
 	while ((typentry = (TypeCacheEntry *) hash_seq_search(&status)) != NULL)
 	{
-		/* Is this the targeted type row (or it's a total cache flush)? */
-		if (hashvalue == 0 || typentry->type_id_hash == hashvalue)
-		{
-			/*
-			 * Mark the data obtained directly from pg_type as invalid.  Also,
-			 * if it's a domain, typnotnull might've changed, so we'll need to
-			 * recalculate its constraints.
-			 */
-			typentry->flags &= ~(TCFLAGS_HAVE_PG_TYPE_DATA |
- TCFLAGS_CHECKED_DOMAIN_CONSTRAINTS);
-		}
+		Assert(hashvalue == 0 || typentry->type_id_hash == hashvalue);
+		/*
+		 * Mark the data obtained directly from pg_type as invalid. 

Re: initdb's -c option behaves wrong way?

2024-02-29 Thread Daniel Gustafsson
> On 22 Jan 2024, at 11:09, Daniel Gustafsson  wrote:

> Feel free to go ahead with that
> version, or let me know if you want me to deal with it.

I took the liberty to add this to the upcoming CF to make sure we don't lose
track of it.

--
Daniel Gustafsson





Re: Injection points: some tools to wait and wake

2024-02-29 Thread Bertrand Drouvot
Hi,

On Thu, Feb 29, 2024 at 02:34:35PM +0500, Andrey M. Borodin wrote:
> 
> 
> > On 29 Feb 2024, at 13:35, Bertrand Drouvot  
> > wrote:
> > 
> > Something like the attached?
> 
> There's extraneous print "done\n";

doh! bad copy/paste ;-) 

> Also can we have optional backend type :)

done in v2 attached.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From a2ece828e93e6e919243e8d2529b33d48d05f084 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 29 Feb 2024 08:31:28 +
Subject: [PATCH v2] Adding wait_for_injection_point helper

---
 src/test/perl/PostgreSQL/Test/Cluster.pm  | 25 +++
 .../recovery/t/041_checkpoint_at_promote.pl   |  8 +-
 2 files changed, 26 insertions(+), 7 deletions(-)
  59.8% src/test/perl/PostgreSQL/Test/
  40.1% src/test/recovery/t/

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 44c1bb5afd..559d3febca 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2742,6 +2742,31 @@ sub lsn
 
 =pod
 
+=item $node->wait_for_injection_point(injection_name, backend_type)
+
+Wait for the backend_type to wait for the injection point name.
+
+=cut
+
+sub wait_for_injection_point
+{
+	my ($self, $injection_name, $backend_type) = @_;
+
+	$backend_type = defined($backend_type) ? qq('$backend_type') : 'backend_type';
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT count(*) > 0 FROM pg_stat_activity
+		WHERE backend_type = $backend_type AND wait_event = '$injection_name'
+	])
+	  or die
+	  "timed out waiting for the backend type to wait for the injection point name";
+
+	return;
+}
+
+=pod
+
 =item $node->wait_for_catchup(standby_name, mode, target_lsn)
 
 Wait for the replication connection with application_name standby_name until
diff --git a/src/test/recovery/t/041_checkpoint_at_promote.pl b/src/test/recovery/t/041_checkpoint_at_promote.pl
index 47381a2c82..3d6faabc0b 100644
--- a/src/test/recovery/t/041_checkpoint_at_promote.pl
+++ b/src/test/recovery/t/041_checkpoint_at_promote.pl
@@ -79,13 +79,7 @@ $node_primary->wait_for_replay_catchup($node_standby);
 # Wait until the checkpointer is in the middle of the restart point
 # processing, relying on the custom wait event generated in the
 # wait callback used in the injection point previously attached.
-ok( $node_standby->poll_query_until(
-		'postgres',
-		qq[SELECT count(*) FROM pg_stat_activity
-   WHERE backend_type = 'checkpointer' AND wait_event = 'CreateRestartPoint' ;],
-		'1'),
-	'checkpointer is waiting in restart point'
-) or die "Timed out while waiting for checkpointer to run restart point";
+$node_standby->wait_for_injection_point('CreateRestartPoint','checkpointer');
 
 # Check the logs that the restart point has started on standby.  This is
 # optional, but let's be sure.
-- 
2.34.1



Re: Support a wildcard in backtrace_functions

2024-02-29 Thread Daniel Gustafsson
> On 28 Feb 2024, at 19:50, Jelte Fennema-Nio  wrote:
> 
> On Wed, 28 Feb 2024 at 19:04, Daniel Gustafsson  wrote:
>> This should be "equal to or higher" right?
> 
> Correct, nicely spotted. Fixed that. I also updated the docs for the
> new backtrace_functions_min_level GUC itself too, as well as creating
> a dedicated options array for the GUC. Because when updating its docs
> I realized that none of the existing level arrays matched what we
> wanted.

Looks good, I'm marking this ready for committer for now.  I just have a few
small comments:

+A single * character is interpreted as a wildcard 
and
+will cause all errors in the log to contain backtraces.
This should mention that it's all errors matching the level (or higher) of the
newly introduced GUC.


+   gettext_noop("Sets the message levels that create backtraces when 
backtrace_functions is configured"),
I think we should add the same "Each level includes.." long_desc, and the
short_desc should end with a period.


+   
+Backtrace support is not available on all platforms, and the quality
+of the backtraces depends on compilation options.
+   
I don't think we need to duplicate this para here; having it on
backtrace_functions suffices.

--
Daniel Gustafsson





Re: Synchronizing slots from primary to standby

2024-02-29 Thread Amit Kapila
On Thu, Feb 29, 2024 at 9:13 AM Peter Smith  wrote:
>
> On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila  wrote:
> >
>
> > Also, adding wait sounds
> > more like a boolean. So, I don't see the proposed names any better
> > than the current one.
> >
>
> Anyway, the point is that the current GUC name 'standby_slot_names' is
> not ideal IMO because it doesn't have enough meaning by itself -- e.g.
> you have to read the accompanying comment or documentation to have any
> idea of its purpose.
>

Yeah, one has to read the description but that is true for other
parameters like "temp_tablespaces". I don't have any better ideas but
open to suggestions.

-- 
With Regards,
Amit Kapila.




Re: Remove AIX Support (was: Re: Relation bulk write facility)

2024-02-29 Thread Daniel Gustafsson
> On 29 Feb 2024, at 10:24, Michael Banck  wrote:
> On Thu, Feb 29, 2024 at 12:57:31AM -0800, Andres Freund wrote:

>> Then these users should have paid somebody to actually do maintenance work on
>> the AIX support, so it doesn't regularly stand in the way of implementing
>> various things.
> 
> Right, absolutely.
> 
> But: did we ever tell them to do that? I don't think it's reasonable for
> them to expect to follow -hackers and jump in when somebody grumbles
> about AIX being a burden somewhere deep down a thread...

Having spent a fair bit of time within open source projects that companies rely
on, my experience is that those companies who need to hear such news have zero
interaction with the project and most of the time don't even know the project
community exists.  That conversely also means that the project doesn't know they
exist.  If their consultants and suppliers, who have a higher probability of
knowing this, haven't told them, then it's highly unlikely that anything we say
will get across.

--
Daniel Gustafsson




