Re: [PATCH] Small refactoring of inval.c and inval.h

2023-11-05 Thread Michael Paquier
On Tue, Jul 25, 2023 at 06:38:46PM +0300, Aleksander Alekseev wrote:
> Unless of course we want to change its signature too. I don't think
> this is going to be a good API change.

 extern void InvalidateSystemCaches(void);
-extern void InvalidateSystemCachesExtended(bool debug_discard);

Indeed, that looks a bit strange, but is there a strong need to remove
it, as you are proposing?  There is always a risk that this could be
called by some out-of-core code.  This is one of these things where we
could break something just for the sake of breaking it, so there is no
real benefit IMO.

>> As for the /*--- public functions ---*/ comment, that one was just not
>> moved by b89e151054a0, which should have done so; but even at that
>> point, it had already been somewhat broken by the addition of
>> PrepareInvalidationState() (a static function) in 6cb4afff33ba below it.
>> If we really want to clean this up, we could move PrepareInvalidationState()
>> to just below RegisterSnapshotInvalidation() and move the "public
>> functions" header to just below it (so that it appears just before
>> InvalidateSystemCaches).
>>
>> If we just remove the "public functions" comment, then we're still
>> inside a section that starts with the "private support functions"
>> comment, which seems even worse to me than the current situation.
> 
> Fixed.

This part looks OK to me, so I'll see about applying this part of the
patch.
--
Michael


signature.asc
Description: PGP signature


Re: Compiling warnings on old GCC

2023-11-05 Thread Richard Guo
On Mon, Nov 6, 2023 at 2:51 PM David Rowley  wrote:

> On Mon, 6 Nov 2023 at 19:14, Richard Guo  wrote:
> > I came across the following compiling warnings on GCC (Red Hat 4.8.5-44)
> > 4.8.5 with 'CFLAGS=-Og'
>
> > I wonder if this is worth fixing, maybe by a trivial patch like
> > attached.
>
> There's some relevant discussion in
> https://postgr.es/m/flat/20220602024243.GJ29853%40telsasoft.com


Ah, thanks for pointing that out.  Somehow I failed to follow that
thread.

It seems that the controversial '-Og' coupled with the old GCC version
(4.8) makes it not worth fixing.  So please ignore this thread.

Thanks
Richard


Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-05 Thread Andrey M. Borodin


> On 6 Nov 2023, at 09:09, Dilip Kumar  wrote:
> 
> 
>> Having hashtable to find SLRU page in the buffer IMV is too slow. Some 
>> comments on this approach can be found here [0].
>> I'm OK with having HTAB for that if we are sure performance does not degrade 
>> significantly, but I really doubt this is the case.
>> I even think SLRU buffers used HTAB in some ancient times, but I could not 
>> find commit when it was changed to linear search.
> 
> The main intention of having this buffer mapping hash is to find the
> SLRU page faster than a sequential search when banks are relatively
> large, but if we find cases where having the hash creates more
> overhead than it provides gain then I am fine with removing the hash,
> because the whole purpose of adding the hash here is to make the
> lookup faster.  So far in my tests I did not find any slowness.  Do
> you or anyone else have any test case, based on the previous
> research, that shows it creates any slowness?
PFA the test benchmark_slru_page_readonly().  In this test we run
SimpleLruReadPage_ReadOnly(), an essential part of TransactionIdGetStatus().
Before introducing the HTAB for buffer mapping I get
Time: 14837.851 ms (00:14.838)
and with the buffer HTAB I get
Time: 22723.243 ms (00:22.723)

This hash table makes getting transaction status ~50% slower.

Benchmark script I used:
make -C $HOME/postgresMX -j 8 install && (pkill -9 postgres; rm -rf test; 
./initdb test && echo "shared_preload_libraries = 'test_slru'">> 
test/postgresql.conf && ./pg_ctl -D test start && ./psql -c 'create extension 
test_slru' postgres && ./pg_ctl -D test restart && ./psql -c "SELECT 
count(test_slru_page_write(a, 'Test SLRU'))
  FROM generate_series(12346, 12393, 1) as a;" -c '\timing' -c "SELECT 
benchmark_slru_page_readonly(12377);" postgres)

> 
>> Maybe we could decouple locks and counters from SLRU banks? Banks were meant 
>> to be small to exploit performance of local linear search. Lock partitions 
>> have to be bigger for sure.
> 
> Yeah, that could also be an idea if we plan to drop the hash.  I mean
> the bank-wise counter is fine as we are finding a victim buffer
> within a bank itself, but each lock could cover more slots than one
> bank, or in other words, it could protect multiple banks.  Let's hear
> more opinions on this.
+1
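The geometry under discussion — small banks searched linearly, with lock partitions decoupled from bank size — can be sketched roughly as below. All names and sizes here are invented for illustration; they are not the patch's actual values.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the layout under discussion: the SLRU buffer pool is split
 * into small banks that are searched linearly, while a smaller number of
 * lock partitions each protect several banks.  All names and sizes here
 * are invented for illustration; they are not the patch's actual values. */
#define BANK_SIZE    8               /* slots per bank: kept small so the
                                      * linear search stays cheap */
#define N_BANKS      16
#define N_LOCK_PARTS 4               /* each lock covers several banks */

static int bank_for_page(uint32_t pageno)
{
    return pageno % N_BANKS;         /* the only bank that may cache it */
}

static int lock_for_bank(int bank)
{
    return bank % N_LOCK_PARTS;      /* decoupled from the bank size */
}

/* Linear search within a single bank: at most BANK_SIZE probes. */
static int find_slot(const uint32_t page_numbers[], uint32_t pageno)
{
    int start = bank_for_page(pageno) * BANK_SIZE;

    for (int slot = start; slot < start + BANK_SIZE; slot++)
        if (page_numbers[slot] == pageno)
            return slot;
    return -1;                       /* miss: caller reads the page in */
}
```

With this shape, making the locks coarser is just a matter of lowering N_LOCK_PARTS without touching BANK_SIZE, which is the decoupling being proposed.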

> 
>> 
>> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>> 
>> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>> 
>> 
>> BTW can you please describe in more detail what kind of bugs?
> 
> Yeah, actually that patch was using the same GUC
> (multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
> well as for  MultiXactMemberCtl, see the below patch snippet from the
> original patch.

Ouch. We were running this for several years with this bug... Thanks!


Best regards, Andrey Borodin.



0001-Implement-benchmark_slru_page_readonly-to-assess-SLR.patch
Description: Binary data




Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

2023-11-05 Thread Michael Paquier
On Thu, Nov 02, 2023 at 11:03:35AM +0900, Kyotaro Horiguchi wrote:
> At Wed, 1 Nov 2023 08:39:17 +0900, Michael Paquier  
> wrote in 
>> See in StartupXLOG(), around the comment "complain if we did not roll
>> forward far enough to reach".  This complains if archive recovery has
>> been requested *or* if we retrieved a backup end LSN from the
>> backup_label.
> 
> Please note that backupStartPoint is not reset even when reaching the
> backup end point during crash recovery. If backup_label enforces
> archive recovery, I think this point won't be an issue as you
> mentioned. For the record, my earlier proposal aimed to detect
> reaching the end point even during crash recovery.

Good point.  Not doing ReachedEndOfBackup() at the end of crash
recovery feels inconsistent, especially since we care about some of
these fields in this case.

If a .signal file is required when we read a backup_label, yes that
would not be a problem, because we'd always link backupEndPoint's
destiny with a requested archive recovery, but there seems to be little
love for enforcing that based on the feedback of this thread, so..
--
Michael




Re: Compiling warnings on old GCC

2023-11-05 Thread David Rowley
On Mon, 6 Nov 2023 at 19:14, Richard Guo  wrote:
> I came across the following compiling warnings on GCC (Red Hat 4.8.5-44)
> 4.8.5 with 'CFLAGS=-Og'

> I wonder if this is worth fixing, maybe by a trivial patch like
> attached.

There's some relevant discussion in
https://postgr.es/m/flat/20220602024243.GJ29853%40telsasoft.com

David




RE: Is this a problem in GenericXLogFinish()?

2023-11-05 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Dear Amit, Michael,
> 
> Thanks for making the patch!
> 
> > Why register wbuf at all if there are no tuples to add and it is not
> > the same as bucketbuf? Also, I think this won't be correct if prevbuf
> > and wrtbuf are the same and also we have no tuples to add to wbuf. I
> > have attached a naive and crude way to achieve it. This needs more
> > work both in terms of trying to find a better way to change the code
> > or ensure this won't break any existing case. I have just run the
> > existing tests. Such a fix certainly required more testing.
> 
> I'm verifying the idea (currently it seems OK), but at least there is a 
> coding error -
> wbuf_flags should be uint8 here. PSA the fixed patch.

Here is a new patch which is a bit refactored.  It does the following:

* The if-conditions in _hash_freeovflpage() were swapped.
* Based on the above, an Assert(xlrec.ntups == 0) was added.
* A condition in hash_xlog_squeeze_page() was adjusted to follow the change as well.
* Comments were adjusted.

Next we should add some test code.  I will continue considering, but
please post anything if you have ideas.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_hash_squeeze_wal_3.patch
Description: fix_hash_squeeze_wal_3.patch


Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread Michael Paquier
On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:
> Rebased on 151ffcf6.

I like this patch a lot.  Even if the backup_label file is removed, we
still have all the debug information from the backup history file,
thanks to its LABEL, BACKUP METHOD and BACKUP FROM fields, so no
information is lost.  It does a 1:1 replacement of the backup_label
contents needed by recovery, fetching them from the control file
instead.  Sounds like a straightforward change to me.

The patch is failing the recovery test 039_end_of_wal.pl.  Could you
look at the failure?

 /* Build and save the contents of the backup history file */
-history_file = build_backup_content(state, true);
+history_file = build_backup_content(state);

build_backup_content() sounds like an incorrect name if it is a
routine only used to build the contents of backup history files.

Why is there nothing updated in src/bin/pg_controldata/?

+/* Clear fields used to initialize recovery */
+ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+ControlFile->backupStartPointTLI = 0;
+ControlFile->backupRecoveryRequired = false;
+ControlFile->backupFromStandby = false;

These variables in the control file are cleaned up when the
backup_label file is read, but backup_label is renamed to
backup_label.old a bit later than that.  Your logic looks correct seen
from here, but shouldn't these variables be set much later, i.e. just
*after* UpdateControlFile()?  This gap between the initialization of
the control file and the in-memory reset makes the code quite brittle,
IMO.

-   basebackup_progress_wait_wal_archive();
-   do_pg_backup_stop(backup_state, !opt->nowait);

Why is that moved?

-The backup label
-file includes the label string you gave to 
pg_backup_start,
-as well as the time at which pg_backup_start was run, 
and
-the name of the starting WAL file.  In case of confusion it is therefore
-possible to look inside a backup file and determine exactly which
-backup session the dump file came from.  The tablespace map file includes
+The tablespace map file includes

It may be worth mentioning that the backup history file holds this
information on the primary's pg_wal, as well.

The changes in sendFileWithContent() may be worth a patch of its own.

--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
@@ -160,14 +163,25 @@ typedef struct ControlFileData
 XLogRecPtrminRecoveryPoint;
 TimeLineIDminRecoveryPointTLI;
+XLogRecPtrbackupCheckPoint;
 XLogRecPtrbackupStartPoint;
+TimeLineIDbackupStartPointTLI;
 XLogRecPtrbackupEndPoint;
+bool backupRecoveryRequired;
+bool backupFromStandby;

This increases the size of the control file from 296B to 312B with an
8-byte alignment, as far as I can see.  The size of the control file
has been always a sensitive subject especially with the hard limit of
PG_CONTROL_MAX_SAFE_SIZE.  Well, the point of this patch is that this
is the price to pay to prevent users from doing something stupid with
a removal of a backup_label when they should not.  Do others have an
opinion about this increase in size?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
further reduce the size with some alignment magic, no?
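For what it's worth, the alignment effect is real on common LP64 ABIs; a small illustration with stand-in types (not the actual ControlFileData layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration of the alignment point on a typical LP64 ABI: a lone 4-byte
 * TimeLineID next to an 8-byte XLogRecPtr costs 4 bytes of padding, while
 * grouping two TLI fields packs them into a single aligned 8-byte slot.
 * These structs are stand-ins, not the actual ControlFileData layout. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

struct Scattered
{
    XLogRecPtr  minRecoveryPoint;
    TimeLineID  minRecoveryPointTLI;    /* 4 bytes + 4 bytes padding */
    XLogRecPtr  backupStartPoint;
    TimeLineID  backupStartPointTLI;    /* 4 bytes + 4 bytes padding */
};

struct Grouped
{
    XLogRecPtr  minRecoveryPoint;
    XLogRecPtr  backupStartPoint;
    TimeLineID  minRecoveryPointTLI;    /* the two 4-byte fields now */
    TimeLineID  backupStartPointTLI;    /* share one 8-byte slot */
};
```

On x86-64 and aarch64 the scattered layout is 32 bytes while the grouped one is 24, which is the kind of saving the grouping would buy in the control file.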

-   /*
-* BACKUP METHOD lets us know if this was a typical backup ("streamed",
-* which could mean either pg_basebackup or the pg_backup_start/stop
-* method was used) or if this label came from somewhere else (the only
-* other option today being from pg_rewind).  If this was a streamed
-* backup then we know that we need to play through until we get to the
-* end of the WAL which was generated during the backup (at which point 
we
-* will have reached consistency and backupEndRequired will be reset to 
be
-* false).
-*/
-   if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
-   {
-   if (strcmp(backuptype, "streamed") == 0)
-   *backupEndRequired = true;
-   }

backupRecoveryRequired in the control file is switched to false for
pg_rewind and true for streamed backups.  My gut feeling is telling me
that this should be OK, as out-of-core tools would need an upgrade if
they relied on the backup_label file anyway.  I can see that this
change makes us lose some documentation, unfortunately.  Shouldn't
these removed lines be moved to pg_control.h instead, as part of the
description of backupEndRequired?

doc/src/sgml/ref/pg_rewind.sgml and
src/backend/access/transam/xlogrecovery.c still include references to
the backup_label file.
--
Michael




Re: RFC: Logging plan of the running query

2023-11-05 Thread Ashutosh Bapat
On Fri, Nov 3, 2023 at 7:31 PM Ashutosh Bapat
 wrote:
>
> I have following questions related to the functionality. (Please point
> me to the relevant discussion if this has been already discussed.)
>
> 1. When a backend is running nested queries, we will see the plan of
> the innermost query. That query may not be the actual culprit if the
> user query is running slowly, e.g. a query being run as part of the
> inner side of a nested loop when the nested loop itself is the
> bottleneck. I think it will be useful to print the plans of the whole
> query stack.

To further explain this point, consider the following scenario.

-- A completely useless function which executes two SQL statements
-- that each take a small amount of time individually
#create function multstmt() returns int
language sql as $$
select count(*) from pg_class where reltype = 12345;
select count(*) from pg_type limit 10;
$$;

-- force a suboptimal plan
#set enable_hashjoin to false;
#set enable_mergejoin to false;

-- A completely useless but long running query
#select c.oid, t.oid from pg_class c, pg_type t where multstmt(c.oid)
= multstmt(t.oid);

This takes a few minutes on my laptop.

In a separate session, query pg_stat_activity; we will see the original query:
#select pid, query, backend_type from pg_stat_activity where pid = 349330;
  pid   |                                          query                                           |  backend_type
--------+------------------------------------------------------------------------------------------+----------------
 349330 | select c.oid, t.oid from pg_class c, pg_type t where multstmt(c.oid) = multstmt(t.oid); | client backend
(1 row)

Run the plan output function a few times
#select pg_log_query_plan(349330);

You will observe different plans depending on which innermost query is
running:
LOG:  query plan running on backend with PID 349330 is:
Query Text:
select count(*) from pg_class where reltype = typeid;
select count(*) from pg_type where oid = typeid;

Query Parameters: $1 = '600'
Aggregate  (cost=18.18..18.19 rows=1 width=8)
  Output: count(*)
  ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..18.18 rows=2 width=0)
Output: oid, relname, relnamespace, reltype,
reloftype, relowner, relam, relfilenode, reltablespace, relpages,
reltuples, relallvisible, reltoastrelid, relhasindex, relisshared,
relpersistence, relkind, relnatts, relchecks, relhasrules,
relhastriggers, relhassubclass, relrowsecurity, relforcerowsecurity,
relispopulated, relreplident, relispartition, relrewrite,
relfrozenxid, relminmxid, relacl, reloptions, relpartbound
Filter: (pg_class.reltype = $1)
Settings: enable_hashjoin = 'off', enable_mergejoin = 'off'
2023-11-06 11:52:25.610 IST [349330] LOG:  query plan running on
backend with PID 349330 is:
Query Text:
select count(*) from pg_class where reltype = typeid;
select count(*) from pg_type where oid = typeid;

Query Parameters: $1 = '2203'
Aggregate  (cost=4.30..4.31 rows=1 width=4)
  Output: count(*)
  ->  Index Only Scan using pg_type_oid_index on
pg_catalog.pg_type  (cost=0.28..4.29 rows=1 width=0)
Output: oid
Index Cond: (pg_type.oid = $1)
Settings: enable_hashjoin = 'off', enable_mergejoin = 'off'

The individual pieces are confusing here, since the query run by the
backend is not the one being EXPLAINed.  A user may not know that the
queries being EXPLAINed arise from the function call multstmt(), so
they won't be able to stitch the pieces together unless they see the
plan of the outermost query with loops and costs.  What might help is
explaining each query in the hierarchy.

I think we can start with what auto_explain is doing. Always print the
plan of the outermost query; the query found in pg_stat_activity. In a
later version we might find a way to print plans of all the queries in
the stack and do so in a readable manner.

This makes tracking activeQueryDesc a bit tricky.  My guess is that the
outermost query's descriptor will be available through ActivePortal
most of the time.  But there are cases when ExecutorRun is called by
passing a queryDesc directly, so maybe there are some cases where
that's not true.

-- 
Best Wishes,
Ashutosh Bapat




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-11-05 Thread Shlok Kyal
Hi,

On Thu, 12 Oct 2023 at 14:32, Anthonin Bonnefoy
 wrote:
>
> Hi!
>
> I've made a new batch of changes and improvements.
> New features:
> - Triggers are now correctly supported. They were not correctly
> attached to the ExecutorFinish Span before.
> - Additional configuration: exporting parameter values in span
> metadata can be disabled. Deparsing can also be disabled now.
> - Dbid and userid are now exported in span's metadata
> - New Notify channel and threshold parameters. This channel will
> receive a notification when the span buffer usage crosses the
> threshold.
> - Explicit transaction with extended protocol is now correctly
> handled. This is done by keeping 2 different trace contexts, one for
> parser/planner trace context at the root level and the other for
> nested queries and executors. The spans now correctly show the parse
> of the next statement happening before the ExecutorEnd of the previous
> statement (see screenshot).
>
> Changes:
> - Parse span is now outside of the top span. When multiple statements
> are processed, they are parsed together so it didn't make sense to
> attach Parse to a specific statement.
> - Deparse info is now exported in its own column. This will leave the
> possibility to the trace forwarder to either use it as metadata or put
> it in the span name.
> - Renaming: Spans now have a span_type (Select, Executor, Parser...)
> and a span_operation (ExecutorRun, Select $1...)
> - For spans created without propagated trace context, the same traceid
> will be used for statements within the same transaction.
> - pg_tracing_consume_spans and pg_tracing_peek_spans are now
> restricted to users with pg_read_all_stats role
> - In instrument.h, instr->firsttime is only set if timer was requested
>
> The code should be more stable from now on. Most of the features I had
> planned are implemented.
> I've also started writing the module's documentation. It's still a bit
> bare but I will be working on completing this.

In CFbot, I see that the build is failing for this patch (link:
https://cirrus-ci.com/task/5378223450619904?logs=build#L1532) with the
following error:
[07:58:05.037] In file included from ../src/include/executor/instrument.h:16,
[07:58:05.037]                  from ../src/include/jit/jit.h:14,
[07:58:05.037]                  from ../contrib/pg_tracing/span.h:16,
[07:58:05.037]                  from ../contrib/pg_tracing/pg_tracing_explain.h:4,
[07:58:05.037]                  from ../contrib/pg_tracing/pg_tracing.c:43:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c: In function ‘add_node_counters’:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:70: error: ‘BufferUsage’ has no member named ‘blk_read_time’; did you mean ‘temp_blk_read_time’?
[07:58:05.037] 2330 | blk_read_time = INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_read_time);
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note: in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:18: note: in expansion of macro ‘INSTR_TIME_GET_MILLISEC’
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:71: error: ‘BufferUsage’ has no member named ‘blk_write_time’; did you mean ‘temp_blk_write_time’?
[07:58:05.037] 2331 | blk_write_time = INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_write_time);
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note: in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:19: note: in expansion of macro ‘INSTR_TIME_GET_MILLISEC’

I also tried to build this patch locally and the build is failing with
the same error.

Thanks
Shlok Kumar Kyal




Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-05 Thread Bharath Rupireddy
On Sat, Nov 4, 2023 at 7:19 AM Andres Freund  wrote:
>
> On 2023-11-03 00:55:00 +0530, Bharath Rupireddy wrote:
> > On Wed, Nov 1, 2023 at 4:24 AM Michael Paquier  wrote:
> > >
> > > On Tue, Oct 31, 2023 at 04:26:18PM +0900, torikoshia wrote:
> > > > Yes, calling pg_stat_reset_shared() for all stats types can do what I 
> > > > wanted
> > > > to do.
> > > > But calling it with 6 different parameters seems tiresome and I thought 
> > > > it
> > > > would be convenient to have a parameter to delete all cluster-wide
> > > > statistics at once.
> > > > I may be wrong, but I imagine that it's more common to want to delete 
> > > > all of
> > > > the statistics for an entire cluster rather than just a portion of it.
>
> Yes, agreed. However, I'd suggest adding pg_stat_reset_shared(), instead of
> pg_stat_reset_shared('all') for this purpose.

An overloaded function seems a better choice than another target
'all'. I'm all +1 for it as others seem to concur with the idea of
having something to reset all shared stats.

> > > If more flexibility is wanted in this function, could it be an option
> > > to consider a flavor like pg_stat_reset_shared(text[]), where it is
> > > possible to specify a list of shared stats types to reset?  Perhaps
> > > there are no real use cases for it, just wanted to mention it anyway
> > > regarding the fact that it could have benefits to refactor this code
> > > to use a bitwise operator for its internals with bit flags for each
> > > type.
>
> I don't think there is much point in such an API - if the caller actually
> wants to delete individual stats, it's not too hard to loop.

-1 for text[] version.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Compiling warnings on old GCC

2023-11-05 Thread Richard Guo
I came across the following compiling warnings on GCC (Red Hat 4.8.5-44)
4.8.5 with 'CFLAGS=-Og'

be-fsstubs.c: In function ‘be_lo_export’:
be-fsstubs.c:537:24: warning: ‘fd’ may be used uninitialized in this
function [-Wmaybe-uninitialized]
  if (CloseTransientFile(fd) != 0)
^
In file included from trigger.c:14:0:
trigger.c: In function ‘ExecCallTriggerFunc’:
../../../src/include/postgres.h:314:2: warning: ‘result’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
  return (Pointer) X;
  ^
trigger.c:2316:9: note: ‘result’ was declared here
  Datum  result;
 ^

I wonder if this is worth fixing, maybe by a trivial patch like
attached.
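For context, the class of fix such a trivial patch makes is just giving the flagged variable a dummy initial value so the old compiler can see a definition on every path. A generic illustration of the pattern (a stand-in example, not the attached patch itself):

```c
#include <assert.h>

/* Generic illustration of the -Wmaybe-uninitialized pattern that old GCC
 * at -Og trips over: the compiler cannot prove 'fd' is assigned on every
 * path that reaches its use, so giving it a dummy initial value silences
 * the warning without changing behavior.  This is a stand-in example,
 * not the attached patch itself. */
static int open_transient_file(int should_open)
{
    return should_open ? 42 : -1;   /* pretend 42 is a file descriptor */
}

static int export_object(int should_open)
{
    int fd = -1;                    /* dummy init silences the warning */

    if (should_open)
        fd = open_transient_file(should_open);

    /* old GCC cannot see that fd is always defined by this point */
    if (fd >= 0)
        return fd;
    return -1;
}
```

The dummy initialization is dead on every real path, so it costs nothing at runtime; it only gives the flow analysis a definition it can see.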

Thanks
Richard


v1-0001-Silence-old-compiler-about-Wmaybe-uninitialized.patch
Description: Binary data


Re: Add new option 'all' to pg_stat_reset_shared()

2023-11-05 Thread torikoshia

Thanks all for the comments!

On Fri, Nov 3, 2023 at 5:17 AM Matthias van de Meent 
 wrote:

Knowing that your metrics have a shared starting point can be quite
valuable, as it allows you to do some math that would otherwise be
much less accurate when working with stats over a short amount of
time. I've not used these stats systems much myself, but skew between
metrics caused by different reset points can be difficult to detect
and debug, so I think an atomic call to reset all these stats could be
worth implementing.


Since each of these stats, except wal_prefetch, is reset while
acquiring an LWLock, the attached PoC patch makes the call atomic by
using those LWLocks.

If this is the right direction, I'll try to make wal_prefetch also
take an LWLock.
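The shape of such an atomic reset is: take every lock first in one fixed order, zero everything while holding them, then release. A toy sketch with plain flags standing in for LWLocks (not the PoC patch itself):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of an atomic multi-view reset: take every lock first (in one
 * fixed order, so concurrent multi-lock takers cannot deadlock), zero all
 * counters while holding them, then release in reverse order.  Plain
 * bools stand in for LWLocks; this is not the PoC patch itself. */
#define N_SHARED_STATS 4

static bool locked[N_SHARED_STATS];
static long counters[N_SHARED_STATS];

static void reset_all_shared_stats(void)
{
    for (int i = 0; i < N_SHARED_STATS; i++)    /* acquire in index order */
        locked[i] = true;

    for (int i = 0; i < N_SHARED_STATS; i++)    /* no reader can observe a
                                                 * partially-reset state */
        counters[i] = 0;

    for (int i = N_SHARED_STATS - 1; i >= 0; i--)
        locked[i] = false;
}
```

The fixed acquisition order is what makes it safe to hold several locks at once, which is the property a per-view sequence of resets lacks.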


On 2023-11-04 10:49, Andres Freund wrote:
> Yes, agreed. However, I'd suggest adding pg_stat_reset_shared(),
> instead of pg_stat_reset_shared('all') for this purpose.

In the attached PoC patch the shared statistics are reset by calling
pg_stat_reset_shared(), not pg_stat_reset_shared('all').


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation

From f0a5cc6ad6cc351adff9302c3e9b2a227a5d9cb4 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Mon, 6 Nov 2023 14:53:19 +0900
Subject: [PATCH v1] [WIP]Add ability to reset all statistics to
 pg_stat_reset_shared()

Currently pg_stat_reset_shared() requires its argument to specify
statistics to be reset, and there is no way to reset all statistics
at the same time.
This patch makes it possible when no argument is specified.

Note this patch is WIP and pg_stat_recovery_prefetch is not reset.
---
 doc/src/sgml/monitoring.sgml|  4 +-
 src/backend/utils/activity/pgstat.c | 75 +
 src/backend/utils/adt/pgstatfuncs.c | 10 
 src/include/catalog/pg_proc.dat |  4 ++
 src/include/pgstat.h|  1 +
 5 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e068f7e247..9d1e3bf4db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4716,7 +4716,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 void


-Resets some cluster-wide statistics counters to zero, depending on the
+Resets cluster-wide statistics counters to zero, depending on the
 argument.  The argument can be bgwriter to reset
 all the counters shown in
 the pg_stat_bgwriter view,
@@ -4730,6 +4730,8 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 pg_stat_wal view or
 recovery_prefetch to reset all the counters shown
 in the pg_stat_recovery_prefetch view.
+If no argument is specified, reset all these views at once.
+Note current patch is WIP and pg_stat_recovery_prefetch is not reset.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index d743fc0b28..9b9398439f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -767,6 +767,81 @@ pgstat_reset_of_kind(PgStat_Kind kind)
 		pgstat_reset_entries_of_kind(kind, ts);
 }
 
+/* Reset below cluster-wide statistics:
+ * - pg_stat_bgwriter
+ * - pg_stat_checkpointer
+ * - pg_stat_archiver
+ * - pg_stat_io
+ * - pg_stat_wal
+ *
+ * Note recovery_prefetch is not implemented(WIP).
+ */
+void
+pgstat_reset_shared_all(void)
+{
+#define STATS_SHARED_NUM_LWLOCK	4
+	TimestampTz ts = GetCurrentTimestamp();
+
+	PgStatShared_Archiver *stats_archiver = &pgStatLocal.shmem->archiver;
+	PgStatShared_BgWriter *stats_bgwriter = &pgStatLocal.shmem->bgwriter;
+	PgStatShared_Checkpointer *stats_checkpointer = &pgStatLocal.shmem->checkpointer;
+	PgStatShared_Wal *stats_wal = &pgStatLocal.shmem->wal;
+	PgStatShared_IO *stats_io = &pgStatLocal.shmem->io;
+
+	// Acquire LWLocks
+	LWLock *locks[] = {&stats_archiver->lock, &stats_bgwriter->lock,
+					   &stats_checkpointer->lock, &stats_wal->lock};
+
+	for (int i = 0; i < STATS_SHARED_NUM_LWLOCK; i++)
+		LWLockAcquire(locks[i], LW_EXCLUSIVE);
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		LWLock	*bktype_lock = &stats_io->locks[i];
+		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
+	}
+
+	// Reset stats
+	pgstat_copy_changecounted_stats(&stats_archiver->reset_offset,
+									&stats_archiver->stats,
+									sizeof(stats_archiver->stats),
+									&stats_archiver->changecount);
+	stats_archiver->stats.stat_reset_timestamp = ts;
+
+	pgstat_copy_changecounted_stats(&stats_bgwriter->reset_offset,
+									&stats_bgwriter->stats,
+									sizeof(stats_bgwriter->stats),
+									&stats_bgwriter->changecount);
+	stats_bgwriter->stats.stat_reset_timestamp = ts;
+
+	pgstat_copy_changecounted_stats(&stats_checkpointer->reset_offset,
+									&stats_checkpointer->stats,
+									sizeof(stats_checkpointer->stats),
+									&stats_checkpointer->changecount);
+	stats_checkpointer->stats.stat_reset_timestamp = ts;
+
+	memset(&stats_wal->stats, 0, sizeof(stats_wal->stats));
+	

Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread Michael Paquier
On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:
> We are still planning to address this issue in the back branches.

FWIW, redesigning the backend code in charge of doing base backups in
the back branches is out of scope.  Based on a read of the proposed
patch, it includes catalog changes which would require a catversion
bump, so that's not going to work anyway.
--
Michael




Re: Extract numeric filed in JSONB more effectively

2023-11-05 Thread zhihuifan1213

Chapman Flack  writes:

(This is Andy Fan; I just switched to my new email address.)

Hi Chap,

Thanks for always keeping an eye on this!

> Adding this comment via the CF app so it isn't lost, while an
> improperly-interpreted-DKIM-headers issue is still preventing me from
> mailing directly to -hackers.
>
> It was my view that the patch was getting close by the end of the last
> commitfest, but still contained a bit of a logic wart made necessary by
> a questionable choice of error message wording, such that in my view it
> would be better to determine whether a different error message would
> better conform to ISO SQL in the first place, and obviate the need for
> the logic wart.
>
> There seemed to be some progress possible on that when petere had time
> to weigh in on the standard shortly after the last CF ended.
>
> So, it would not have been my choice to assign RfC status before
> getting to a resolution on that.

I agree with this.

>
> Also, it is possible for a JsonbValue to hold a timestamp (as a result
> of a jsonpath evaluation, I don't think that can happen any other
> way),

I believe this is where our disagreement lies.

CREATE TABLE employees (
   id serial PRIMARY KEY,
   data jsonb
);

INSERT INTO employees (data) VALUES (
   '{
  "employees":[
 {
"firstName":"John",
"lastName":"Doe",
"hireDate":"2022-01-01T09:00:00Z",
"age": 30
 },
 {
"firstName":"Jane",
"lastName":"Smith",
"hireDate":"2022-02-01T10:00:00Z",
"age": 25
 }
  ]
   }'
);

select
jsonb_path_query_tz(data, '$.employees[*] ? (@.hireDate >=
"2022-02-01T00:00:00Z" && @.hireDate < "2022-03-01T00:00:00Z")')
from employees;

select jsonb_path_query_tz(data, '$.employees[*].hireDate ? (@ >=
"2022-02-01T00:00:00Z" && @ < "2022-03-01T00:00:00Z")') from employees;
select pg_typeof(jsonb_path_query_tz(data, '$.employees[*].hireDate ? (@
>= "2022-02-01T00:00:00Z" && @ < "2022-03-01T00:00:00Z")')) from
employees;

select jsonb_path_query_tz(data, '$.employees[*].hireDate ? (@
>= "2022-02-01T00:00:00Z" && @ < "2022-03-01T00:00:00Z")')::timestamp
from employees;
select jsonb_path_query_tz(data, '$.employees[*].hireDate ? (@
>= "2022-02-01T00:00:00Z" && @ < "2022-03-01T00:00:00Z")')::timestamptz
from employees;

I tried all of the above queries and can't find a place where this
optimization would apply.  Am I missing something?


> and if such a jsonpath evaluation were to be the source expression of a
> cast to SQL timestamp, that situation seems exactly analogous to the
> other situations being optimized here and would require only a few more
> lines in the exact pattern here introduced.

Could you provide an example of this? 

> While that could be called
> out of scope when this patch's title refers to "numeric field"
> specifically, it might be worth considering for completeness. The patch
> does, after all, handle boolean already, as well as numeric.

I wouldn't argue for this, at this point at least.

v15 is provided without any fundamental changes; it is just rebased
onto the latest code with a better commit message.

>From 4d53fda4974fa18827350a5f42482f0eec29d6ba Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 6 Nov 2023 11:09:14 +0800
Subject: [PATCH v15 1/1] Improve the performance of Jsonb extraction.

In the past, when we needed to extract a numeric value from a field in a
JSONB object, even though the JSONB object already contained a binary
matching numeric type, we would first find the corresponding JsonbValue,
then convert the JsonbValue to Jsonb, and finally use the cast system to
convert the Jsonb to a Numeric-like data type. This approach was very
inefficient in terms of performance.

In the current patch, if we encounter a function or operator that needs
to extract a JSONB field and cast it to a numeric-like data type, the
request will be automatically converted into extracting the field as a
JsonbValue data type from the JSONB field, then, convert the JsonbValue
to a numeric data type. If necessary, the final conversion from the
numeric data type to another numeric-like data type is done through the
casting system. This series of conversions is implemented through the
planner support function.  By utilizing these methods, the cumbersome
JSONB-related operations are avoided. Because the boolean type and
numeric type share certain similarities in their attributes, we have
implemented the same optimization approach for both.

At the implementation level, considering that we have multiple operators
and various target data types, and to avoid an excessive number of
functions, we have deconstructed the two steps mentioned earlier into
two categories of functions. The first category of functions extracts
the data as a JsonbValue type, while the second category converts the
JsonbValue to the target data type.

Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2023-11-05 Thread Dilip Kumar
On Sun, Nov 5, 2023 at 1:37 AM Andrey M. Borodin  wrote:

> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>
> changed the logic of SlruAdjustNSlots() in 0002, such that now it
> starts with the next power of 2 value of the configured slots and
> keeps doubling the number of banks until we reach the number of banks
> to the max SLRU_MAX_BANKS(128) and bank size is bigger than
> SLRU_MIN_BANK_SIZE (8).  By doing so, we will ensure we don't have too
> many banks
>
> There was nothing wrong with having too many banks. Until bank-wise locks and 
> counters were added in later patchsets.

I agree with that, but I feel that with bank-wise locks we remove the
major contention on the centralized control lock, and my first email
shows how much benefit we can get in one of the simple test cases when
we create subtransaction overflow.

> Having hashtable to find SLRU page in the buffer IMV is too slow. Some 
> comments on this approach can be found here [0].
> I'm OK with having HTAB for that if we are sure performance does not degrade 
> significantly, but I really doubt this is the case.
> I even think SLRU buffers used HTAB in some ancient times, but I could not 
> find commit when it was changed to linear search.

The main intention of having this buffer mapping hash is to find the
SLRU page faster than a sequential search when banks are relatively
large, but if we find cases where the hash creates more overhead than
gain then I am fine with removing it, because the whole purpose of
adding the hash here is to make the lookup faster.  So far my tests
have not shown any slowness.  Do you or anyone else have a test case,
based on the previous research, that shows whether it creates any
slowness?

> Maybe we could decouple locks and counters from SLRU banks? Banks were meant 
> to be small to exploit performance of local linear search. Lock partitions 
> have to be bigger for sure.

Yeah, that could also be an idea if we plan to drop the hash.  I mean,
a bank-wise counter is fine as we are finding a victim buffer within a
bank itself, but each lock could cover more slots than one bank's size,
or in other words, it could protect multiple banks.  Let's hear more
opinions on this.

>
> On 30 Oct 2023, at 09:20, Dilip Kumar  wrote:
>
> I have taken 0001 and 0002 from [1], done some bug fixes in 0001
>
>
> BTW can you please describe in more detail what kind of bugs?

Yeah, actually that patch was using the same GUC
(multixact_offsets_buffers) in SimpleLruInit for MultiXactOffsetCtl as
well as for MultiXactMemberCtl; see the below snippet from the
original patch.

@@ -1851,13 +1851,13 @@ MultiXactShmemInit(void)
  MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;

  SimpleLruInit(MultiXactOffsetCtl,
-   "MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
+   "MultiXactOffset", multixact_offsets_buffers, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
LWTRANCHE_MULTIXACTOFFSET_BUFFER,
SYNC_HANDLER_MULTIXACT_OFFSET);
  SlruPagePrecedesUnitTests(MultiXactOffsetCtl, MULTIXACT_OFFSETS_PER_PAGE);
  SimpleLruInit(MultiXactMemberCtl,
-   "MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
+   "MultiXactMember", multixact_offsets_buffers, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
LWTRANCHE_MULTIXACTMEMBER_BUFFER,
SYNC_HANDLER_MULTIXACT_MEMBER);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Synchronizing slots from primary to standby

2023-11-05 Thread shveta malik
On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, November 3, 2023 7:32 PM Amit Kapila 
> >
> > On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) 
> > wrote:
> > >
> > > Here is the new version patch set(V29) which addressed Peter
> > > comments[1][2] and fixed one doc compile error.
> > >
> >
> > Few comments:
> > ==
> > 1.
> > +   
> > +failover (boolean)
> > +
> > + 
> > +  Specifies whether the replication slot associated with the
> > subscription
> > +  is enabled to be synced to the physical standbys so that logical
> > +  replication can be resumed from the new primary after failover.
> > +  The default is true.
> >
> > Why do you think it is a good idea to keep the default value as true?
> > I think the user needs to enable standby for syncing slots which is not a 
> > default
> > feature, so by default, the failover property should also be false. AFAICS, 
> > it is
> > false for create_slot SQL API as per the below change; so that way also 
> > keeping
> > default true for a subscription doesn't make sense.
> > @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION
> > pg_create_logical_replication_slot(
> >  IN slot_name name, IN plugin name,
> >  IN temporary boolean DEFAULT false,
> >  IN twophase boolean DEFAULT false,
> > +IN failover boolean DEFAULT false,
> >  OUT slot_name name, OUT lsn pg_lsn)
> >
> > BTW, the below change indicates that the code treats default as false; so, 
> > it
> > seems to be a documentation error.
>
> I think the document is wrong and fixed it.
>
> >
> > 2.
> > -
> >  /*
> >   * Common option parsing function for CREATE and ALTER SUBSCRIPTION
> > commands.
> >   *
> >
> > Spurious line removal.
> >
> > 3.
> > + else if (opts.slot_name && failover_enabled) {
> > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover);
> > + ereport(NOTICE, (errmsg("altered replication slot \"%s\" on
> > + publisher", opts.slot_name))); }
> >
> > I think we can add a comment to describe why it makes sense to enable the
> > failover property of the slot in this case. Can we change the notice 
> > message to:
> > "enabled failover for replication slot \"%s\" on publisher"
>
> Added.
>
> >
> > 4.
> >  libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname,
> > - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action,
> > - XLogRecPtr *lsn)
> > + bool temporary, bool two_phase, bool failover, CRSSnapshotAction
> > + snapshot_action, XLogRecPtr *lsn)
> >  {
> >   PGresult   *res;
> >   StringInfoData cmd;
> > @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const
> > char *slotname,
> >   else
> >   appendStringInfoChar(&cmd, ' ');
> >   }
> > -
> > + if (failover)
> > + {
> > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax)
> > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, '
> > + '); }
> >
> > I don't see a corresponding change in repl_gram.y. I think the following 
> > part of
> > the code needs to be changed:
> > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */
> > | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT
> > create_slot_options
> >
>
> I think after 0266e98, we started to use the new syntax(see the
> generic_option_list rule) and we can avoid changing the repl_gram.y when 
> adding
> new options. The new failover can be detected when parsing the generic option
> list(in parseCreateReplSlotOptions).
>
>
> >
> > 5.
> > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo
> > fcinfo, bool confirm, bool bin
> >   NameStr(MyReplicationSlot->data.plugin),
> >   format_procedure(fcinfo->flinfo->fn_oid;
> > ..
> > + if (XLogRecPtrIsInvalid(upto_lsn))
> > + wal_to_wait = end_of_wal;
> > + else
> > + wal_to_wait = Min(upto_lsn, end_of_wal);
> > +
> > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig();
> > +
> > + /*
> > + * Wait for specified streaming replication standby servers (if any)
> > + * to confirm receipt of WAL upto wal_to_wait.
> > + */
> > + WalSndWaitForStandbyConfirmation(wal_to_wait);
> > +
> > + /*
> > + * The memory context used to allocate standby_slot_names_list will be
> > + * freed at the end of this call. So free and nullify the list in
> > + * order to avoid usage of freed list in the next call to this
> > + * function.
> > + */
> > + SlotSyncFreeConfig();
> >
> > What if there is an error in WalSndWaitForStandbyConfirmation() before 
> > calling
> > SlotSyncFreeConfig()? I think the problem you are trying to avoid by 
> > freeing it
> > here can occur. I think it is better to do this in a logical decoding 
> > context and
> > free the list along with it as we are doing in commit c7256e6564(see PG15).
>
> I will analyze more about this case and update in next version.
>
> > Also,
> > it is better to allocate this list somewhere in
> > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots,
> > 

Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2023-11-05 Thread Amit Kapila
On Sun, Nov 5, 2023 at 4:01 AM Bharath Rupireddy
 wrote:
>
> On Thu, Nov 2, 2023 at 7:19 AM Peter Smith  wrote:
> >
> > But that's not quite compatible with what Alvaro [2] had written long
> > back ("... the only acquisitions that would log messages are those in
> > StartReplication and StartLogicalReplication.").
> >
> > In other words, ReplicationSlotAcquire/ReplicationSlotRelease is
> > called by more places than you care to log from.
>
> I refreshed my thoughts for this patch and I think it's enough if
> walsenders alone log messages when slots become active and inactive.
> How about something like the attached v11 patch?
>

+ * This function is currently used only in walsender.
+ */
+void
+ReplicationSlotAcquireAndLog(const char *name, bool nowait)

BTW, is the reason for using it only in the walsender that it is a
background process where it is not very apparent when the slot is
acquired, whereas for foreground processes it is clearer when the
command is executed?  If so, the other alternative is to either use a
parameter to the existing function or directly use the am_walsender
flag to decide when to print the message in the acquire/release calls.

Can you please tell me the use case of this additional message?

A few other minor comments:
1.
+Causes each replication command and related activity to be logged in
+the server log.

Can we be a bit more specific by changing it to something like: "Causes
each replication command and slot acquisition/release to be logged in
the server log."

2.
+ ereport(log_replication_commands ? LOG : DEBUG1,
+ (errmsg("walsender process with PID %d acquired %s replication slot \"%s\"",

It seems that printing the PID and process name is quite unlike what we
print in other similar messages. For example, see the messages below
when we start replication via pub/sub:

2023-11-06 08:41:57.867 IST [24400] LOG:  received replication
command: CREATE_REPLICATION_SLOT "sub1" LOGICAL pgoutput (SNAPSHOT
'nothing')
2023-11-06 08:41:57.867 IST [24400] STATEMENT:
CREATE_REPLICATION_SLOT "sub1" LOGICAL pgoutput (SNAPSHOT 'nothing')
...
...
2023-11-06 08:41:57.993 IST [22332] LOG:  walsender process with PID
22332 acquired logical replication slot "sub1"
2023-11-06 08:41:57.993 IST [22332] STATEMENT:  START_REPLICATION SLOT
"sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names
'"pub1"')
...
...
2023-11-06 08:41:58.015 IST [22332] LOG:  starting logical decoding
for slot "sub1"
2023-11-06 08:41:58.015 IST [22332] DETAIL:  Streaming transactions
committing after 0/1522730, reading WAL from 0/15226F8.
2023-11-06 08:41:58.015 IST [22332] STATEMENT:  START_REPLICATION SLOT
"sub1" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names
'"pub1"')

We can get the PID from the log line prefix as with other logs, and I
don't see the process name printed anywhere else.

-- 
With Regards,
Amit Kapila.




RE: Is this a problem in GenericXLogFinish()?

2023-11-05 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Michael,

Thanks for making the patch!

> Why register wbuf at all if there are no tuples to add and it is not
> the same as bucketbuf? Also, I think this won't be correct if prevbuf
> and wrtbuf are the same and also we have no tuples to add to wbuf. I
> have attached a naive and crude way to achieve it. This needs more
> work both in terms of trying to find a better way to change the code
> or ensure this won't break any existing case. I have just run the
> existing tests. Such a fix certainly required more testing.

I'm verifying the idea (currently it seems OK), but at least there is a
coding error - wbuf_flags should be uint8 here. PSA the fixed patch.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED



fix_hash_squeeze_wal_2.patch
Description: fix_hash_squeeze_wal_2.patch


Re: Support run-time partition pruning for hash join

2023-11-05 Thread Richard Guo
On Sat, Nov 4, 2023 at 6:00 PM Alexander Lakhin  wrote:

> 02.11.2023 14:19, Richard Guo wrote:
>
> However, the cfbot indicates that there are test cases that fail on
> FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
> locally but just cannot reproduce the failure.  I must be doing
> something wrong.  Can anyone give me some hints or suggestions?
>
> I've managed to reproduce that failure on my Ubuntu with:
> CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES"
> ./configure ... make check
>

Wow, thank you so much.  You saved me a lot of time.  It turns out that
it was caused by me not making JoinPartitionPruneInfo a node.  The same
issue can also exist for JoinPartitionPruneCandidateInfo - if you
pprint(root) at some point you'll see a 'could not dump unrecognized
node type' warning.

Fixed this issue in v4.

Thanks
Richard


v4-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: pg_upgrade and logical replication

2023-11-05 Thread Peter Smith
Here are some review comments for patch v11-0001

==
Commit message

1.
The subscription's replication origin are needed to ensure
that we don't replicate anything twice.

~

/are needed/is needed/

~~~

2.
Author: Julien Rouhaud
Reviewed-by: FIXME
Discussion: https://postgr.es/m/20230217075433.u5mjly4d5cr4hcfe@jrouhaud

~

Include Vignesh as another author.

==
doc/src/sgml/ref/pgupgrade.sgml

3.
+ pg_upgrade attempts to migrate subscription
+ dependencies which includes the subscription tables information present in
+ pg_subscription_rel
+ system table and the subscription replication origin which
+ will help in continuing logical replication from where the old subscriber
+ was replicating. This helps in avoiding the need for setting up the

I became a bit lost reading this paragraph due to the multiple 'which'es...

SUGGESTION
pg_upgrade attempts to migrate subscription dependencies which
includes the subscription table information present in
pg_subscription_rel system
catalog and also the subscription replication origin. This allows
logical replication on the new subscriber to continue from where the
old subscriber was up to.

~~~

4.
+ was replicating. This helps in avoiding the need for setting up the
+ subscription objects manually which requires truncating all the
+ subscription tables and setting the logical replication slots. Migration

SUGGESTION
Having the ability to migrate subscription objects avoids the need to
set them up manually, which would require truncating all the
subscription tables and setting the logical replication slots.

~

TBH, I am wondering what is the purpose of this sentence. It seems
more like a justification for the patch, but does the user need to
know all this?

~~~

5.
+  
+   All the subscription tables in the old subscriber should be in
+   i (initialize), r (ready) or
+   s (synchronized). This can be verified by checking
+   pg_subscription_rel.srsubstate.
+  

/should be in/should be in state/

~~~

6.
+  
+   The replication origin entry corresponding to each of the subscriptions
+   should exist in the old cluster. This can be checking
+   pg_subscription and
+   pg_replication_origin
+   system tables.
+  

missing words?

/This can be checking/This can be found by checking/

~~~

7.
+
+ The subscriptions will be migrated to new cluster in disabled state, they
+ can be enabled after upgrade by following the steps:
+

The first bullet also says "Enable the subscription..." so I think
this paragraph should be worded like the below.

SUGGESTION
The subscriptions will be migrated to the new cluster in a disabled
state. After migration, do this:

==
src/backend/catalog/pg_subscription.c

8.
 #include "nodes/makefuncs.h"
+#include "replication/origin.h"
+#include "replication/worker_internal.h"
 #include "storage/lmgr.h"

Why does this change need to be in the patch when there are no other
code changes in this file?

==
src/backend/utils/adt/pg_upgrade_support.c

9. binary_upgrade_create_sub_rel_state

IMO a better name for this function would be
'binary_upgrade_add_sub_rel_state' (because it delegates to
AddSubscriptionRelState).

Then it would obey the same name pattern as the other function
'binary_upgrade_replorigin_advance' (which delegates to
replorigin_advance).

~~~

10.
+/*
+ * binary_upgrade_create_sub_rel_state
+ *
+ * Add the relation with the specified relation state to pg_subscription_rel
+ * table.
+ */
+Datum
+binary_upgrade_create_sub_rel_state(PG_FUNCTION_ARGS)
+{
+ Relation rel;
+ HeapTuple tup;
+ Oid subid;
+ Form_pg_subscription form;
+ char*subname;
+ Oid relid;
+ char relstate;
+ XLogRecPtr sublsn;

10a.
/to pg_subscription_rel table./to pg_subscription_rel catalog./

~

10b.
Maybe it would be helpful if the function argument were documented
up-front in the function-comment, or in the variable declarations.

SUGGESTION
char  *subname;  /* ARG0 = subscription name */
Oidrelid;/* ARG1 = relation Oid */
char   relstate; /* ARG2 = subrel state */
XLogRecPtr sublsn;   /* ARG3 (optional) = subscription lsn */

~~~

11.
if (PG_ARGISNULL(3))
    sublsn = InvalidXLogRecPtr;
else
    sublsn = PG_GETARG_LSN(3);

FWIW, I'd write that as a one-line ternary assignment allowing all the
args to be grouped nicely together.

SUGGESTION
sublsn = PG_ARGISNULL(3) ? InvalidXLogRecPtr : PG_GETARG_LSN(3);

~~~

12. binary_upgrade_replorigin_advance

/*
 * binary_upgrade_replorigin_advance
 *
 * Update the remote_lsn for the subscriber's replication origin.
 */
Datum
binary_upgrade_replorigin_advance(PG_FUNCTION_ARGS)
{
Relation rel;
HeapTuple tup;
Oid subid;
Form_pg_subscription form;
char*subname;
XLogRecPtr sublsn;
char originname[NAMEDATALEN];
RepOriginId originid;
~

Similar to previous comment #10b. Maybe it would be helpful if the
function argument were documented up-front in the function-comment, or
in the variable declarations.


minor doc issues.

2023-11-05 Thread jian he
hi

https://www.postgresql.org/docs/current/datatype-json.html
Table 8.23. JSON Primitive Types and Corresponding PostgreSQL Types

"SQL NULL is a different concept"
Can we change it to
"Only lowercase null is accepted. SQL NULL is a different concept."?
I saw people ask similar questions on Stack Overflow, so maybe it is
worth mentioning.

-
https://www.postgresql.org/docs/16/view-pg-backend-memory-contexts.html
Table 54.4. pg_backend_memory_contexts Columns
should be
Table 54.4. pg_backend_memory_contexts View.




Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2023-11-05 Thread jian he
minor doc issues.
Returns the chunk id of the TOASTed value, or NULL if the value is not TOASTed.
Should it be "chunk_id"?

You may want to place it after the pg_create_logical_replication_slot
entry to keep the list in alphabetical order.

There is no test. Maybe we can add the following to src/test/regress/sql/misc.sql:
create table val(t text);
INSERT into val(t) SELECT string_agg(
  chr((ascii('B') + round(random() * 25)) :: integer),'')
FROM generate_series(1,2500);
select pg_column_toast_chunk_id(t) is not null from val;
drop table val;




Re: Something seems weird inside tts_virtual_copyslot()

2023-11-05 Thread David Rowley
On Sat, 4 Nov 2023 at 15:15, Andres Freund  wrote:
>
> On 2023-11-01 11:35:50 +1300, David Rowley wrote:
> > I changed the Assert in tts_virtual_copyslot() to check the natts
> > match in each of the slots and all of the regression tests still pass,
> > so it seems we have no tests where there's an attribute number
> > mismatch...
>
> If we want to prohibit that, I think we ought to assert this in
> ExecCopySlot(), rather than just tts_virtual_copyslot.
>
> Even that does survive the test - but I don't think it'd be really wrong to
> copy from a slot with more columns into one with fewer. And it seems plausible
> that that could happen somewhere, e.g. when copying from a slot in a child
> partition with additional columns into a slot from the parent, where the
> column types/order otherwise matches, so we don't have to use the attribute
> mapping infrastructure.

Do you have any examples of when this could happen?

I played around with partitioned tables and partitions with various
combinations of dropped columns and can't see any cases of this. Given
the assert's condition has been backwards for 5 years now
(4da597edf1b), it just seems a bit unlikely that we have cases where
the source slot can have more attributes than the destination.

Given the Assert has been that way around for this long without any
complaints, I think we should just insist that the natts must match in
each slot.  If we later discover some reason that there's some corner
case where they don't match, we can adjust the code then.

I played around with the attached patch which removes the Assert and
puts some additional Assert checks inside ExecCopySlot() which
additionally checks the attribute types also match.  There are valid
cases where they don't match and that seems to be limited to cases
where we're performing DML on a table with a dropped column.
expand_insert_targetlist() will add NULL::int4 constants to the
targetlist in place of dropped columns but the tupledesc of the table
will have the atttypid set to InvalidOid per what that gets set to
when a column is dropped in RemoveAttributeById().

> > I think if we are going to support copying slots where the source and
> > destination don't have the same number of attributes then the
> > following comment should explain what's allowed and what's not
> > allowed:
> >
> > /*
> > * Copy the contents of the source slot into the destination slot's own
> > * context. Invoked using callback of the destination slot.
> > */
> > void (*copyslot) (TupleTableSlot *dstslot, TupleTableSlot *srcslot);
>
> Arguably the more relevant place would be document this in ExecCopySlot(), as
> that's what "users" of ExecCopySlot() would presumably would look at.  I dug a
> bit in the history, and we used to say

I think it depends on what you're documenting.  Writing comments above
the copyslot API function declaration is useful to define the API
standard for what new implementations of that interface must abide by.
Comments in ExecCopySlot() are useful to users of that function.  It
seems to me that both locations are relevant. New implementations of
copyslot need to know what they must handle, else they're left just to
look at what other implementations do and guess the rest.

David
diff --git a/src/backend/executor/execTuples.c 
b/src/backend/executor/execTuples.c
index 2c2712ceac..5a2db83987 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -253,8 +253,6 @@ tts_virtual_copyslot(TupleTableSlot *dstslot, 
TupleTableSlot *srcslot)
 {
TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;
 
-   Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
-
tts_virtual_clear(dstslot);
 
slot_getallattrs(srcslot);
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 4210d6d838..cc32550840 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -174,7 +174,10 @@ struct TupleTableSlotOps
 
/*
 * Copy the contents of the source slot into the destination slot's own
-* context. Invoked using callback of the destination slot.
+* context. Invoked using callback of the destination slot.  'dstslot' 
and
+* 'srcslot' can be assumed to match in terms of attribute count and 
data
+* types.  However, either TupleDesc may sometimes contain attribute 
types
+* of InvalidOid in cases where the attribute refers to a dropped 
column.
 */
void(*copyslot) (TupleTableSlot *dstslot, TupleTableSlot 
*srcslot);
 
@@ -340,6 +343,35 @@ extern void slot_getsomeattrs_int(TupleTableSlot *slot, 
int attnum);
 
 #ifndef FRONTEND
 
+
+/*
+ * VerifySlotCompatibility
+ * Ensure the two given slots are compatible with one another in 
terms of
+ * their number of attributes and attribute types.
+ */
+static void
+VerifySlotCompatibility(TupleTableSlot *slot1, TupleTableSlot *slot2)
+{
+#ifdef 

Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread David Steele

Rebased on 151ffcf6.diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..6be8fb902c5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
  ready to archive.
 
 
- pg_backup_stop will return one row with three
- values. The second of these fields should be written to a file named
- backup_label in the root directory of the backup. The
- third field should be written to a file named
- tablespace_map unless the field is empty. These 
files are
+ pg_backup_stop returns the
+ pg_control file, which must be stored in the
+ global directory of the backup. It also returns the
+ tablespace_map file, which should be written in the
+ root directory of the backup unless the field is empty. These files are
  vital to the backup working and must be written byte for byte without
- modification, which may require opening the file in binary mode.
+ modification, which will require opening the file in binary mode.
 


 
  Once the WAL segment files active during the backup are archived, you are
- done.  The file identified by pg_backup_stop's first 
return
+ done.  The file identified by pg_backup_stop's
+ lsn return
  value is the last segment that is required to form a complete set of
  backup files.  On a primary, if archive_mode is 
enabled and the
  wait_for_archive parameter is true,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);

 

-You should, however, omit from the backup the files within the
+You must exclude global/pg_control from your backup
+and put the contents of the pg_control_file column
+returned from pg_backup_stop in your backup at
+global/pg_control. This file contains the information
+required to safely recover.
+   
+
+   
+You should also omit from the backup the files within the
 cluster's pg_wal/ subdirectory.  This
 slight adjustment is worthwhile because it reduces the risk
 of mistakes when restoring.  This is easy to arrange if
@@ -1062,12 +1071,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);

 

-The backup label
-file includes the label string you gave to 
pg_backup_start,
-as well as the time at which pg_backup_start was run, 
and
-the name of the starting WAL file.  In case of confusion it is therefore
-possible to look inside a backup file and determine exactly which
-backup session the dump file came from.  The tablespace map file includes
+The tablespace map file includes
 the symbolic link names as they exist in the directory
 pg_tblspc/ and the full path of each symbolic link.
 These files are not merely for your information; their presence and
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a6fcac0824a..6fd2eb8cc50 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26815,7 +26815,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 
free (88 chunks); 1029560
   label text
   , fast boolean
)
-pg_lsn
+record
+( lsn pg_lsn,
+timeline_id int8,
+start timestamptz )


 Prepares the server to begin an on-line backup.  The only required
@@ -26827,6 +26830,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 
free (88 chunks); 1029560
 as possible.  This forces an immediate checkpoint which will cause a
 spike in I/O operations, slowing any concurrently executing queries.

+   
+The result columns contain information about the start of the backup
+and can be ignored: the lsn column holds the
+starting write-ahead log location, the
+timeline_id column holds the starting timeline,
+and the stop column holds the starting 
timestamp.
+   

 This function is restricted to superusers by default, but other users
 can be granted EXECUTE to run the function.
@@ -26842,13 +26852,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 
622360 free (88 chunks); 1029560
   wait_for_archive 
boolean
)
 record
-( lsn pg_lsn,
-labelfile text,
-spcmapfile text )
+( pg_control_file text,
+tablespace_map_file text,
+lsn pg_lsn,
+timeline_id int8,
+stop timestamptz )


 Finishes performing an on-line backup.  The desired contents of the
-backup label file and the tablespace map file are returned as part of
+pg_control file and the tablespace map file are returned as part of
 the result of the function and must be written to files in the
 backup area.  These files must not be written to the live data 
directory
 (doing so will cause PostgreSQL to fail to 

Re: Add recovery to pg_control and remove backup_label

2023-11-05 Thread David Steele

On 10/27/23 13:45, David G. Johnston wrote:


Let me modify that to make it a bit more clear, I actually wouldn't care 
if pg_backup_end outputs an entire binary pg_control file as part of the 
SQL resultset.


My proposal would be to, in addition, place in the temporary directory 
on the server, Postgres-written versions of pg_control and 
tablespace_map as part of the pg_backup_end processing.  The client 
software would then have a choice.  Write the contents of the SQL 
resultset to newly created binary mode files in the destination, or, 
copy the server-written files from the temporary directory to the 
destination.


That said, I'm starting to dislike that idea myself.  It only really 
makes sense if the files could be placed in the data directory but that 
isn't doable given concurrent backups and not wanting to place the 
source server into an inconsistent state.


Pretty much the conclusion I have come to myself over the years.

Regards,
-David





Re: MERGE ... RETURNING

2023-11-05 Thread Dean Rasheed
On Wed, 1 Nov 2023 at 17:49, Jeff Davis  wrote:
>
> Most of my concern is that parts of the implementation feel like a
> hack, which makes me concerned that we're approaching it the wrong way.
>

OK, that's a fair point. Attached is a new version, replacing those
parts of the implementation with a new MergingFunc node. It doesn't
add that much more complexity, and I think the new code is much
neater.

Also, I think this makes it easier / more natural to add additional
returning options, like Merlin's suggestion to return a user-defined
label value, though I haven't implemented that.

I have gone with the name originally suggested by Vik -- MERGING(),
which means that that has to be a new col-name keyword. I'm not
especially wedded to that name, but I think that it's not a bad
choice, and I think going with that is preferable to making MERGE a
col-name keyword.

So (quoting the example from the docs), the new syntax looks like this:

MERGE INTO products p
  USING stock s ON p.product_id = s.product_id
  WHEN MATCHED AND s.quantity > 0 THEN
UPDATE SET in_stock = true, quantity = s.quantity
  WHEN MATCHED THEN
UPDATE SET in_stock = false, quantity = 0
  WHEN NOT MATCHED THEN
INSERT (product_id, in_stock, quantity)
  VALUES (s.product_id, true, s.quantity)
  RETURNING MERGING(ACTION), MERGING(CLAUSE_NUMBER), p.*;

 action | clause_number | product_id | in_stock | quantity
--------+---------------+------------+----------+----------
 UPDATE |             1 |       1001 | t        |       50
 UPDATE |             2 |       1002 | f        |        0
 INSERT |             3 |       1003 | t        |       10

By default, the returned column names are automatically taken from the
argument to the MERGING() function (which isn't actually a function
anymore).
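
To illustrate that point about column names, a hedged sketch (using the syntax proposed in this patch, not released PostgreSQL) of aliasing the MERGING() outputs in the RETURNING list:

```sql
-- Sketch against the proposed syntax in this thread: the default output
-- column names (action, clause_number) are taken from the MERGING()
-- argument, and can be overridden with AS like any RETURNING expression.
MERGE INTO products p
  USING stock s ON p.product_id = s.product_id
  WHEN MATCHED THEN
    UPDATE SET in_stock = true, quantity = s.quantity
  WHEN NOT MATCHED THEN
    INSERT (product_id, in_stock, quantity)
      VALUES (s.product_id, true, s.quantity)
  RETURNING MERGING(ACTION) AS merge_action,
            MERGING(CLAUSE_NUMBER) AS merge_clause,
            p.product_id;
```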

There's one bug that I know about, to do with cross-partition updates,
but since that's a pre-existing bug, I'll start a new thread for it.

Regards,
Dean
diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
new file mode 100644
index cbbc5e2..3d95bdb
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -283,10 +283,15 @@ DELETE FROM products;
RETURNING
   
 
+  
+   MERGE
+   RETURNING
+  
+
   
Sometimes it is useful to obtain data from modified rows while they are
being manipulated.  The INSERT, UPDATE,
-   and DELETE commands all have an
+   DELETE, and MERGE commands all have an
optional RETURNING clause that supports this.  Use
of RETURNING avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
@@ -339,6 +344,21 @@ DELETE FROM products
 
   
 
+  
+   In a MERGE, the data available to RETURNING is
+   the content of the source row plus the content of the inserted, updated, or
+   deleted target row.  Since it is quite common for the source and target to
+   have many of the same columns, specifying RETURNING *
+   can lead to a lot of duplicated columns, so it is often more useful to
+   qualify it so as to return just the source or target row.  For example:
+
+MERGE INTO products p USING new_products n ON p.product_no = n.product_no
+  WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
+  WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
+  RETURNING p.*;
+
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index a6fcac0..06d5fe8
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -21960,6 +21960,84 @@ SELECT count(*) FROM sometable;
 
  
 
+ 
+  Merge Support Functions
+
+  
+   MERGE
+   RETURNING
+  
+
+  
+   MERGING
+  
+
+  
+   PostgreSQL includes one merge support function
+   that may be used in the RETURNING list of a
+command to return additional information about
+   the action taken for each row:
+
+MERGING ( property )
+
+   The following are valid property values specifying what to return:
+
+   
+
+ ACTION
+ 
+  
+   The merge action command executed for the current row
+   ('INSERT', 'UPDATE', or
+   'DELETE').
+  
+ 
+
+
+
+ CLAUSE_NUMBER
+ 
+  
+   The 1-based index of the WHEN clause executed for
+   the current row.
+  
+ 
+
+   
+  
+
+  
+   The following example illustrates how this may be used to determine
+   which actions were performed for each row affected by the
+   MERGE command:
+
+  
+
+  
+   Note that this function can only be used in the RETURNING
+   list of a MERGE command.  It is an error to use it in any
+   other part of a query.
+  
+
+ 
+
  
   Subquery Expressions
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index fe8def4..51a15ca
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1387,9 +1387,9 @@
  to a client upon the
  completion of an SQL command, usually a