Re: Add \syncpipeline command to pgbench

2024-01-18 Thread Anthonin Bonnefoy
On Fri, Jan 19, 2024 at 5:08 AM Michael Paquier  wrote:
> The logic looks sound, but I have a
> comment about the docs: could it be better to group \syncpipeline with
> \startpipeline and \endpipeline?  \syncpipeline requires a pipeline to
> work.

I've updated the doc to group the commands. It does look better and
more consistent with similar command groups like \if.

Regards,
Anthonin


v2-0001-pgbench-Add-syncpipeline-to-pgbench.patch
Description: Binary data


Re: pgbench: allow to cancel queries during benchmark

2024-01-18 Thread Yugo NAGATA
On Mon, 15 Jan 2024 16:49:44 +0900 (JST)
Tatsuo Ishii  wrote:

> > On Wed, 6 Sep 2023 20:13:34 +0900
> > Yugo NAGATA  wrote:
> >  
> >> I attached the updated patch v3. The changes since the previous
> >> patch includes the following;
> >> 
> >> I removed the unnecessary condition (&& false) that you
> >> pointed out in [1].
> >> 
> >> The test was rewritten using IPC::Run signal() and integrated
> >> into "001_pgbench_with_server.pl". This test is skipped on Windows
> >> because SIGINT would terminate the test itself, as discussed
> >> in [2] about the query cancellation test in psql.
> >> 
> >> I added some comments to describe how query cancellation is
> >> handled as I explained in [1].
> >> 
> >> Also, I found the previous patch didn't work on Windows, so I fixed it.
> >> On non-Windows systems, a thread waiting for the response to a long
> >> query can be interrupted by SIGINT, but on Windows, threads do not
> >> return from waiting until the queries they are running are cancelled.
> >> This is because, when the signal is received, the system just creates a
> >> new thread to execute the callback function specified by
> >> setup_cancel_handler, and the other threads continue to run [3].
> >> Therefore, the queries have to be cancelled in the callback function.
> >> 
> >> [1] 
> >> https://www.postgresql.org/message-id/a58388ac-5411-4760-ea46-71324d8324cb%40mines-paristech.fr
> >> [2] 
> >> https://www.postgresql.org/message-id/20230906004524.2fd6ee049f8a6c6f2690b99c%40sraoss.co.jp
> >> [3] https://learn.microsoft.com/en-us/windows/console/handlerroutine
> > 
> > I found that --disable-thread-safety option was removed in 68a4b58eca0329.
> > So, I removed codes involving ENABLE_THREAD_SAFETY from the patch.
> > 
> > Also, I wrote a commit log draft.
> 
> > Previously, Ctrl+C during benchmark killed pgbench immediately,
> > but queries running at that time were not cancelled.
> 
> Better to mention explicitly that queries keep on running on the
> backend. What about this?
> 
> Previously, Ctrl+C during benchmark killed pgbench immediately, but
> queries were not canceled and kept running on the backend
> until they tried to send their results to pgbench.

Thank you for your comments. I agree with you, so I fixed the message
as you suggested.

> > The commit
> > fixes this so that cancel requests are sent for all connections
> > before pgbench exits.
> 
> "sent for" -> "sent to"

Fixed.

> > Attached is the updated version, v4.
> 
> +/* send cancel requests to all connections */
> +static void
> +cancel_all()
> +{
> +	for (int i = 0; i < nclients; i++)
> +	{
> +		char		errbuf[1];
> +
> +		if (client_states[i].cancel != NULL)
> +			(void) PQcancel(client_states[i].cancel, errbuf, sizeof(errbuf));
> +	}
> +}
> +
> 
> Why is the error message neglected in case of errors from PQcancel? I
> think it's better to print out the error message in case of error.

Is the message useful for pgbench users? I saw that the error is ignored
in pg_dump, for example in bin/pg_dump/parallel.c:

/*
 * Send QueryCancel to leader connection, if enabled.  Ignore errors,
 * there's not much we can do about them anyway.
 */
if (signal_info.myAH != NULL && signal_info.myAH->connCancel != NULL)
	(void) PQcancel(signal_info.myAH->connCancel,
					errbuf, sizeof(errbuf));
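
For what it's worth, if reporting were ever wanted, a variant might look
like this (a sketch only; the quoted patch presumably uses errbuf[1]
precisely because the message is discarded, so a reporting variant needs a
real buffer):

/* Sketch: a cancel_all() variant that reports PQcancel() failures. */
static void
cancel_all(void)
{
	for (int i = 0; i < nclients; i++)
	{
		char		errbuf[256];

		if (client_states[i].cancel != NULL &&
			!PQcancel(client_states[i].cancel, errbuf, sizeof(errbuf)))
			fprintf(stderr, "could not send cancel request: %s", errbuf);
	}
}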


> +  * On non-Windows, any callback function is not set. When SIGINT is
> +  * received, CancelRequested is just set, and only thread #0 is interrupted
> +  * and returns from waiting input from the backend. After that, the thread
> +  * sends cancel requests to all benchmark queries.
> 
> The second line is a little bit long according to the coding
> standard. Fix like this?
> 
>* On non-Windows, any callback function is not set. When SIGINT is
>* received, CancelRequested is just set, and only thread #0 is
>* interrupted and returns from waiting input from the backend. After
>* that, the thread sends cancel requests to all benchmark queries.

Fixed.

The attached is the updated patch, v5.

Regards,
Yugo Nagata

> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


-- 
Yugo NAGATA 
From a42e6d47b27f2e91a02c5d934509070bd41bf79b Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Mon, 24 Jul 2023 21:53:28 +0900
Subject: [PATCH v5] Allow pgbench to cancel queries during benchmark

Previously, Ctrl+C during benchmark killed pgbench immediately,
but queries were not cancelled and kept running on the
backend until they tried to send their results to pgbench.
The commit fixes this so that cancel requests are sent to all
connections before pgbench exits.

In thread #0, setup_cancel_handler is called before the benchmark
so that CancelRequested is set when SIGINT is sent. When SIGINT
is sent during the benchmark, on non-Windows, thread 

logical decoding builds wrong snapshot with subtransactions

2024-01-18 Thread feichanghong
This issue has been reported at https://www.postgresql.org/message-id/18280-4c8060178cb41750%40postgresql.org
Hoping for some feedback from kernel hackers, thanks!


Hi, hackers,
I've encountered a problem with logical decoding history snapshots. The
specific error message is: "ERROR: could not map filenode "base/5/16390" to
relation OID".


If a subtransaction that modified the catalog ends before the
restart_lsn of the logical replication slot, and the commit WAL record of
its top transaction is after the restart_lsn, the WAL record related to the
subtransaction won't be decoded during logical decoding. Therefore, the
subtransaction won't be marked as having modified the catalog, resulting in
its absence from the snapshot's committed list.


The issue seems to be caused by SnapBuildXidSetCatalogChanges
(introduced in 272248a) skipping checks for subtransactions when the top
transaction is marked as containing catalog changes.


The following steps can reproduce the problem (I increased the value of
LOG_SNAPSHOT_INTERVAL_MS to avoid the impact of bgwriter writing
XLOG_RUNNING_XACTS WAL records):
session 1:
```
CREATE TABLE tbl1 (val1 integer, val2 integer);
CREATE TABLE tbl1_part (val1 integer) PARTITION BY RANGE (val1);


SELECT 'init' FROM pg_create_logical_replication_slot('isolation_slot', 
'test_decoding');


BEGIN;
SAVEPOINT sp1;
CREATE TABLE tbl1_part_p1 PARTITION OF tbl1_part FOR VALUES FROM (0) TO (10);
RELEASE SAVEPOINT sp1;
```


session 2:
```
CHECKPOINT;
```


session 1:
```
CREATE TABLE tbl1_part_p2 PARTITION OF tbl1_part FOR VALUES FROM (10) TO (20);
COMMIT;
BEGIN;
TRUNCATE tbl1;
```


session 2:
```
CHECKPOINT;
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 
'skip-empty-xacts', '1', 'include-xids', '0');
INSERT INTO tbl1_part VALUES (1);
SELECT data FROM pg_logical_slot_get_changes('isolation_slot', NULL, NULL, 
'skip-empty-xacts', '1', 'include-xids', '0');
```




To fix this issue, it is sufficient to remove the condition check for
ReorderBufferXidHasCatalogChanges in SnapBuildXidSetCatalogChanges.
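
For illustration, the check in question looks roughly like this (a sketch
based on the description above, not the actual snapbuild.c code):

/*
 * Sketch of SnapBuildXidSetCatalogChanges(): the early return below is the
 * problem.  Once the top-level xid is already marked, the subtransaction
 * xids in subxacts[] are never marked, so a catalog-modifying subxact that
 * ended before restart_lsn is missed.  The proposed fix removes the
 * ReorderBufferXidHasCatalogChanges() part of the condition.
 */
if (builder->catchange.xcnt == 0 ||
	ReorderBufferXidHasCatalogChanges(builder->reorder, xid))
	return;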


This fix may add subtransactions that didn't change the catalog to the commit
list, which seems like a false positive. However, this is acceptable since
we only use the snapshot built during decoding to read system catalogs, as
stated in 272248a's commit message.


I have verified that the patch in the attachment resolves the issues
mentioned, and I added some test cases.


I am eager to hear your suggestions on this!




Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.


 

fix_wrong_snapshot_for_logical_decoding.patch
Description: Binary data


Re: Change GUC hashtable to use simplehash?

2024-01-18 Thread John Naylor
On Wed, Jan 17, 2024 at 9:54 PM Heikki Linnakangas  wrote:

> Maybe something like:
>
> "
> Building blocks for creating fast inlineable hash functions. The
> functions in this file are not guaranteed to be stable between versions,
> and may differ by hardware platform. Hence they must not be used in
> indexes or other on-disk structures. See hashfn.h if you need stability.
> "
>
> typo: licencse
>
> Other than that, LGTM.

Pushed that way, thanks! After fixing another typo in big endian
builds, an s390x member reported green, so I think that aspect is
working now. I'll come back to follow-up topics shortly.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-18 Thread Masahiko Sawada
On Thu, Jan 18, 2024 at 1:30 PM John Naylor  wrote:
>
> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada  wrote:
> > It seems we cannot make this work nicely. IIUC VacDeadItems is
> > allocated in DSM and TidStore is embedded there. However,
> > dead_items->items.area is a local pointer to dsa_area. So we cannot
> > include dsa_area in either TidStore or RT_RADIX_TREE. Instead we
> > would need to pass dsa_area to each interface by callers.
>
> Thanks again for exploring this line of thinking! Okay, it seems even
> if there's a way to make this work, it would be too invasive to
> justify when compared with the advantage I was hoping for.
>
> > > If we can't make this work nicely, I'd be okay with keeping the tid
> > > store control object. My biggest concern is unnecessary
> > > double-locking.
> >
> > If we don't do any locking stuff in radix tree APIs and it's the
> > user's responsibility at all, probably we don't need a lock for
> > tidstore? That is, we expose lock functions as you mentioned and the
> > user (like tidstore) acquires/releases the lock before/after accessing
> > the radix tree and num_items.
>
> I'm not quite sure what the point of "num_items" is anymore, because
> it was really tied to the array in VacDeadItems. dead_items->num_items
> is essential to reading/writing the array correctly. If this number is
> wrong, the array is corrupt. There is no such requirement for the
> radix tree. We don't need to know the number of tids to add to it or
> do a lookup, or anything.

True. Sorry, I meant "num_tids" of TidStore. I'm still thinking
we need to have the number of TIDs in a tidstore, especially in the
tidstore's control object.

>
> There are a number of places where we assert "the running count of the
> dead items" is the same as "the length of the dead items array", like
> here:
>
> @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
>   BlockNumber threshold;
>
>   Assert(vacrel->num_index_scans == 0);
> - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
> + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
>
> As such, in HEAD I'm guessing it's arbitrary which one is used for
> control flow. Correct me if I'm mistaken. If I am wrong for some part
> of the code, it'd be good to understand when that invariant can't be
> maintained.
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> Like here. In HEAD, could this have used vacrel->dead_items?
>
> @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
>   * the second heap pass.  No more, no less.
>   */
> - Assert(index > 0);
>   Assert(vacrel->num_index_scans > 1 ||
> -(index == vacrel->lpdead_items &&
> +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
>   vacuumed_pages == vacrel->lpdead_item_pages));
>
>   ereport(DEBUG2,
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers
> in %u pages",
> + vacrel->relname, TidStoreNumTids(vacrel->dead_items),
> + vacuumed_pages)));
>
> We assert that vacrel->lpdead_items has the expected value, and then
> the ereport repeats the function call (with a lock) to read the value
> we just consulted to pass the assert.
>
> If we *really* want to compare counts, maybe we could invent a
> debugging-only function that iterates over the tree and popcounts the
> bitmaps. That seems too expensive for regular assert builds, though.

IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
the whole lazy vacuum operation whereas num_items is the number of
LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
That is, after heap vacuum, the latter counter is reset while the
former counter is not.

The latter counter is used in vacuumlazy.c as well as in the ereport in
vac_bulkdel_one_index().

>
> On the subject of debugging builds, I think it no longer makes sense
> to have the array for debug checking in tid store, even during
> development. A few months ago, we had an encoding scheme that looked
> simple on paper, but its code was fiendishly difficult to follow (at
> least for me). That's gone. In addition to the debugging count above,
> we could also put a copy of the key in the BlockTableEntry's header,
> in debug builds. We don't yet need to care about the key size, since
> we don't (yet) have runtime-embeddable values.

Putting a copy of the key in BlocktableEntry's header is an
interesting idea. But the current debug code in the tidstore also
makes sure that the tidstore returns TIDs in the correct order during
an iterate operation. I think it still has

Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan

Hi,

v6 is attached, which addresses all the items Robert suggested except
the following 2 open items; they are handled differently.

>
> Here is the summary of the open items; it would be great if Andres and
> Matthias could have a look at this when they have time.
>
>>> LockBufHdr also uses the init_local_spin_delay / perform_spin_delay
>>> infrastructure, and so it has the same issue as ${subject}; it is pretty
>>> similar to the code in s_lock. Based on my current knowledge, I think we
>>> should add the check there.
>>
>> I'd like to hear from Andres, if possible. @Andres: Should these
>> sanity checks apply only to spin locks per se, or also to buffer
>> header locks?

v6 is split into 2 commits: one for the normal SpinLock and one for the
LockBufHdr lock.

commit 6276d2f66b0760053e3fdfe259971be3abba3c63
Author: yizhi.fzh 
Date:   Fri Jan 19 13:52:07 2024 +0800

Detect more misuse of spin lock automatically

Spin locks are intended for *very* short-term locks, but it is possible
to misuse them in many cases, e.g. acquiring another LWLock or regular
lock, allocating memory, or calling errstart while holding a spin lock.
This patch detects such misuse automatically in a USE_ASSERT_CHECKING
build.

CHECK_FOR_INTERRUPTS should be avoided as well when holding a spin lock.
Depending on what signals are left to handle, PG may raise ERROR/FATAL,
which would cause the code to jump somewhere else without releasing the
spin lock.

commit 590a0c6f767f62f6c83289d55de99973bc7da417 (HEAD -> s_stuck_v3)
Author: yizhi.fzh 
Date:   Fri Jan 19 13:57:46 2024 +0800

Treat (un)LockBufHdr as a SpinLock.

LockBufHdr also uses the init_local_spin_delay / perform_spin_delay
infrastructure, so it is also possible to PANIC the system when the
lock can't be acquired in a short time, and its code is pretty similar
to s_lock. So treat it the same as a spin lock with regard to
misuse-of-spinlock detection.

>>> they were put into spin.h in v1 but later move to miscadmin.h at [1].
>>> [1]
>>> https://www.postgresql.org/message-id/CAEze2WggP-2Dhocmdhp-LxBzic%3DMXRgGA_tmv1G_9n-PDt2MQg%40mail.gmail.com
>>
>> I'm not entirely sure what the right thing to do is here, and the
>> answer may depend on the previous question. But I disagree with
>> Matthias -- I don't think miscadmin.h can be the right answer
>> regardless.

I put it into spin.h this time in commit 1, and declared the extern
function VerifyNoSpinLocksHeld (defined in spin.c) in miscadmin.h, like
what we did for ProcessInterrupts. This eases the miscadmin.h
dependency; the changes for '#include xxx' look better than before.
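
For readers following along, the detection idea can be sketched as follows
(illustrative only; the counter and macro bodies are assumptions, while
VerifyNoSpinLocksHeld is the function discussed above):

#ifdef USE_ASSERT_CHECKING
/* Number of spin locks currently held by this backend (sketch). */
static int	nspinlocks_held = 0;

/* Would be invoked from the spin lock acquire/release paths. */
#define START_SPIN_LOCK()	(nspinlocks_held++)
#define END_SPIN_LOCK() \
	do { Assert(nspinlocks_held > 0); nspinlocks_held--; } while (0)

/* Called from LockAcquireExtended(), LWLockAcquire(), palloc(), errstart()... */
void
VerifyNoSpinLocksHeld(void)
{
	Assert(nspinlocks_held == 0);
}
#else
#define START_SPIN_LOCK()	((void) 0)
#define END_SPIN_LOCK()		((void) 0)
#endif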

-- 
Best Regards
Andy Fan

From 6276d2f66b0760053e3fdfe259971be3abba3c63 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Fri, 19 Jan 2024 13:52:07 +0800
Subject: [PATCH v6 1/2] Detect more misuse of spin lock automatically

Spin locks are intended for *very* short-term locks, but it is possible
to misuse them in many cases, e.g. acquiring another LWLock or regular
lock, allocating memory, or calling errstart while holding a spin lock.
This patch detects such misuse automatically in a USE_ASSERT_CHECKING
build.

CHECK_FOR_INTERRUPTS should be avoided as well when holding a spin lock.
Depending on what signals are left to handle, PG may raise ERROR/FATAL,
which would cause the code to jump somewhere else without releasing the
spin lock.
---
 src/backend/storage/lmgr/lock.c   |  6 
 src/backend/storage/lmgr/lwlock.c |  6 
 src/backend/storage/lmgr/spin.c   | 13 +
 src/backend/utils/error/elog.c|  7 +
 src/backend/utils/mmgr/mcxt.c | 16 +++
 src/include/miscadmin.h   | 12 +++-
 src/include/storage/spin.h| 46 +--
 7 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c70a1adb9a..cb9969b860 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -776,6 +776,12 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	bool		found_conflict;
 	bool		log_lock = false;
 
+	/*
+	 * A spin lock should not be held for a long time, but the time needed here
+	 * may be too long, so let's make sure no spin lock is held now.
+	 */
+	VerifyNoSpinLocksHeld();
+
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
 	lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b4b989ac56..2faeb3aba6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1205,6 +1205,12 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 
 	Assert(mode == LW_SHARED || mode == LW_EXCLUSIVE);
 
+	/*
+	 * A spin lock should not be held for a long time, but the time needed here
+	 * may be too long, so let's make sure no spin lock is held now.
+	 */
+	VerifyNoSpinLocksHeld();
+
 	

Re: System username in pg_stat_activity

2024-01-18 Thread Bertrand Drouvot
Hi,

On Thu, Jan 18, 2024 at 04:01:33PM +0100, Magnus Hagander wrote:
> On Mon, Jan 15, 2024 at 11:17 AM Bertrand Drouvot
> > Did you forget to share the new revision (aka v4)? I can only see the
> > "reorder_parallel_worker_bestart.patch" attached.
> 
> I did. Here it is, and also including that suggested docs fix as well
> as a rebase on current master.
> 

Thanks!

Just a few comments:

1 ===

+   The authentication method used for authenticating the connection, or
+   NULL for background processes.

What about? "NULL for background processes (except for parallel workers which
inherit this information from their leader process)"

2 ===

+   cycle before they were assigned a database role.  Contains the same
+   value as the identity part in , or NULL
+   for background processes.

Same comment about parallel workers.

3 ===

+# pg_stat_activity should contain trust and empty string for trust auth
+$res = $node->safe_psql(
+   'postgres',
+   "SELECT auth_method, auth_identity='' FROM pg_stat_activity WHERE 
pid=pg_backend_pid()",
+   connstr => "user=scram_role");
+is($res, 'trust|t', 'Users with trust authentication should show correctly in 
pg_stat_activity');
+
+# pg_stat_activity should contain NULL for auth of background processes
+# (test is a bit out of place here, but this is where the other 
pg_stat_activity.auth* tests are)
+$res = $node->safe_psql(
+   'postgres',
+   "SELECT auth_method IS NULL, auth_identity IS NULL FROM 
pg_stat_activity WHERE backend_type='checkpointer'",
+);
+is($res, 't|t', 'Background processes should show NULL for auth in 
pg_stat_activity');

What do you think about testing the parallel workers cases too? (I'm not 100%
sure it's worth the extra complexity though).

Apart from those 3, it looks good to me.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-18 Thread Peter Smith
Here are some review comments for patch v63-0003.

==
Commit Message

1.
This patch attempts to start slot-sync worker as a special process
which is neither a bgworker nor an Auxiliary process. The benefit
we get here is we can control the start-conditions of the worker which
further allows us to 'enable_syncslot' as PGC_SIGHUP which was otherwise
a PGC_POSTMASTER GUC when slotsync worker was registered as bgworker.

~

missing word?

/allows us to/allows us to define/

==
src/backend/postmaster/postmaster.c

2. process_pm_child_exit

+ /*
+ * Was it the slot sync worker? Normal exit or FATAL exit (FATAL can
+ * be caused by libpqwalreceiver on receiving shutdown request by the
+ * startup process during promotion) can be ignored; we'll start a new
+ * one at the next iteration of the postmaster's main loop, if
+ * necessary. Any other exit condition is treated as a crash.
+ */
+ if (pid == SlotSyncWorkerPID)
+ {
+ SlotSyncWorkerPID = 0;
+ if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ HandleChildCrash(pid, exitstatus,
+ _("Slotsync worker process"));
+ continue;
+ }

2a.

I think the 2nd sentence is easier to read if written like:

Normal exit or FATAL exit can be ignored (FATAL can be caused by
libpqwalreceiver on receiving shutdown request by the startup process
during promotion);

~

2b.
All other names nearby are lowercase, so maybe change "Slotsync worker
process" to "slotsync worker process" or "slot sync worker process".

==
src/backend/replication/logical/slotsync.c

3. check_primary_info

  if (!valid)
- ereport(ERROR,
+ {
+ *primary_slot_invalid = true;
+ ereport(LOG,
  errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("exiting from slot synchronization due to bad configuration"),
+ errmsg("skipping slot synchronization due to bad configuration"),
  /* translator: second %s is a GUC variable name */
  errdetail("The primary server slot \"%s\" specified by %s is not valid.",
PrimarySlotName, "primary_slot_name"));
+ }

Somehow it seems more appropriate for the *caller* to decide what to
do (e.g. "skipping...") when the primary slot is invalid. See also
the next review comment #4b -- maybe just change this LOG to say "bad
configuration for slot synchronization".

~~~

4.
 /*
  * Check that all necessary GUCs for slot synchronization are set
- * appropriately. If not, raise an ERROR.
+ * appropriately. If not, log the message and pass 'valid' as false
+ * to the caller.
  *
  * If all checks pass, extracts the dbname from the primary_conninfo GUC and
  * returns it.
  */
 static char *
-validate_parameters_and_get_dbname(void)
+validate_parameters_and_get_dbname(bool *valid)

4a.
This feels back-to-front. I think a "validate" function should return
boolean. It can return the dbname as a side-effect only when it is
valid.

SUGGESTION
static bool
validate_parameters_and_get_dbname(char *dbname)

~

4b.
It was a bit different when there were ERRORs, but now that they are
LOGs it somehow seems wrong for this function to say what the *caller*
will do. Maybe you can rewrite all the errmsg so they don't say "skipping"
but just say "bad configuration for slot synchronization".

If valid is false then you can LOG "skipping" at the caller...

~~~

5. wait_for_valid_params_and_get_dbname

+ dbname = validate_parameters_and_get_dbname(&valid);
+ if (valid)
+ break;
+ else

This code will be simpler when the function is changed to return
boolean as suggested above in #4a.

Also the 'else' is unnecessary.

SUGGESTION
if (validate_parameters_and_get_dbname(&dbname))
break;
~

6.
+ if (rc & WL_LATCH_SET)
+ ResetLatch(MyLatch);
+
+ }
+ }

Unnecessary blank line.

~~~

7. slotsync_reread_config

+ if (old_enable_syncslot != enable_syncslot)
+ {
+ /*
+ * We have reached here, so old value must be true and new must be
+ * false.
+ */
+ Assert(old_enable_syncslot);
+ Assert(!enable_syncslot);

I felt it would be better just to say Assert(enable_syncslot); at the
top of this function (before the ProcessConfigFile). Then none of this
other comment/assert if really needed because it should be
self-evident.

~~~

8. StartSlotSyncWorker

int
StartSlotSyncWorker(void)
{
pid_t pid;

#ifdef EXEC_BACKEND
switch ((pid = slotsyncworker_forkexec()))
#else
switch ((pid = fork_process()))
#endif
{
case -1:
ereport(LOG,
(errmsg("could not fork slot sync worker process: %m")));
return 0;

#ifndef EXEC_BACKEND
case 0:
/* in postmaster child ... */
InitPostmasterChild();

/* Close the postmaster's sockets */
ClosePostmasterPorts(false);

ReplSlotSyncWorkerMain(0, NULL);
break;
#endif
default:
return (int) pid;
}

/* shouldn't get here */
return 0;
}

The switch code can be rearranged so you don't need the #ifndef

SUGGESTION
#ifdef EXEC_BACKEND
switch ((pid = slotsyncworker_forkexec()))
{
#else
switch ((pid = fork_process()))
{
case 0:
/* in postmaster child ... */
InitPostmasterChild();

/* Close the postmaster's sockets */
ClosePostmasterPorts(false);

ReplSlotSyncWorkerMain(0, NULL);
break;
#endif
ca
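
(The SUGGESTION above is cut off in this digest; judging from the original
function quoted earlier, it presumably continues with the remaining arms,
roughly:)

	case -1:
		ereport(LOG,
				(errmsg("could not fork slot sync worker process: %m")));
		return 0;

	default:
		return (int) pid;
	}

	/* shouldn't get here */
	return 0;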

Re: Synchronizing slots from primary to standby

2024-01-18 Thread shveta malik
On Fri, Jan 19, 2024 at 11:23 AM Amit Kapila  wrote:
>
> > > 5 === (coming from v62-0002)
> > > +   Assert(tuplestore_tuple_count(res->tuplestore) == 1);
> > >
> > > Is it even possible for the related query to not return only one row? (I 
> > > think the
> > > "count" ensures it).
> >
> > I think you are right. This assertion was added sometime back on the
> > basis of feedback on hackers. Let me review that again. I can consider
> > this comment in the next version.
> >
>
> OTOH, can't we keep the assert as it is but remove "= 1" from
> "count(*) = 1" in the query. There shouldn't be more than one slot
> with same name on the primary. Or, am I missing something?

There will be 1 record max, and 0 records if the primary_slot_name is
invalid. Keeping 'count(*)=1' gives the benefit that it straight
away gives us true/false, indicating whether we are good or not wrt
primary_slot_name. I feel the Assert can be removed and we can simply
have:

if (!tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
	elog(ERROR, "failed to fetch primary_slot_name tuple");

thanks
Shveta




Re: Test slots invalidations in 035_standby_logical_decoding.pl only if dead rows are removed

2024-01-18 Thread Alexander Lakhin

Hello,

15.01.2024 12:00, Alexander Lakhin wrote:

If this approach looks promising to you, maybe we could add a submodule to
perl/PostgreSQL/Test/ and use this functionality in other tests (e.g., in
019_replslot_limit) as well.

Personally, I think that having such functionality for use in tests
might be useful not only to avoid some "problematic" behaviour but also to
test the opposite cases.


After spending a few days on it, I've discovered two more issues:
https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com

(The latter is not related to bgwriter directly, but it was discovered
thanks to the RUNNING_XACTS record flew in WAL in a lucky moment.)

So it becomes clear that the 035 test is not the only one, which might
suffer from bgwriter's activity, and inventing a way to stop bgwriter/
control it is a different subject, getting out of scope of the current
issue.


15.01.2024 11:49, Bertrand Drouvot wrote:

We did a few things in this thread, so to sum up what we've discovered:

- a race condition in InvalidatePossiblyObsoleteSlot() (see [1])
- we need to launch the vacuum(s) only if we are sure we got a newer XID horizon
(proposal in v6 attached)
- we need a way to control how frequently xl_running_xacts are emitted (to ensure
they are not triggered in the middle of an active slot invalidation test).

I'm not sure it's possible to address Tom's concern and keep the test 
"predictable".

So, I think I'd vote for Michael's proposal to implement a superuser-settable
developer GUC (as sending a SIGSTOP to the bgwriter (and bypassing $windows_os)
would still not address Tom's concern anyway).

Another option would be to "sacrifice" the full predictability of the test (in
favor of real-world behavior testing)?

[1]: 
https://www.postgresql.org/message-id/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal


So, now we have the test 035 failing due to nondeterministic vacuum
activity in the first place, and due to bgwriter's activity in the second.
Maybe it would be the right move to commit the fix, and then think about
the rarer failures.

Though I have a couple of questions regarding the fix left, if you don't
mind:
1) The test has minor defects in the comments, that I noted before [1];
would you like to fix them in passing?


BTW, it looks like the comment:
# One way to produce recovery conflict is to create/drop a relation and
# launch a vacuum on pg_class with hot_standby_feedback turned off on the 
standby.
in scenario 3 is a copy-paste error.
Also, there are two "Scenario 4" in this test.



2) Shall we align the 035_standby_logical_decoding with
031_recovery_conflict in regard to improving stability of vacuum?
I see the following options for this:
a) use wait_until_vacuum_can_remove() and autovacuum = off in both tests;
b) use FREEZE and autovacuum = off in both tests;
c) use wait_until_vacuum_can_remove() in 035, FREEZE in 031, and
 autovacuum = off in both.

I've re-tested the v6 patch today and confirmed that it makes the test
more stable. I ran (with bgwriter_delay = 1 in temp.config) 20 tests in
parallel and got failures ('inactiveslot slot invalidation is logged with
vacuum on pg_authid') on iterations 2, 6, 6 with no patch applied.
(With unlimited CPU, when average test duration is around 70 seconds.)

But with v6 applied, 60 iterations succeeded.

[1] 
https://www.postgresql.org/message-id/b2a1f7d0-7629-72c0-2da7-e9c4e336b...@gmail.com

Best regards,
Alexander




Re: Add 64-bit XIDs into PostgreSQL 15

2024-01-18 Thread Shubham Khanna
On Wed, Dec 13, 2023 at 5:56 PM Maxim Orlov  wrote:
>
> Hi!
>
> Just to keep this thread up to date, here's a new version after recent
> changes in SLRU.
> I also changed the order of the patches in the set, putting the initdb MOX
> options after the "core 64 xid" patch, since the MOX patch is unlikely to
> be committed and is now for test purposes only.

I tried to apply the patch, but it fails on HEAD. It gives
the following errors:
Hunk #1 succeeded at 601 (offset 5 lines).
patching file src/backend/replication/slot.c
patching file src/backend/replication/walreceiver.c
patching file src/backend/replication/walsender.c
Hunk #1 succeeded at 2434 (offset 160 lines).
patching file src/backend/storage/ipc/procarray.c
Hunk #1 succeeded at 1115 with fuzz 2.
Hunk #3 succeeded at 1286 with fuzz 2.
Hunk #7 FAILED at 4341.
Hunk #8 FAILED at 4899.
Hunk #9 FAILED at 4959.
3 out of 10 hunks FAILED -- saving rejects to file
src/backend/storage/ipc/procarray.c.rej
patching file src/backend/storage/ipc/standby.c
Hunk #1 FAILED at 1043.
Hunk #2 FAILED at 1370.
2 out of 2 hunks FAILED -- saving rejects to file
src/backend/storage/ipc/standby.c.rej
Please send the Re-base version of the patch.

Thanks and Regards,
Shubham Khanna.




Re: Improve the connection failure error messages

2024-01-18 Thread Nisha Moond
On Fri, Jan 12, 2024 at 7:06 PM Aleksander Alekseev
 wrote:
>
> Hi,
>
> Thanks for the patch.
>
> > Due to this behavior, it is not possible to add a test to show the
> > error message as it is done for CREATE SUBSCRIPTION.
> > Let me know if you think there is another way to add this test.
>
> I believe it can be done with TAP tests, see for instance:
>
> contrib/auto_explain/t/001_auto_explain.pl
>
> However I wouldn't insist on including the test in scope of this
> particular patch. Such a test doesn't currently exist, it can be added
> as a separate patch, and whether this is actually a useful test (all
> the tests consume time after all...) is somewhat debatable. Personally
> I agree that it would be nice to have though.
Thank you for providing this information. Yes, I can write a TAP test
to check the log for the same error message.
I'll attempt it and perform a time analysis. I'm unsure where to
appropriately add this test. Any suggestions?

Following your suggestion, I won't include the test in the scope of
this patch. Instead, I'll start a new thread once I have gathered the
required information.

--
Thanks,
Nisha




Re: Synchronizing slots from primary to standby

2024-01-18 Thread Amit Kapila
On Wed, Jan 17, 2024 at 4:06 PM shveta malik  wrote:
>
> >
> > 5 === (coming from v62-0002)
> > +   Assert(tuplestore_tuple_count(res->tuplestore) == 1);
> >
> > Is it even possible for the related query to not return only one row? (I 
> > think the
> > "count" ensures it).
>
> I think you are right. This assertion was added sometime back on the
> basis of feedback on hackers. Let me review that again. I can consider
> this comment in the next version.
>

OTOH, can't we keep the assert as it is but remove "= 1" from
"count(*) = 1" in the query. There shouldn't be more than one slot
with same name on the primary. Or, am I missing something?

-- 
With Regards,
Amit Kapila.




Re: generate syscache info automatically

2024-01-18 Thread John Naylor
On Wed, Jan 17, 2024 at 7:46 PM Peter Eisentraut  wrote:
>
> I updated the patch to use this style (but I swapped the first two
> arguments from my example, so that the thing being created is named first).
>
> I also changed the names of the output files a bit to make them less
> confusing.  (I initially had some files named .c.h, which was weird, but
> apparently necessary to avoid confusing the build system.  But it's all
> clearer now.)
>
> Other than bugs and perhaps style opinions, I think the first patch is
> pretty good now.

LGTM. The only style consideration that occurred to me just now was
MAKE_SYSCACHE, since it doesn't match the DECLARE_... macros. It seems
like the same thing from a code generation perspective. The other
macros refer to relations, so there is a difference, but maybe it
doesn't matter. I don't have a strong opinion.




Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-18 Thread Michael Paquier
On Wed, Jan 17, 2024 at 04:03:46PM +0800, feichanghong wrote:
> It has been verified that the patch in the attachment can solve the
> above problems. I sincerely look forward to your suggestions!

Thanks for the patch.  I have completely forgotten to update this
thread.  Except for a few comments and names that I've tweaked, this
was OK, so applied and backpatched after splitting things into two:
- One commit for try_index_open().
- Second commit for the fix in reindex_index().

I've looked at the concurrent paths as well, and even if these involve
opening more relations, we maintain a session lock on the parent
relations that we manipulate, so I could not see a pattern where the
index would be dropped while we try to open it.  Now, there are
cases where the concurrent paths can deadlock, but that's not new:
schema- or database-level reindexes can also hit that.

This is one of those areas where tests are hard to write now, because
we want to stop operations at specific points but we cannot.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2024-01-18 Thread Michael Paquier
On Thu, Jan 18, 2024 at 09:25:16AM +0100, Anthonin Bonnefoy wrote:
> From what I can tell, there's no change in the behaviour. All paths
> would eventually go through HandleSlashCmds's cleaning logic. This is
> also mentioned in ignore_slash_options's comment.

Yeah, I can confirm that.  I would be really tempted to backpatch that
because that's a bug: we have to call ignore_slash_options() for
inactive branches when a command parses options with OT_NORMAL.  Now,
I cannot break things, either.

> Done. patch 1 adds ignore_slash_options to bind. patch 2 adds the new
> \bindx, \close and \parse commands.

0001 has been applied on HEAD.
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-01-18 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 7:30 PM shveta malik  wrote:
>
> On Wed, Jan 17, 2024 at 3:08 PM Bertrand Drouvot
>  wrote:
> >
> > Hi,
> >
> > On Tue, Jan 16, 2024 at 05:27:05PM +0530, shveta malik wrote:
> > > PFA v62. Details:
> >
> > Thanks!
> >
> > > v62-003:
> > > It is a new patch which attempts to implement slot-sync worker as a
> > > special process which is neither a bgworker nor an Auxiliary process.
> > > Here we get the benefit of converting enable_syncslot to a PGC_SIGHUP
> > > Guc rather than PGC_POSTMASTER. We launch the slot-sync worker only if
> > > it is hot-standby and 'enable_syncslot' is ON.
> >
> > The implementation looks reasonable to me (from what I can see, some parts are
> > copy/paste from an already existing "special" process and some parts are
> > "sync slot" specific), which makes full sense.
> >
> > A few remarks:
> >
> > 1 ===
> > +* Was it the slot sycn worker?
> >
> > Typo: sycn
> >
> > 2 ===
> > +* ones), and no walwriter, autovac launcher or bgwriter or 
> > slot sync
> >
> > Instead? "* ones), and no walwriter, autovac launcher, bgwriter or slot 
> > sync"
> >
> > 3 ===
> > + * restarting slot slyc worker. If stopSignaled is set, the worker will
> >
> > Typo: slyc
> >
> > 4 ===
> > +/* Flag to tell if we are in an slot sync worker process */
> >
> > s/an/a/ ?
> >
> > 5 === (coming from v62-0002)
> > +   Assert(tuplestore_tuple_count(res->tuplestore) == 1);
> >
> > Is it even possible for the related query to not return only one row? (I 
> > think the
> > "count" ensures it).
> >
> > 6 ===
> > if (conninfo_changed ||
> > primary_slotname_changed ||
> > +   old_enable_syncslot != enable_syncslot ||
> > (old_hot_standby_feedback != hot_standby_feedback))
> > {
> > ereport(LOG,
> > errmsg("slot sync worker will restart 
> > because of"
> >" a parameter change"));
> >
> > I don't think "slot sync worker will restart" is true if one changes
> > enable_syncslot
> > from on to off.
> >
> > IMHO, v62-003 is in good shape and could be merged in v62-002 (that would 
> > ease
> > the review). But let's wait to see if others think differently.
> >
> > Regards,
> >
> > --
> > Bertrand Drouvot
> > PostgreSQL Contributors Team
> > RDS Open Source Databases
> > Amazon Web Services: https://aws.amazon.com
>
>
> PFA v63.
>
> --It addresses comments by Peter given in [1], [2], comment by Nisha
> given in [3], comments by Bertrand given in [4]
> --It also moves race-condition fix from patch003 to patch002 as
> suggested by Swada-san offlist. Race-condition is mentioned in [5]
>

Thank you for updating the patch. I have some comments:

---
+latestWalEnd = GetWalRcvLatestWalEnd();
+if (remote_slot->confirmed_lsn > latestWalEnd)
+{
+    elog(ERROR, "exiting from slot synchronization as the received slot sync"
+         " LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X",
+         LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
+         remote_slot->name,
+         LSN_FORMAT_ARGS(latestWalEnd));
+}

IIUC GetWalRcvLatestWalEnd() returns walrcv->latestWalEnd, which is
typically the primary server's flush position and doesn't mean the LSN
where the walreceiver received/flushed up to. Does it really happen
that the slot's confirmed_flush_lsn is higher than the primary's flush
lsn?

---
After dropping a database on the primary, I got the following LOG (PID
2978463 is the slotsync worker on the standby):

LOG:  still waiting for backend with PID 2978463 to accept ProcSignalBarrier
CONTEXT:  WAL redo at 0/301CE00 for Database/DROP: dir 1663/16384

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)

2024-01-18 Thread Dilip Kumar
On Wed, Nov 1, 2023 at 2:42 AM Matthias van de Meent
 wrote:
>
> Hi,
>
> Currently, nbtree code compares each and every column of an index
> tuple during the binary search on the index page. With large indexes
> that have many duplicate prefix column values (e.g. an index on (bool,
> bool, uuid) ) that means a lot of wasted time getting to the right
> column.
>
> The attached patch improves on that by doing per-page dynamic prefix
> truncation: If we know that on both the right and left side there are
> index tuples where the first two attributes are equal to the scan key,
> we skip comparing those attributes at the current index tuple and
> start with comparing attribute 3, saving two attribute compares. We
> gain performance whenever comparing prefixing attributes is expensive
> and when there are many tuples with a shared prefix - in unique
> indexes this doesn't gain much, but we also don't lose much in this
> case.
>
> This patch was originally suggested at [0], but it was mentioned that
> it could be pulled out into its own thread. Earlier, the
> performance gains were not clearly there for just this patch, but
> after further benchmarking this patch stands on its own for
> performance: it sees no obvious degradation of performance, while
> gaining 0-5% across various normal indexes on the cc-complete sample
> dataset, with the current worst-case index shape getting a 60%+
> improved performance on INSERTs in the tests at [0].

+1 for the idea, I have some initial comments while reading through the patch.

1.
Commit message refers to a non-existing reference '(see [0])'.


2.
+When we do a binary search on a sorted set (such as a BTree), we know that a
+tuple will be smaller than its left neighbour, and larger than its right
+neighbour.

I think this should be 'larger than left neighbour and smaller than
right neighbour' instead of the other way around.

3.
+With the above optimizations, dynamic prefix truncation improves the worst
+case complexity of indexing from O(tree_height * natts * log(tups_per_page))
+to O(tree_height * (3*natts + log(tups_per_page)))

Where does the 3*natts come from?  Is it related to setting up the
dynamic prefix at each level?

4.
+ /*
+ * All tuple attributes are equal to the scan key, only later attributes
+ * could potentially not equal the scan key.
+ */
+ *compareattr = ntupatts + 1;

Can you elaborate on this more? If all tuple attributes are equal to
the scan key then what do those 'later attributes' mean?
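
For context on the questions above, the per-page dynamic prefix idea from
the quoted description can be sketched like this (illustrative only;
compare_starting_at() and the surrounding types are hypothetical helpers,
not the patch's actual functions):

/*
 * Binary search with dynamic prefix truncation (sketch).  If the tuples at
 * both search bounds are known to equal the scan key on their first k
 * attributes, sortedness guarantees every tuple between them does too, so
 * each comparison can start at attribute k + 1.
 */
static int
search_with_dynamic_prefix(ScanKey key, int nkeys, Page page, int ntuples)
{
	int			low = 0,
				high = ntuples;
	int			low_prefix = 0,		/* attrs proven equal at the low bound */
				high_prefix = 0;	/* attrs proven equal at the high bound */

	while (low < high)
	{
		int			mid = low + (high - low) / 2;
		int			skip = Min(low_prefix, high_prefix);
		int			equal_upto;		/* attrs found equal, set by compare */
		int			cmp = compare_starting_at(key, nkeys, page, mid,
											  skip + 1, &equal_upto);

		if (cmp > 0)
		{
			low = mid + 1;
			low_prefix = equal_upto;
		}
		else
		{
			high = mid;
			high_prefix = equal_upto;
		}
	}
	return low;
}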


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Commitfest manager January 2024

2024-01-18 Thread vignesh C
On Fri, 19 Jan 2024 at 09:19, Peter Smith  wrote:
>
> Hi Vignesh,
>
> If you would like any assistance processing the 100s of CF entries I
> am happy to help in some way.

Thanks Peter, that will be great.

Regards,
Vignesh




Re: Fix incorrect format placeholders in walreceiver.c

2024-01-18 Thread Michael Paquier
On Thu, Jan 18, 2024 at 09:32:55PM +0800, Yongtao Huang wrote:
> I think the correct placeholder for var *startoff* should be *%d*.
> Thanks for your time.

Thanks, fixed.
--
Michael


signature.asc
Description: PGP signature


Re: Add \syncpipeline command to pgbench

2024-01-18 Thread Michael Paquier
On Thu, Jan 18, 2024 at 09:48:28AM +0100, Anthonin Bonnefoy wrote:
> PQsendSyncMessage() was added in
> https://commitfest.postgresql.org/46/4262/. It allows users to add a
> Sync message without flushing the buffer.
> 
> As a follow-up, this change adds an additional meta-command to
> pgbench, \syncpipeline, which will call PQsendSyncMessage(). This will
> make it possible to benchmark impact and improvements of using
> PQsendSyncMessage through pgbench.

Thanks for sending that as a separate patch.

As a matter of fact, I have already looked at what you are proposing
here for the sake of the other thread when checking the difference in
numbers with PQsendSyncMessage().  The logic looks sound, but I have a
comment about the docs: could it be better to group \syncpipeline with
\startpipeline and \endpipeline?  \syncpipeline requires a pipeline to
work.
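
As a side note for readers, in libpq terms the new meta-command corresponds
roughly to the following (a sketch assuming PQsendSyncMessage() from the
commitfest entry above; error handling omitted):

/* \startpipeline */
PQenterPipelineMode(conn);

PQsendQueryParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);

/* \syncpipeline: append a Sync message without flushing the buffer */
PQsendSyncMessage(conn);

PQsendQueryParams(conn, "SELECT 2", 0, NULL, NULL, NULL, NULL, 0);

/* \endpipeline: PQpipelineSync() both appends a Sync and flushes */
PQpipelineSync(conn);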
--
Michael


signature.asc
Description: PGP signature


Re: Commitfest manager January 2024

2024-01-18 Thread Peter Smith
Hi Vignesh,

If you would like any assistance processing the 100s of CF entries I
am happy to help in some way.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: ALTER ROLE documentation improvement

2024-01-18 Thread Nathan Bossart
On Thu, Jan 18, 2024 at 02:44:35PM -0700, David G. Johnston wrote:
> LGTM too.  I didn't go looking for anything else related to this, but the
> proposed changes all look needed.

Committed.  Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-18 Thread shveta malik
On Thu, Jan 18, 2024 at 10:31 AM Peter Smith  wrote:
>
> I have one question about the new code in v63-0002.
>
> ==
> src/backend/replication/logical/slotsync.c
>
> 1. ReplSlotSyncWorkerMain
>
> + Assert(SlotSyncWorker->pid == InvalidPid);
> +
> + /*
> + * Startup process signaled the slot sync worker to stop, so if meanwhile
> + * postmaster ended up starting the worker again, exit.
> + */
> + if (SlotSyncWorker->stopSignaled)
> + {
> + SpinLockRelease(&SlotSyncWorker->mutex);
> + proc_exit(0);
> + }
>
> Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in
> such a way that SlotSyncWorker->stopSignaled was already assigned
> true, but SlotSyncWorker->pid was not yet reset to InvalidPid?
>
> E.g., is the Assert above still OK?

We are good with the Assert here. I tried below cases:

1) When the slotsync worker is, say, killed using 'kill', that is treated
as SIGTERM; the slot sync worker invokes 'slotsync_worker_onexit()' before
going down and thus sets SlotSyncWorker->pid = InvalidPid. This means
that when it is restarted (assuming we have put the breakpoints in such
a way that the postmaster had already reached do_start_bgworker() before
promotion finished), it sees stopSignaled set but pid as
InvalidPid, and thus we are good.

2) Another case is when we kill the slot sync worker using 'kill -9' (or
say we make it crash). In such a case, the postmaster signals each sibling
process to quit (including the startup process) and cleans up the shared
memory used by each (including SlotSyncWorker), so promotion
fails. And if the slot sync worker is started again, it will
find pid as InvalidPid. So we are good.

thanks
Shveta




Re: Network failure may prevent promotion

2024-01-18 Thread Michael Paquier
On Thu, Jan 18, 2024 at 03:42:28PM +0200, Heikki Linnakangas wrote:
> Given that commit 728f86fec6 that introduced this issue was not strictly
> required, perhaps we should just revert it for v16.

Is there a point in keeping 728f86fec6 as well on HEAD?  That does not
strike me as wise to keep that in the tree for now.  If it needs to be
reworked, looking at this problem from scratch would be a safer
approach.
--
Michael


signature.asc
Description: PGP signature


Re: UUID v7

2024-01-18 Thread David G. Johnston
On Thu, Jan 18, 2024 at 11:31 AM Andrey Borodin 
wrote:

>
> Now I'm completely lost in time... I've set local time to NY (UTC-5).
>
> postgres=# select TIMESTAMP WITH TIME ZONE '2022-02-22 14:22:22-05' -
> TIMESTAMP WITH TIME ZONE 'Tuesday, February 22, 2022 2:22:22.00 PM
> GMT-05:00';
>  ?column?
> ----------
>  10:00:00
> (1 row)
>
>
You are mixing POSIX and ISO-8601 conventions and, as noted in our
appendix, they disagree on the direction that is positive.

https://www.postgresql.org/docs/current/datetime-posix-timezone-specs.html

The offset fields specify the hours, and optionally minutes and seconds,
difference from UTC. They have the format hh[:mm[:ss]] optionally with a
leading sign (+ or -). The positive sign is used for zones west of
Greenwich. (Note that this is the opposite of the ISO-8601 sign convention
used elsewhere in PostgreSQL.)

David J.


Re: remaining sql/json patches

2024-01-18 Thread Amit Langote
On Fri, Jan 19, 2024 at 2:11 AM Alvaro Herrera  wrote:
> On 2024-Jan-18, Alvaro Herrera wrote:
>
> > commands/explain.c (Hmm, I think this is a preexisting bug actually)
> >
> > 3893  18 : case T_TableFuncScan:
> > 3894  18 : Assert(rte->rtekind == RTE_TABLEFUNC);
> > 3895  18 : if (rte->tablefunc)
> > 3896   0 : if (rte->tablefunc->functype == 
> > TFT_XMLTABLE)
> > 3897   0 : objectname = "xmltable";
> > 3898 : else/* Must be 
> > TFT_JSON_TABLE */
> > 3899   0 : objectname = "json_table";
> > 3900 : else
> > 3901  18 : objectname = NULL;
> > 3902  18 : objecttag = "Table Function Name";
> > 3903  18 : break;
>
> Indeed -- the problem seems to be that add_rte_to_flat_rtable is
> creating a new RTE and zapping the ->tablefunc pointer for it.  So when
> EXPLAIN goes to examine the struct, there's a NULL pointer there and
> nothing is printed.

Ah yes.

> One simple fix is to change add_rte_to_flat_rtable so that it doesn't
> zero out the tablefunc pointer, but this is straight against what that
> function is trying to do, namely to remove substructure.

Yes.

>  Which means
> that we need to preserve the name somewhere else.  I added a new member
> to RangeTblEntry for this, which perhaps is a little ugly.  So here's
> the patch for that.
>
>  (I also added an alias to one XMLTABLE invocation
> under EXPLAIN, to show what it looks like when an alias is specified.
> Otherwise they're always shown as "XMLTABLE" "xmltable" which is a bit
> dumb).

Thanks for the patch.  Seems alright to me.

> Another possible way out is to decide that we don't want the
> "objectname" to be reported here.  I admit it's perhaps redundant.  In
> this case we'd just remove lines 3896-3899 shown above and let it be
> NULL.

Showing the function's name spelled out in the query (XMLTABLE /
JSON_TABLE) seems fine to me, even though maybe a bit redundant, yes.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




Re: Strange Bitmapset manipulation in DiscreteKnapsack()

2024-01-18 Thread Andy Fan


David Rowley  writes:

> On Fri, 19 Jan 2024 at 01:07, Andy Fan  wrote:
>> I find the following code in DiscreteKnapsack is weird.
>>
>>
>> for (i = 0; i <= max_weight; ++i)
>> {
>> values[i] = 0;
>>
>> ** memory allocation here, and the num_items bit is removed later **
>>
>> sets[i] = bms_make_singleton(num_items);
>> }
>>
>>
>> ** num_items bit is removed here **
>> result = bms_del_member(bms_copy(sets[max_weight]), num_items);
>
> It does not seem weird to me.  If the set is going to have multiple
> words then adding a member 1 higher than the highest we'll ever add
> ensures the set has enough words and we don't need to repalloc to grow
> the set when we bms_add_member().

Hmm, I missed this part; thanks for the explanation. If the bitset feature
gets in someday, future use cases like this can use bitset
directly and avoid this trick.
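
For readers skimming the thread, the trick being described is roughly this
(a sketch, not DiscreteKnapsack's actual code):

/*
 * Pre-size the set by creating it with one member beyond the largest real
 * member: bms_make_singleton(num_items) allocates enough words for members
 * 0..num_items, so later bms_add_member() calls never need to repalloc.
 */
Bitmapset  *set = bms_make_singleton(num_items);

for (int j = 0; j < num_items; j++)
	set = bms_add_member(set, j);

/* finally drop the dummy sizing member */
set = bms_del_member(set, num_items);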

-- 
Best Regards
Andy Fan





Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan


Hi,

Here is the summary of the open items; it would be great if Andres and
Matthias could have a look at this when they have time.

1. Shall we treat the LockBufHdr as a SpinLock usage.

>> LockBufHdr also uses the init_local_spin_delay / perform_spin_delay
>> infrastructure, and so it has the same issue as ${subject}; it is pretty
>> similar to the code in s_lock. Based on my current knowledge, I think we
>> should add the check there.
>
> I'd like to hear from Andres, if possible. @Andres: Should these
> sanity checks apply only to spin locks per se, or also to buffer
> header locks?
>

2. Where shall we put START/END_SPIN_LOCK()? Personally I'd
like spin.h. One of the reasons is that 'misc' usually makes me think of
things that are not well categorized, and hence many unrelated things
put together.

>> they were put into spin.h in v1 but later move to miscadmin.h at [1].
>> [1]
>> https://www.postgresql.org/message-id/CAEze2WggP-2Dhocmdhp-LxBzic%3DMXRgGA_tmv1G_9n-PDt2MQg%40mail.gmail.com
>
> I'm not entirely sure what the right thing to do is here, and the
> answer may depend on the previous question. But I disagree with
> Matthias -- I don't think miscadmin.h can be the right answer
> regardless.
>

-- 
Best Regards
Andy Fan





Re: UUID v7

2024-01-18 Thread Lukas Fittl
On Thu, Jan 18, 2024 at 5:18 AM Andrey Borodin  wrote:

> Current patch version attached. I've addressed all other requests:
> function renames, aliases, multiple functions instead of optional params,
> cleaner catalog definitions, not throwing error when [var,ver,time] value
> is unknown.
> What is left: deal with timezones, improve documentation.
>

I've done a test of the v10 patch, and ran into an interesting behavior
when passing in a timestamp to the function (which, as a side note, is
actually very useful to have as a feature, to support creating time-based
range partitions on UUIDv7 fields):

postgres=# SELECT uuid_extract_time(uuidv7());
     uuid_extract_time
----------------------------
 2024-01-18 18:49:00.01-08
(1 row)

postgres=# SELECT uuid_extract_time(uuidv7('2024-04-01'));
   uuid_extract_time
------------------------
 2024-04-01 00:00:00-07
(1 row)

postgres=# SELECT uuid_extract_time(uuidv7());
   uuid_extract_time
------------------------
 2024-04-01 00:00:00-07
(1 row)

Note how calling the uuidv7 function again, after having called it with a
fixed future timestamp, returns the future timestamp, even though it should
return the current time.

I believe this is caused by incorrectly re-using the cached
previous_timestamp. In the second call here (with a fixed future
timestamp), we end up setting ts and tms to 2024-04-01, with
increment_counter = false, which leads us to set previous_timestamp to the
passed in timestamp (else branch of the second if in uuidv7). When we then
call the function again without an argument, we end up getting a new
timestamp from gettimeofday, but because we try to detect backwards leaps,
we set increment_counter to true, and thus end up reusing the previous
(future) timestamp here:

/* protection from leap backward */
tms = previous_timestamp;

Not sure how to fix this, but clearly something is amiss here.
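
To make the described sequence concrete, a simplified sketch of the caching
logic at fault (hypothetical names, including current_time_ms(); not the
v10 patch's actual code):

/* Sketch of the described behavior, not the patch itself. */
static uint64 previous_timestamp = 0;

static uint64
uuidv7_choose_timestamp(bool have_arg, uint64 arg_ms)
{
	uint64		ts = have_arg ? arg_ms : current_time_ms();
	bool		increment_counter = (ts <= previous_timestamp);

	if (increment_counter)
		ts = previous_timestamp;	/* "protection from leap backward":
									 * reuses the cached (future) value */
	else
		previous_timestamp = ts;	/* a caller-supplied future timestamp
									 * poisons the cache here */
	return ts;
}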

Thanks,
Lukas

-- 
Lukas Fittl


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Melanie Plageman
On Thu, Jan 18, 2024 at 3:20 PM Robert Haas  wrote:
>
> On Thu, Jan 18, 2024 at 10:09 AM Robert Haas  wrote:
> > Oh, I see. Good catch.
> >
> > I've now committed 0001.
>
> I have now also committed 0002 and 0003. I made some modifications to
> 0003. Specifically:
>
> - I moved has_lpdead_items inside the loop over blocks instead of
> putting it at the function toplevel.
> - I adjusted the comments in lazy_scan_prune() because it seemed to me
> that the comment about "Now scan the page..." had gotten too far
> separated from the loop where that happens.
> - I combined two lines in an if-test because one of them was kinda short.
>
> Hope that's OK with you.

Awesome, thanks!

I have attached a rebased version of the former 0004 as v11-0001.

- Melanie
From e8028cce937a2f2b3e5fccb43c6a20868a3d3314 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 16 Jan 2024 17:28:07 -0500
Subject: [PATCH v11] Combine FSM updates for prune and no prune cases

lazy_scan_prune() and lazy_scan_noprune() update the freespace map with
identical conditions -- both with the goal of doing so once for each
page vacuumed. Combine both of their FSM updates. This consolidation is
easier now that cb970240f13df2 moved visibility map updates into
lazy_scan_prune().

While combining the FSM updates, simplify the logic for calling
lazy_scan_new_or_empty() and lazy_scan_noprune().

Discussion: https://postgr.es/m/CA%2BTgmoaLTvipm%3Dxx4rJLr07m908PCu%3DQH3uCjD1UOn8YaEuO2g%40mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 89 +++-
 1 file changed, 33 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0fb3953513..f9ce555fde 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -838,6 +838,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		Page		page;
 		bool		all_visible_according_to_vm;
 		bool		has_lpdead_items;
+		bool		got_cleanup_lock = false;
 
 		if (blkno == next_unskippable_block)
 		{
@@ -940,54 +941,28 @@ lazy_scan_heap(LVRelState *vacrel)
 		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
  vacrel->bstrategy);
 		page = BufferGetPage(buf);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
-			/* Check for new or empty pages before lazy_scan_noprune call */
-			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
-	   vmbuffer))
-			{
-/* Processed as new/empty page (lock and pin released) */
-continue;
-			}
+		got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
 
-			/*
-			 * Collect LP_DEAD items in dead_items array, count tuples,
-			 * determine if rel truncation is safe
-			 */
-			if (lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
-			{
-Size		freespace = 0;
-bool		recordfreespace;
+		if (!got_cleanup_lock)
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
-/*
- * We processed the page successfully (without a cleanup
- * lock).
- *
- * Update the FSM, just as we would in the case where
- * lazy_scan_prune() is called. Our goal is to update the
- * freespace map the last time we touch the page. If the
- * relation has no indexes, or if index vacuuming is disabled,
- * there will be no second heap pass; if this particular page
- * has no dead items, the second heap pass will not touch this
- * page. So, in those cases, update the FSM now.
- *
- * After a call to lazy_scan_prune(), we would also try to
- * adjust the page-level all-visible bit and the visibility
- * map, but we skip that step in this path.
- */
-recordfreespace = vacrel->nindexes == 0
-	|| !vacrel->do_index_vacuuming
-	|| !has_lpdead_items;
-if (recordfreespace)
-	freespace = PageGetHeapFreeSpace(page);
-UnlockReleaseBuffer(buf);
-if (recordfreespace)
-	RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
-continue;
-			}
+		/* Check for new or empty pages before lazy_scan_[no]prune call */
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, !got_cleanup_lock,
+   vmbuffer))
+		{
+			/* Processed as new/empty page (lock and pin released) */
+			continue;
+		}
 
+		/*
+		 * Collect LP_DEAD items in dead_items array, count tuples, determine
+		 * if rel truncation is safe. If lazy_scan_noprune() returns false, we
+		 * must get a cleanup lock and call lazy_scan_prune().
+		 */
+		if (!got_cleanup_lock &&
+			!lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
+		{
 			/*
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
@@ -995,13 +970,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
-		}
-
-		/* Check for new or empty pages before lazy_scan_prune call */
-		if (lazy_scan_new_or_empty(vacrel, 

Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo

2024-01-18 Thread Peter Smith
On Thu, Jan 18, 2024 at 8:54 PM Amit Kapila  wrote:
>
> On Thu, Jan 18, 2024 at 11:15 AM Peter Smith  wrote:
> >
> > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada  
> > wrote:
> > >
> > ...
> > >
> > > Although we can improve it to handle this case too, I'm not sure it's
> > > a bug. The doc says[1]:
> > >
> > > Specifies whether the subscription should be automatically disabled if
> > > any errors are detected by subscription workers during data
> > > replication from the publisher.
> > >
> > > When an apply worker is trying to establish a connection, it's not
> > > replicating data from the publisher.
> > >
> > > Regards,
> > >
> > > [1] 
> > > https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > >
> > > --
> > > Masahiko Sawada
> > > Amazon Web Services: https://aws.amazon.com
> >
> > Yeah, I had also seen that wording of those docs. And I agree it
> > leaves open some room for doubts because strictly from that wording it
> > can be interpreted that establishing the connection is not actually
> > "data replication from the publisher" in which case maybe there is no
> > bug.
> >
>
> As far as I remember that was the intention. The idea was that if there
> is a conflict during apply that users need to fix manually, they have
> the provision to stop the error from repeating. If we wish to extend the
> purpose of this option for another valid use case and there is a good
> way to achieve that, then we can discuss it, but I don't think we need
> to change it in back-branches.
>
> --

In case we want to proceed with this, here is a simple POC patch that
seems to do the job.

~~~

RESULT:

test_sub=# create subscription sub1 connection 'dbname=test_pub'
publication pub1;
NOTICE:  created replication slot "sub1" on publisher
CREATE SUBSCRIPTION
2024-01-19 11:50:33.385 AEDT [17905] LOG:  logical replication apply
worker for subscription "sub1" has started
2024-01-19 11:50:33.398 AEDT [17907] LOG:  logical replication table
synchronization worker for subscription "sub1", table "t1" has started
2024-01-19 11:50:33.481 AEDT [17907] LOG:  logical replication table
synchronization worker for subscription "sub1", table "t1" has
finished

test_sub=# alter subscription sub1 set (disable_on_error);
ALTER SUBSCRIPTION

test_sub=# alter subscription sub1 connection 'port=-1';
2024-01-19 11:51:00.696 AEDT [17905] LOG:  logical replication worker
for subscription "sub1" will restart because of a parameter change
ALTER SUBSCRIPTION
2024-01-19 11:51:00.704 AEDT [18649] LOG:  logical replication apply
worker for subscription "sub1" has started
2024-01-19 11:51:00.704 AEDT [18649] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-19 11:51:00.705 AEDT [18649] LOG:  subscription "sub1" has
been disabled because of an error

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-POC-disable-subscription-for-bad-connections.patch
Description: Binary data


Re: Oom on temp (un-analyzed table caused by JIT) V16.1 [Fixed Already]

2024-01-18 Thread Kirk Wolak
On Tue, Jan 16, 2024 at 3:43 AM Daniel Gustafsson  wrote:

> > On 16 Jan 2024, at 02:53, Kirk Wolak  wrote:
> >
> > On Mon, Jan 15, 2024 at 9:03 AM Daniel Gustafsson  > wrote:
> > > On 15 Jan 2024, at 07:24, Kirk Wolak  wol...@gmail.com>> wrote:
> >...
> > Okay, I took the latest source off of git (17devel) and got it to work
> there in a VM.
> >
> > It appears this issue is fixed.  It must have been related to the issue
> originally tagged.
>
> Thanks for testing and confirming!  Testing pre-release builds on real-life
> workloads is invaluable for the development of Postgres, so thank you for
> taking the time.

Daniel,
  I did a little more checking, and the reason I did not see the leak MIGHT
be because EXPLAIN did not show a JIT attempt.
I tried to use settings that FORCE a JIT...  But to no avail.

  I am now concerned that the problem is more hidden in my use case.
Meaning I CANNOT conclude it is fixed.
But I know of NO WAY to force a JIT (I lowered costs to 1, etc.).

  You don't know a way to force at least the JIT analysis to happen?
(because I already knew if JIT was off, the leak wouldn't happen).

Thanks,

Kirk Out!
PS: I assume there is no pg_jit(1) function I can call. LOL


Re: Sequence Access Methods, round two

2024-01-18 Thread Michael Paquier
On Thu, Jan 18, 2024 at 04:54:06PM +0100, Matthias van de Meent wrote:
> On Thu, 18 Jan 2024, 16:06 Peter Eisentraut,  wrote:
>> On 01.12.23 06:00, Michael Paquier wrote:
>>> Please find attached a patch set that aims at implementing sequence
>>> access methods, with callbacks following a model close to table and
>>> index AMs, with a few cases in mind:
>>> - Global sequences (including range-allocation, local caching).
>>> - Local custom computations (a-la-snowflake).
>>
>> That's a lot of code, but the use cases are summarized in two lines?!?
>>
>> I would like to see a lot more elaboration what these uses cases are (I
>> recognize the words, but do we have the same interpretation of them?)
>> and how they would be addressed by what you are proposing, and better
>> yet an actual implementation of something useful, rather than just a
>> dummy test module.
> 
> At $prevjob we had a use case for PRNG to generate small,
> non-sequential "random" numbers without the birthday problem occurring
> in sqrt(option space) because that'd increase the printed length of
> the numbers beyond a set limit. The sequence API proposed here
> would've been a great alternative to the solution we found, as it
> would allow a sequence to be backed by an Linear Congruential
> Generator directly, rather than the implementation of our own
> transactional random_sequence table.

Interesting.
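
For reference, here is what backing a sequence by an LCG could look like in
miniature -- a standalone sketch with textbook constants, not code from the
patch set (the constants satisfy the Hull-Dobell full-period conditions for
modulus 2^32, so every 32-bit value is visited exactly once before repeating):

#include <stdint.h>
#include <stdio.h>

static uint32_t
lcg_next(uint32_t state)
{
    /* a = 1664525, c = 1013904223: full period modulo 2^32 */
    return 1664525u * state + 1013904223u;
}

int
main(void)
{
    uint32_t v = 0;                 /* the stored "last value" */

    for (int i = 0; i < 5; i++)
    {
        v = lcg_next(v);            /* what a nextval callback would do */
        printf("%u\n", v);
    }
    return 0;
}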

Yes, one of the advantages of this API layer is that all the
computation is hidden behind a sequence object at the PostgreSQL
level, hence applications just need to set a GUC to select a given
computation method *while* still using the same DDLs from their
application, or just append USING to their CREATE SEQUENCE, but I've
heard that applications would just do the former and forget about it.

The reason why this stuff has bumped into my desk is that we have no
good solution in-core for globally-distributed transactions for
active-active deployments.  First, anything we have needs to be
plugged into default expressions of attributes like with [1] or [2],
or a tweak is to use sequence values that are computed with different
increments to avoid value overlaps across nodes.  Both of these
require application changes, which is meh for a bunch of users.  The
second approach, with integer-based values, can become particularly
painful if one has to fix value conflicts across nodes, as that usually
requires extra tweaks to the sequence definitions, especially if it
blocks applications in the middle of the night.  Sequence AMs offer
more control on that.  For example, snowflake IDs can rely on a GUC to
set a specific machine ID to force some of the bits of a 64-bit
integer to be the same for a single node in an active-active
deployment, ensuring that any value computed across *all* the nodes of
a cluster are always unique, while being maintained behind a sequence
object in-core.  (I can post a module to demonstrate that based on the
sequence AM APIs, just wait a min..  Having more than a test module
and/or a contrib is a separate discussion.)
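
To make the snowflake example concrete, here is a standalone sketch of
composing such a 64-bit value, assuming the common 41-bit millisecond
timestamp / 10-bit machine id / 12-bit sequence layout (the modules
referenced in this thread may differ in detail, and the machine-id GUC
itself is only described above, not shown here):

#include <stdint.h>

static uint64_t
snowflake_compose(uint64_t unix_ms, uint32_t machine_id, uint32_t seq)
{
    return (unix_ms << 22) |
           ((uint64_t) (machine_id & 0x3FF) << 12) |   /* 10-bit node id */
           (seq & 0xFFF);                              /* 12-bit sequence */
}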

By the way, patches 0001 to 0004 are just refactoring pieces.
Particularly, 0001 redesigns pg_sequence_last_value() to work across
the board for upgrades and dumps, while avoiding a scan of the
sequence "heap" relation in pg_dump.  These are improvements for the
core code in any case.

[1]: https://github.com/pgEdge/snowflake
[2]: 
https://www.postgresql.org/message-id/TY3PR01MB988983D23E4F1DA10567BC5BF5B9A%40TY3PR01MB9889.jpnprd01.prod.outlook.com
--
Michael


signature.asc
Description: PGP signature


Re: Synchronizing slots from primary to standby

2024-01-18 Thread Peter Smith
On Tue, Jan 9, 2024 at 11:15 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, January 9, 2024 9:17 AM Peter Smith  wrote:
> >
...
> >
> > 2. ALTER_REPLICATION_SLOT ... FAILOVER
> >
> > +  <varlistentry>
> > +   <term>
> > +    <literal>FAILOVER [ <replaceable class="parameter">boolean</replaceable> ]</literal>
> > +   </term>
> > +   <listitem>
> > +    <para>
> > +     If true, the slot is enabled to be synced to the physical
> > +     standbys so that logical replication can be resumed after
> > +     failover.
> > +    </para>
> > +   </listitem>
> > +  </varlistentry>
> >
> > This syntax says passing the boolean value is optional. So it needs to be
> > specified here in the docs that not passing a value would be the same as
> > passing the value true.
>
> The behavior that "not passing a value would be the same as passing the value
> true " is due to the rule of defGetBoolean(). And all the options of commands
> in this document behave the same in this case, therefore I think we'd better
> add documentation for it in a general place in a separate patch/thread instead of
> mentioning this in each option's paragraph.
>

Hi Hou-san,

I did as suggested and posted a patch for this in another thread [1].
Please see if it is OK.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPtDWSmW8uiRJF1LfGQJikmo7V2jdysLuRmtsanNZc7fNw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Add PQsendSyncMessage() to libpq

2024-01-18 Thread Anton Kirilov

Hello,

On 17/01/2024 07:30, Michael Paquier wrote:

Thanks for double-checking.  Done.

Thank you very much for taking care of my patch!

One thing that I noticed is that the TODO list on the PostgreSQL Wiki 
still contained an entry ( https://wiki.postgresql.org/wiki/Todo#libpq ) 
about adding pipelining support to libpq - perhaps it ought to be updated?


Best wishes,
Anton Kirilov




Re: minor replication slot docs edits

2024-01-18 Thread Peter Smith
On Thu, Jan 18, 2024 at 9:49 PM Alvaro Herrera  wrote:
...
>
> Another thing I noticed is that we could change all (or most of) the
> <varname> tags to <xref>, but it's also a much larger
> change.  Having (some of?) these variable names be links would be useful
> IMO.
>

+1 to do this.

IMO these should all be coded like <link linkend="..."><varname>XXX</varname></link>, because the resulting
rendering looks much better with the GUC name using a varname font
instead of just the plain text that <xref> gives.

I am happy to take on the task if nobody else wants to.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: 039_end_of_wal: error in "xl_tot_len zero" test

2024-01-18 Thread Thomas Munro
On Fri, Jan 19, 2024 at 1:47 AM Anton Voloshin
 wrote:
> I believe there is a small problem in the 039_end_of_wal.pl's
> "xl_tot_len zero" test. It supposes that after immediate shutdown the
> server, upon startup recovery, should always produce a message matching
> "invalid record length at .*: wanted 24, got 0". However, if the
> circumstances are just right and we happened to hit exactly on the edge
> of WAL page, then the message on startup recovery would be "invalid
> magic number  in log segment .*, offset .*". The test does not take
> that into account.

Hi Anton,

Thanks for figuring this out!  Right, I see.  I will look more closely
when I'm back from summer vacation in a few days, but first reaction:

> Now, reproducing this is somewhat tricky, because exact position in WAL
> at the test time depends on what exactly initdb did, and that not only
> differs in different major versions, but also depends on e.g. username
> length, locales available, and, perhaps, more. Even though originally
> this problem was found "in the wild" on one particular system on one
> particular code branch, I've written small helper patch to make
> reproduction on master easier, see
> 0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.
>
> This patch adds single emit_message of (hopefully) the right size to
> make sure we hit end of WAL block right by the time we call
> $node->stop('immediate') in "xl_tot_len zero" test. With this patch,
> "xl_tot_len zero" test fails every time because the server writes
> "invalid magic number  in log segment" while the test still only
> expects "invalid record length at .*: wanted 24, got 0". If course, this
> 0001 patch is *not* meant to be committed, but only as an issue
> reproduction helper.
>
> I can think of two possible fixes:
>
> 1. Update advance_out_of_record_splitting_zone to also avoid stopping at
> exactly the block end:
>
>   my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
> -while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
> +while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold ||
> $page_offset <= $SizeOfXLogShortPHD)
>   {
> see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch
>
> We need to compare with $SizeOfXLogShortPHD (and not with zero) because
> at that point, even though we haven't actually written out the new WAL page
> yet, its header is already in place in memory and taken into account
> for LSN reporting.

I like the fact that this preserves the same end-of-WAL case that
we're trying to test.  I don't yet have an opinion on the best way to
do it though.  Would it be enough to add emit_message($node, 0) after
advance_out_of_record_splitting_zone()?  The thing about this one
specific test that is different from the later ones is that it doesn't
actually write a record header at all, it was relying purely on
pre-existing trailing zeroes, but it assumed the page header would be
valid.  As you figured out, that isn't true if we were right on the
page boundary.  Perhaps advance_out_of_record_splitting_zone()
followed by emit_message(0) would make that always true, even then?

> 2. Alternatively, amend "xl_tot_len zero" test to expect "invalid magic
> number  in WAL segment" message as well:
>
>   $node->start;
>   ok( $node->log_contains(
> +"invalid magic number  in WAL segment|" .
>   "invalid record length at .*: expected at least 24, got 0",
> $log_size
>   ),
> see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch

Tolerating the two different messages would weaken the test.

> I think it makes sense to backport whatever the final change would be to
> all branches with 039_end_of_wal (REL_12+).

+1




Re: Built-in CTYPE provider

2024-01-18 Thread Jeff Davis
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would 
> regress the default sort order from "correct" to "useless".

I don't agree that the current default is "correct". There are a lot of
ways it can be wrong:

  * the environment variables at initdb time don't reflect what the
users of the database actually want
  * there are so many different users using so many different
applications connected to the database that no one "correct" sort order
exists
  * libc has some implementation quirks
  * the version of Unicode that libc is based on is not what you expect
  * the version of libc is not what you expect

>   Aside from 
> the overall message this sends about how PostgreSQL cares about
> locales 
> and Unicode and such.

Unicode is primarily about the semantics of characters and their
relationships. The patches I propose here do a great job of that.

Collation (relationships between *strings*) is a part of Unicode, but
not the whole thing or even the main thing.

> Maybe you don't intend for this to be the default provider?

I am not proposing that this provider be the initdb-time default.

>   But then
> who would really use it? I mean, sure, some people would, but how
> would 
> you even explain, in practice, the particular niche of users or use
> cases?

It's for users who want to respect Unicode and support text from
international sources in their database, but are not experts on the
subject and don't know precisely what they want or understand the
consequences. If and when such users do notice a problem with the sort
order, they'd handle it at that time (perhaps with a COLLATE clause, or
sorting in the application).

> Maybe if this new provider would be called "minimal", it might
> describe 
> the purpose better.

"Builtin" communicates that it's available everywhere (not a
dependency), that specific behaviors can be documented and tested, and
that behavior doesn't change within a major version. I want to
communicate all of those things.

> I could see a use for this builtin provider if it also included the 
> default UCA collation (what COLLATE UNICODE does now).

I won't rule that out, but I'm not proposing that right now and my
proposal is already offering useful functionality.

> There would still be a risk with that approach, since it would 
> permanently marginalize ICU functionality

Yeah, ICU already does a good job offering the root collation. I don't
think the builtin provider needs to do so.

> I would be curious what your overall vision is here?

Vision:

* The builtin provider will offer Unicode character semantics, basic
collation, platform-independence, and high performance. It can be used
on its own or in combination with ICU via the COLLATE clause.

* ICU offers COLLATE UNICODE, locale tailoring, case-insensitive
matching, and customization with rules. It's the solution for
everything from "slightly more advanced" to "very advanced".

* libc would be for databases serving applications on the same machine
where a matching sort order is helpful, risks to indexes are
acceptable, and performance is not important.

>   Is switching the 
> default to ICU still your goal?  Or do you want the builtin provider
> to 
> be the default?

It's hard to answer this question while initdb chooses the database
default collation based on the environment. Neither ICU nor the builtin
provider can reasonably handle whatever those environment variables
might be set to.

Stepping back from the focus on what initdb does, we should be
providing the right encouragement in documentation and packaging to
guide users toward the right provider based on their needs and the vision
outlined above.

Regards,
Jeff Davis





Re: Strange Bitmapset manipulation in DiscreteKnapsack()

2024-01-18 Thread David Rowley
On Thu, 18 Jan 2024 at 16:24, David Rowley  wrote:
> On Thu, 18 Jan 2024 at 15:22, Richard Guo  wrote:
> > Do you think we can use 'memcpy(a, b, BITMAPSET_SIZE(b->nwords))'
> > directly in the new bms_replace_members() instead of copying the
> > bitmapwords one by one, like:
> >
> > -   i = 0;
> > -   do
> > -   {
> > -   a->words[i] = b->words[i];
> > -   } while (++i < b->nwords);
> > -
> > -   a->nwords = b->nwords;
> > +   memcpy(a, b, BITMAPSET_SIZE(b->nwords));
> >
> > But I'm not sure if this is an improvement or not.
>
> I considered this earlier but felt it was going against the method
> used in other places in the file. However, on relooking I do see
> bms_copy() does a memcpy().

I feel it's not worth debating the memcpy thing any further, so I've
pushed the v2 patch.

Thanks for reviewing.

David




Re: ALTER ROLE documentation improvement

2024-01-18 Thread David G. Johnston
On Sun, Jan 14, 2024 at 6:59 PM Nathan Bossart 
wrote:

> On Sun, Jan 14, 2024 at 04:17:41PM +0530, vignesh C wrote:
> > The attached v3 version patch has the changes for the same.
>
> LGTM.  I'll wait a little while longer for additional feedback, but if none
> materializes, I'll commit this soon.
>
>
LGTM too.  I didn't go looking for anything else related to this, but the
proposed changes all look needed.

David J.


Re: Strange Bitmapset manipulation in DiscreteKnapsack()

2024-01-18 Thread David Rowley
On Fri, 19 Jan 2024 at 01:07, Andy Fan  wrote:
> I find the following code in DiscreteKnapsack weird.
>
>
> for (i = 0; i <= max_weight; ++i)
> {
> values[i] = 0;
>
> ** memory allocation here, and the num_items bit is removed later **
>
> sets[i] = bms_make_singleton(num_items);
> }
>
>
> ** num_items bit is removed here **
> result = bms_del_member(bms_copy(sets[max_weight]), num_items);

It does not seem weird to me.  If the set is going to have multiple
words then adding a member 1 higher than the highest we'll ever add
ensures the set has enough words and we don't need to repalloc to grow
the set when we bms_add_member().
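
For illustration, the trick being described looks like this -- a sketch
using the bitmapset.h API, where item_is_chosen() is a hypothetical
predicate, not code from DiscreteKnapsack():

extern bool item_is_chosen(int i);   /* hypothetical */

static Bitmapset *
collect_chosen_items(int num_items)
{
    /* a sentinel member one past the highest real member sizes the set */
    Bitmapset  *set = bms_make_singleton(num_items);

    for (int i = 0; i < num_items; i++)
        if (item_is_chosen(i))
            set = bms_add_member(set, i);   /* no repalloc needed */

    /* drop the sentinel before handing the set back */
    return bms_del_member(set, num_items);
}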

David




Re: UUID v7

2024-01-18 Thread Przemysław Sztoch

We are not allowed to consider any time other than UTC.

You need to write to the authors of the standard. I suppose this is a 
mistake.


I know from experience that errors in such standards most often appear 
in examples.

Nobody detects them at first.
Everyone reads and checks ideas, not calculations.
Then developers tear their hair out during implementation.

Andrey Borodin wrote on 1/18/2024 4:39 PM:



On 18 Jan 2024, at 19:20, Aleksander Alekseev  wrote:

Timestamp and TimestampTz are absolutely the same thing.

My question is not about Postgres data types. I'm asking about examples in the 
standard.

There's an example 017F22E2-79B0-7CC3-98C4-DC0C0C07398F. It is expected to be generated 
on "Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00".
It's explained to be 16455577420ns after 1582-10-15 00:00:00 UTC.

But 16455577420ns after 1582-10-15 00:00:00 UTC  was  2022-02-22 19:22:22 
UTC. And that was 2022-02-23 00:22:22 in UTC-05.


Best regards, Andrey Borodin.


--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: UUID v7

2024-01-18 Thread Przemysław Sztoch

Aleksander Alekseev wrote on 1/18/2024 3:20 PM:

Hi,


Another question: how did you choose between using TimestampTz and
Timestamp types? I realize that internally it's all the same. Maybe
Timestamp will be slightly better since the way it is displayed
doesn't depend on the session settings. Many people I talked to find
this part of TimestampTz confusing.

timestamptz internally always stores UTC.
I believe that in SQL, when operating with time in UTC, you should always use 
timestamptz.
timestamp is theoretically the same thing. But internally it does not convert 
time to UTC, which will lead to incorrect use.

No.

Timestamp and TimestampTz are absolutely the same thing. The only
difference is how they are shown to the user. TimestampTz uses session
context in order to be displayed in the TZ chosen by the user. Thus
typically it is somewhat more confusing to the users and thus I asked
whether there was a good reason to choose TimestampTz over Timestamp.



Theoretically, you're right. But look at this example:

SET timezone TO 'Europe/Warsaw';
SELECT extract(epoch from '2024-01-18 9:27:30'::timestamp), 
extract(epoch from '2024-01-18 9:27:30'::timestamptz);


 date_part  | date_part
------------+------------
 1705570050 | 1705566450
(1 row)

In my opinion, timestamptz gives greater guarantees that the time 
internally is in UTC and the user gets the time in his/her time zone.


In the case of timestamp, it is never certain whether it keeps time in 
UTC or in the local zone.


In the case of the argument's type, there would be no problem because we 
could create two functions.

Of course timestamp would be treated the same as timestamptz.
But here we have a problem with the function return type, which can only 
be one. And since the time returned is in UTC, it should be timestamptz.


--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: UUID v7

2024-01-18 Thread Przemysław Sztoch
Using localtime would be absurd. Especially since time goes back during 
the summer time change.
I believe our implementation should use UTC. No one forbids us from 
assuming that our local time for generating uuid is UTC.
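
For what it's worth, a generator that takes its milliseconds from
gettimeofday() gets UTC-based Unix time for free, regardless of the
server's timezone setting -- a standalone sketch, not the patch's code:

#include <stdint.h>
#include <sys/time.h>

static uint64_t
current_unix_ts_ms(void)
{
    struct timeval tv;

    /* seconds since the Unix epoch, in UTC; DST never enters into it */
    gettimeofday(&tv, NULL);
    return (uint64_t) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}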


Andrey Borodin wrote on 1/18/2024 2:17 PM:



On 17 Jan 2024, at 02:19, Jelte Fennema-Nio  wrote:

I want to ask Kyzer or Brad, I hope they will see this message. I'm working on 
the patch for time extraction for v1, v6 and v7.

Do I understand correctly that UUIDs contain local time, not UTC time? In the examples in 
[0], under "A.6. Example of a UUIDv7 Value", I see that February 22, 2022 
2:22:22.00 PM GMT-05:00 results in unix_ts_ms = 0x017F22E279B0, which is not UTC, but 
local time.
Is it intentional? Section "5.1. UUID Version 1" states otherwise.

If so, I should swap signatures of functions from TimestampTz to Timestamp.
I'm hard-coding examples from this standard to tests, so I want to be precise...

If I follow the standard I see this in tests:
+-- extract UUID v1, v6 and v7 timestamp
+SELECT uuid_extract_time('C232AB00-9414-11EC-B3C8-9F6BDECED846') at time zone 
'GMT-05';
+ timezone
+---------------------------
+ Wed Feb 23 00:22:22 2022
+(1 row)

Current patch version attached. I've addressed all other requests: function 
renames, aliases, multiple functions instead of optional params, cleaner 
catalog definitions, not throwing error when [var,ver,time] value is unknown.
What is left: deal with timezones, improve documentation.


Best regards, Andrey Borodin.

[0] 
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis#name-example-of-a-uuidv1-value


--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 10:09 AM Robert Haas  wrote:
> Oh, I see. Good catch.
>
> I've now committed 0001.

I have now also committed 0002 and 0003. I made some modifications to
0003. Specifically:

- I moved has_lpdead_items inside the loop over blocks instead of
putting it at the function toplevel.
- I adjusted the comments in lazy_scan_prune() because it seemed to me
that the comment about "Now scan the page..." had gotten too far
separated from the loop where that happens.
- I combined two lines in an if-test because one of them was kinda short.

Hope that's OK with you.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Built-in CTYPE provider

2024-01-18 Thread Daniel Verite
Peter Eisentraut wrote:

> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the human-favored sort.
> 
> I think that would be a terrible direction to take, because it would 
> regress the default sort order from "correct" to "useless".  Aside from 
> the overall message this sends about how PostgreSQL cares about
> locales and Unicode and such.

Well, offering a viable solution to avoid as much as possible
the dreaded:

"WARNING: collation "xyz" has version mismatch
... HINT: Rebuild all objects affected by this collation..."

that doesn't sound like a bad message to send. 

Currently, to have the indexes that don't need a linguistic order sorted
in codepoint order, you're supposed to use collate "C", which then means
that upper(), lower(), etc. don't work beyond ASCII.
Here our Unicode support is not good enough, and the proposal
addresses that.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite




Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 1:30 PM Andy Fan  wrote:
> > You added an #include to dynahash.c despite making no other changes to
> > the file.
>
> That's mainly because I put the code into miscadmin.h and spin.h depends
> on miscadmin.h for the macros.

That's not a good reason. Headers need to include headers on which
they depend; a .c file shouldn't be required to include one header
because it includes another.

> LockBufHdr also uses the init_local_spin_delay / perform_spin_delay
> infrastructure, so it has the same issue as ${subject}; it is pretty much
> like the code in s_lock. Based on my current knowledge, I think we
> should add the check there.

I'd like to hear from Andres, if possible. @Andres: Should these
sanity checks apply only to spin locks per se, or also to buffer
header locks?

> they were put into spin.h in v1 but later move to miscadmin.h at [1].
> [1]
> https://www.postgresql.org/message-id/CAEze2WggP-2Dhocmdhp-LxBzic%3DMXRgGA_tmv1G_9n-PDt2MQg%40mail.gmail.com

I'm not entirely sure what the right thing to do is here, and the
answer may depend on the previous question. But I disagree with
Matthias -- I don't think miscadmin.h can be the right answer
regardless.

> START_SPIN_LOCK needs to be a macro since it uses __FILE__ and __LINE__ to
> note where the SpinLock is held. The others are macros just for
> consistency. I think they can be changed to inline functions, at least
> VerifyNoSpinLocksHeld.

Good point about __FILE__ and __LINE__.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UUID v7

2024-01-18 Thread Sergey Prokhorenko
Hi Andrey,

You'd better generate a test UUIDv7 for midnight 1 Jan 1970 UTC. In this case, 
the timestamp in UUIDv7 according to the new RFC must be filled with zeros. By 
extracting the timestamp from this test UUIDv7, you should get exactly midnight 
1 Jan 1970 UTC.
I also recommend this article: https://habr.com/ru/articles/772954/
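
A standalone sketch of the extraction side of such a test might look like
this (the 48-bit big-endian unix_ts_ms layout is per the draft RFC; this is
not the patch's uuid_extract_time() code):

#include <stdint.h>

static uint64_t
uuidv7_unix_ts_ms(const uint8_t uuid[16])
{
    uint64_t ms = 0;

    /* the first six bytes hold unix_ts_ms, big-endian; for the suggested
     * all-zero test vector this must decode to 0, i.e. midnight
     * 1 Jan 1970 UTC */
    for (int i = 0; i < 6; i++)
        ms = (ms << 8) | uuid[i];
    return ms;
}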


Sergey Prokhorenko
sergeyprokhore...@yahoo.com.au 

On Thursday, 18 January 2024 at 09:31:16 pm GMT+3, Andrey Borodin 
 wrote:  
 
 

> On 18 Jan 2024, at 20:39, Andrey Borodin  wrote:
> 
> But 16455577420ns after 1582-10-15 00:00:00 UTC  was  2022-02-22 19:22:22 
> UTC. And that was 2022-02-23 00:22:22 in UTC-05.


'2022-02-22 19:22:22 UTC' is exactly that moment which was encoded into example 
UUIDs. It's not '2022-02-23 00:22:22 in UTC-05' as I thought.
I got confused by the "at time zone" conversion, which in fact removes timezone 
information. And that's per SQL standard...

Now I'm completely lost in time... I've set local time to NY (UTC-5).

postgres=# select TIMESTAMP WITH TIME ZONE '2022-02-22 14:22:22-05' - TIMESTAMP 
WITH TIME ZONE 'Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00';
 ?column? 
----------
 10:00:00
(1 row)

postgres=# select TIMESTAMP WITH TIME ZONE 'Tuesday, February 22, 2022 
2:22:22.00 PM GMT-05:00';
      timestamptz
------------------------
 2022-02-22 04:22:22-05
(1 row)

I cannot wrap my mind around it... Any pointers would be appreciated.
I'm certain that code extracted UTC time correctly, I just want a reliable test 
that verifies timestamp constant (+ I understand what is going on).


Best regards, Andrey Borodin.  

Re: UUID v7

2024-01-18 Thread Andrey Borodin



> On 18 Jan 2024, at 20:39, Andrey Borodin  wrote:
> 
> But 16455577420ns after 1582-10-15 00:00:00 UTC  was  2022-02-22 19:22:22 
> UTC. And that was 2022-02-23 00:22:22 in UTC-05.


'2022-02-22 19:22:22 UTC' is exactly that moment which was encoded into example 
UUIDs. It's not '2022-02-23 00:22:22 in UTC-05' as I thought.
I got confused by the "at time zone" conversion, which in fact removes timezone 
information. And that's per SQL standard...

Now I'm completely lost in time... I've set local time to NY (UTC-5).

postgres=# select TIMESTAMP WITH TIME ZONE '2022-02-22 14:22:22-05' - TIMESTAMP 
WITH TIME ZONE 'Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00';
 ?column? 
----------
 10:00:00
(1 row)

postgres=# select TIMESTAMP WITH TIME ZONE 'Tuesday, February 22, 2022 
2:22:22.00 PM GMT-05:00';
      timestamptz
------------------------
 2022-02-22 04:22:22-05
(1 row)

I cannot wrap my mind around it... Any pointers would be appreciated.
I'm certain that code extracted UTC time correctly, I just want a reliable test 
that verifies timestamp constant (+ I understand what is going on).


Best regards, Andrey Borodin.



Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan


Hi Robert,

Thanks for your attention!

> On Thu, Jan 18, 2024 at 7:56 AM Andy Fan  wrote:
>> Do you still have interest with making some progress on this topic?
>
> Some, but it's definitely not my top priority. I wish I could give as
> much attention as everyone would like to everyone's patches, but I
> can't even come close.

Your point is fair enough.

After reading your comments, I decide to talk more before sending the
next version.

>
> I think that the stack that you set up in START_SPIN_LOCK() is crazy.
> That would only be necessary if it were legal to acquire multiple
> spinlocks at the same time, which it definitely isn't. Also, doing
> memory allocation and deallocation here appears highly undesirable -
> even if we did need to support multiple spinlocks, it would be better
> to handle this using something like the approach we already use for
> lwlocks, where there is a fixed size array and we blow up if it
> overflows.

I wanted to disallow acquiring multiple spinlocks at the same time in
the first version, but later I thought that was beyond the scope of
this patch. Now I prefer to disallow it. If there is no objection in
the following days, I will do this in the next version. After this, we
don't need malloc at all.
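
To sketch what the no-malloc version could look like once holding more than
one spinlock is disallowed (names and details are hypothetical, and
elog/PANIC assume the usual postgres.h environment):

/* a single static slot is enough when at most one spinlock may be held */
static const char *held_spinlock_file = NULL;
static int         held_spinlock_line = 0;

#define START_SPIN_LOCK() \
    do { \
        held_spinlock_file = __FILE__; \
        held_spinlock_line = __LINE__; \
    } while (0)

#define END_SPIN_LOCK() \
    do { held_spinlock_file = NULL; } while (0)

static inline void
VerifyNoSpinLocksHeld(void)
{
    if (held_spinlock_file != NULL)
        elog(PANIC, "a spinlock acquired at %s:%d is still held",
             held_spinlock_file, held_spinlock_line);
}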

>
> ASSERT_NO_SPIN_LOCK() looks odd, because I would expect it to be
> spelled Assert(!AnySpinLockHeld()). But looking deeper, I see that it
> doesn't internally Assert() but rather does something else. Maybe the
> name needs some workshopping. SpinLockMustNotBeHeldHere()?
> VerifyNoSpinLocksHeld()?

Yes, it is not an Assert since I want to provide more information about
where the SpinLock was held. Assert doesn't have such a capability, but
elog(PANIC, ...) can put more information before the PANIC.

VerifyNoSpinLocksHeld looks much more professional than
ASSERT_NO_SPIN_LOCK; I will use this in the next version.


> I think we should check that no spinlock is held in a few additional
> places: the start of SpinLockAcquire(), and the start of errstart().

Agreed.

> You added an #include to dynahash.c despite making no other changes to
> the file.

That's mainly because I put the code into miscadmin.h and spin.h depends
on miscadmin.h for the macros.

>
> I don't know whether the choice to treat buffer header locks as
> spinlocks is correct. It seems like it at least deserves a comment,
> and possibly some discussion on this mailing list about whether that's
> the right idea. I'm not sure that we have all the same restrictions
> for buffer header locks as we do for spinlocks in general, but I'm
> also not sure that we don't.

LockBufHdr also uses the init_local_spin_delay / perform_spin_delay
infrastructure, so it has the same issue as ${subject}; it is pretty much
like the code in s_lock. Based on my current knowledge, I think we
should add the check there.

>
> On a related note, the patch overall has 0 comments. I don't know that
> it needs a lot, but 0 isn't many at all.

hmm, I tried to write a good commit message, but comments do need some
improvement, thanks for highlighting this!

>
> miscadmin.h doesn't seem like a good place for this. It's a
> widely-included header file and these checks should be needed in
> relatively few places; also, they're not really related to most of
> what's in that file, IIRC.

they were put into spin.h in v1 but later move to miscadmin.h at [1]. 

> I also wonder why we're using macros instead of static inline
> functions.

START_SPIN_LOCK needs to be a macro since it uses __FILE__ and __LINE__ to
note where the SpinLock is held. The others are macros just for
consistency. I think they can be changed to inline functions, at least
VerifyNoSpinLocksHeld.

[1]
https://www.postgresql.org/message-id/CAEze2WggP-2Dhocmdhp-LxBzic%3DMXRgGA_tmv1G_9n-PDt2MQg%40mail.gmail.com

-- 
Best Regards
Andy Fan





Re: UUID v7

2024-01-18 Thread Sergey Prokhorenko
Hi Andrey,

Aleksander Alekseev wrote: "If this is the case, I think the example is indeed 
wrong". 

This is one of the reasons why I was categorically against any examples of 
implementation in the new RFC. The examples have been very poorly studied and 
discussed, and therefore it is better not to use them at all. But the text of 
the RFC itself clearly refers to UTC, and not at all to local time: "UUID 
version 7 features a time-ordered value field derived from the widely 
implemented and well known Unix Epoch timestamp source, the number of 
milliseconds since midnight 1 Jan 1970 UTC, leap seconds excluded". The main 
reason for using UTC is so that UUIDv7's, generated approximately 
simultaneously in different time zones, are correctly ordered in time when they 
get into one database.


Sergey Prokhorenko
sergeyprokhore...@yahoo.com.au 

On Thursday, 18 January 2024 at 07:22:05 pm GMT+3, Aleksander Alekseev 
 wrote:  
 
 Hi Andrey,

> > Timestamp and TimestampTz are absolutely the same thing.
> My question is not about Postgres data types. I'm asking about examples in 
> the standard.
>
> There's an example 017F22E2-79B0-7CC3-98C4-DC0C0C07398F. It is expected to be 
> generated on "Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00".
> It's explained to be 16455577420ns after 1582-10-15 00:00:00 UTC.
>
> But 16455577420ns after 1582-10-15 00:00:00 UTC  was  2022-02-22 19:22:22 
> UTC. And that was 2022-02-23 00:22:22 in UTC-05.

Not 100% sure which text you are referring to exactly, but I'm
guessing it's section B.2 of [1]

"""
This example UUIDv7 test vector utilizes a well-known 32 bit Unix
epoch with additional millisecond precision to fill the first 48 bits
[...]
The timestamp is Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00
represented as 0x17F22E279B0 or 1645557742000
"""

If this is the case, I think the example is indeed wrong:

```
=# select extract(epoch from 'Tuesday, February 22, 2022 2:22:22.00 PM
GMT-05:00' :: timestamptz)*1000;
      ?column?
-------------------
 1645521742000.00
(1 row)
```

And the difference between the value in the text and the actual value
is 10 hours as you pointed out.

Also you named the date 1582-10-15 00:00:00 UTC. Maybe you actually
meant 1970-01-01 00:00:00 UTC?

[1]: 
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html

-- 
Best regards,
Aleksander Alekseev
  

BUG: Former primary node might stuck when started as a standby

2024-01-18 Thread Alexander Lakhin

Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node that acted as a primary becomes a standby, as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not go online, due to the error:
new timeline N forked off current database system timeline M before current 
recovery point X/X

A complete TAP test is attached.
I put it in src/test/recovery/t, run as follows:
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break; 
done

and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr  0.00 sys +  4.20 cusr  4.75 
csys =  8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in the _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0:  - Transaction/COMMIT: 2024-01-18 
15:21:02.258739+00

2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO 
t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG:  xlog flush 
request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO 
t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 
15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 
15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 
15:21:02.258864+00

2024-01-18 15:21:02.259 UTC [663563] DEBUG:  server process (PID 663701) exited 
with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG:  forked new backend, pid=663704 
socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 
'background activity');

2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 
0/304DC40:  - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background 
activity');

2024-01-18 15:21:02.261 UTC [663563] DEBUG:  postmaster received shutdown 
request signal
2024-01-18 15:21:02.261 UTC [663563] LOG:  received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68:  - Transaction/COMMIT: 2024-01-18 
15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background 
activity');

2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG:  xlog flush 
request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background 
activity');

...
2024-01-18 15:21:02.262 UTC [663563] LOG:  database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit

...
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - 
Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: 
rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - 
Transaction/COMMIT: 2024-01-18 15:21:02.261828+00

...
2024-01-18 15:21:02.481 UTC [663819] LOG:  fetching timeline history file for 
timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG:  started streaming WAL from primary 
at 0/300 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL:  End of WAL reached on timeline 19 
at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG:  new timeline 20 forked off current database system timeline 19 before current 
recovery point 0/304DC68


In this case, node1 wrote record 0/304DC68 to its WAL, but sent to node2
only up to record 0/304DBF0; then node2, being promoted to primary, forked
the next timeline from that point, but when node1 was started as a standby,
it first replayed 0/304DC68 from its local WAL, and then could not switch
to the new timeline starting from the earlier position.

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander

099_change_roles.pl
Description: Perl program


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 12:15 PM Peter Geoghegan  wrote:
> It's not okay if you fail to update the FSM a second time in the
> second heap pass -- at least in some cases. It's reasonably frequent
> for a page that has 0 usable free space when lazy_scan_prune returns
> to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page()
> is done with it.

Oh, good point. That's an important subtlety I had missed.

> That's a part of the problem too, I guess.
>
> The actual available free space on each page is literally changing all
> the time, when measured at FSM_CATEGORIES-wise granularity -- which
> leads to a mad dash among backends that all need the same amount of
> free space for their new tuple. One reason why other systems pretty
> much require coarse-grained increments of free space is the need to
> manage the WAL overhead for a crash-safe FSM/free list structure.

Interesting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2024 at 11:46 AM Robert Haas  wrote:
> On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan  wrote:
> > True. But the way that PageGetHeapFreeSpace() returns 0 for a page
> > with 291 LP_DEAD stubs is a much older behavior. When that happens it
> > is literally true that the page has lots of free space. And yet it's
> > not free space we can actually use. Not until those LP_DEAD items are
> > marked LP_UNUSED.
>
> To me, this is just accurate reporting. What we care about in this
> context is the amount of free space on the page that can be used to
> store a new tuple. When there are no line pointers available to be
> allocated, that amount is 0.

I agree. All I'm saying is this (can't imagine you'll disagree):

It's not okay if you fail to update the FSM a second time in the
second heap pass -- at least in some cases. It's reasonably frequent
for a page that has 0 usable free space when lazy_scan_prune returns
to go on to have almost BLCKSZ free space once lazy_vacuum_heap_page()
is done with it.

While I am sympathetic to the argument that LP_DEAD item space just
isn't that important in general, that doesn't apply with this one
special case. This is a "step function" behavior, and is seen whenever
VACUUM runs following bulk deletes of tuples -- a rather common case.
Clearly the FSM shouldn't show pages that are actually completely
empty at the end of VACUUM as having no available free space after a
VACUUM finishes (on account of how they looked immediately after
lazy_scan_prune ran). That'd just be wrong.

> > Another big source of inaccuracies here is that we don't credit
> > RECENTLY_DEAD tuple space with being free space. Maybe that isn't a
> > huge problem, but it makes it even harder to believe that precision in
> > FSM accounting is an intrinsic good.
>
> The difficulty here is that we don't know how long it will be before
> that space can be reused. Those recently dead tuples could become dead
> within a few milliseconds or stick around for hours. I've wondered
> about the merits of some FSM that had built-in visibility awareness,
> i.e. the capability to record something like "page X currently has Y
> space free and after XID Z is all-visible it will have Y' space free".
> That seems complex, but without it, we either have to bet that the
> space will actually become free before anyone tries to use it, or that
> it won't. If whatever guess we make is wrong, bad things happen.

All true -- it is rather complex.

Other systems with a heap table access method based on a foundation of
2PL (Oracle, DB2) literally need a transactionally consistent FSM
structure. In fact I believe that Oracle literally requires the
equivalent of an MVCC snapshot read (a "consistent get") to be able to
access what seems like it ought to be strictly a physical data
structure correctly. Everything needs to work in the rollback path,
independent of whatever else may happen to the page before an xact
rolls back (i.e. independently of what other xacts might end up doing
with the page). This requires very tight coordination to avoid bugs
where a transaction cannot roll back due to not having enough free
space to restore the original tuple during UNDO.

I don't think it's desirable to have anything as delicate as that
here. But some rudimentary understanding of free space being
allocated/leased to certain transactions and/or backends does seem
like a good idea. There is some intrinsic value to these sorts of
behaviors, even in a system without any UNDO segments, where it is
never strictly necessary.

> I think that the completely deterministic nature of the computation is
> a mistake regardless of anything else. That serves to focus contention
> rather than spreading it out, which is dumb, and would still be dumb
> with any other number of FSM_CATEGORIES.

That's a part of the problem too, I guess.

The actual available free space on each page is literally changing all
the time, when measured at FSM_CATEGORIES-wise granularity -- which
leads to a mad dash among backends that all need the same amount of
free space for their new tuple. One reason why other systems pretty
much require coarse-grained increments of free space is the need to
manage the WAL overhead for a crash-safe FSM/free list structure.

> Yeah. I'm not sure we're actually going to change that right now, but
> I agree with the high-level point regardless, which I would summarize
> like this: The current system provides more precision about available
> free space than we actually need, while failing to provide some other
> things that we really do need. We need not agree today on exactly what
> those other things are or how best to get them in order to agree that
> the current system has significant flaws, and we do agree that it
> does.

I agree with this.

-- 
Peter Geoghegan




Re: remaining sql/json patches

2024-01-18 Thread Alvaro Herrera
On 2024-Jan-18, Alvaro Herrera wrote:

> commands/explain.c (Hmm, I think this is a preexisting bug actually)
> 
> 3893  18 : case T_TableFuncScan:
> 3894  18 : Assert(rte->rtekind == RTE_TABLEFUNC);
> 3895  18 : if (rte->tablefunc)
> 3896   0 : if (rte->tablefunc->functype == 
> TFT_XMLTABLE)
> 3897   0 : objectname = "xmltable";
> 3898 : else/* Must be 
> TFT_JSON_TABLE */
> 3899   0 : objectname = "json_table";
> 3900 : else
> 3901  18 : objectname = NULL;
> 3902  18 : objecttag = "Table Function Name";
> 3903  18 : break;

Indeed -- the problem seems to be that add_rte_to_flat_rtable
creates a new RTE and zaps the ->tablefunc pointer for it.  So when
EXPLAIN goes to examine the struct, there's a NULL pointer there and
nothing is printed.

One simple fix is to change add_rte_to_flat_rtable so that it doesn't
zero out the tablefunc pointer, but this is straight against what that
function is trying to do, namely to remove substructure.  Which means
that we need to preserve the name somewhere else.  I added a new member
to RangeTblEntry for this, which perhaps is a little ugly.  So here's
the patch for that.  (I also added an alias to one XMLTABLE invocation
under EXPLAIN, to show what it looks like when an alias is specified.
Otherwise they're always shown as "XMLTABLE" "xmltable" which is a bit
dumb).

Another possible way out is to decide that we don't want the
"objectname" to be reported here.  I admit it's perhaps redundant.  In
this case we'd just remove lines 3896-3899 shown above and let it be
NULL.

Thoughts?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
>From 00d2885ddd0475f787e0d64807f8eb2858c95ff0 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 18 Jan 2024 18:07:30 +0100
Subject: [PATCH] Show function name in TableFuncScan

Previously we were only showing the user-specified alias, but this is
clearly not the code's intent.
---
 src/backend/commands/explain.c  |  2 +-
 src/backend/nodes/outfuncs.c|  1 +
 src/backend/nodes/readfuncs.c   |  1 +
 src/backend/parser/parse_relation.c |  4 ++--
 src/include/nodes/parsenodes.h  |  1 +
 src/test/regress/expected/xml_1.out | 22 +++---
 src/test/regress/sql/xml.sql|  2 +-
 7 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 3d590a6b9f..4df715e344 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3892,7 +3892,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 			break;
 		case T_TableFuncScan:
 			Assert(rte->rtekind == RTE_TABLEFUNC);
-			objectname = "xmltable";
+			objectname = rte->tablefunc_name;
 			objecttag = "Table Function Name";
 			break;
 		case T_ValuesScan:
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 296ba84518..b42daaba53 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -531,6 +531,7 @@ _outRangeTblEntry(StringInfo str, const RangeTblEntry *node)
 			break;
 		case RTE_TABLEFUNC:
 			WRITE_NODE_FIELD(tablefunc);
+			WRITE_STRING_FIELD(tablefunc_name);
 			break;
 		case RTE_VALUES:
 			WRITE_NODE_FIELD(values_lists);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 1624b34581..925192cb07 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -528,6 +528,7 @@ _readRangeTblEntry(void)
 			break;
 		case RTE_TABLEFUNC:
 			READ_NODE_FIELD(tablefunc);
+			READ_STRING_FIELD(tablefunc_name);
 			/* The RTE must have a copy of the column type info, if any */
 			if (local_node->tablefunc)
 			{
diff --git a/src/backend/parser/parse_relation.c b/src/backend/parser/parse_relation.c
index 34a0ec5901..65e54abdd1 100644
--- a/src/backend/parser/parse_relation.c
+++ b/src/backend/parser/parse_relation.c
@@ -2073,17 +2073,17 @@ addRangeTableEntryForTableFunc(ParseState *pstate,
 	Assert(list_length(tf->coltypmods) == list_length(tf->colnames));
 	Assert(list_length(tf->colcollations) == list_length(tf->colnames));
 
-	refname = alias ? alias->aliasname : pstrdup("xmltable");
-
 	rte->rtekind = RTE_TABLEFUNC;
 	rte->relid = InvalidOid;
 	rte->subquery = NULL;
 	rte->tablefunc = tf;
+	rte->tablefunc_name = pstrdup("XMLTABLE");
 	rte->coltypes = tf->coltypes;
 	rte->coltypmods = tf->coltypmods;
 	rte->colcollations = tf->colcollations;
 	rte->alias = alias;
 
+	refname = alias ? alias->aliasname : pstrdup("xmltable");
 	eref = alias ? copyObject(alias) : makeAlias(refname, NIL);
 	numaliases = list_length(eref->colnames);
 
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index b3181

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 11:17 AM Peter Geoghegan  wrote:
> True. But the way that PageGetHeapFreeSpace() returns 0 for a page
> with 291 LP_DEAD stubs is a much older behavior. When that happens it
> is literally true that the page has lots of free space. And yet it's
> not free space we can actually use. Not until those LP_DEAD items are
> marked LP_UNUSED.

To me, this is just accurate reporting. What we care about in this
context is the amount of free space on the page that can be used to
store a new tuple. When there are no line pointers available to be
allocated, that amount is 0.

> Another big source of inaccuracies here is that we don't credit
> RECENTLY_DEAD tuple space with being free space. Maybe that isn't a
> huge problem, but it makes it even harder to believe that precision in
> FSM accounting is an intrinsic good.

The difficulty here is that we don't know how long it will be before
that space can be reused. Those recently dead tuples could become dead
within a few milliseconds or stick around for hours. I've wondered
about the merits of some FSM that had built-in visibility awareness,
i.e. the capability to record something like "page X currently has Y
space free and after XID Z is all-visible it will have Y' space free".
That seems complex, but without it, we either have to bet that the
space will actually become free before anyone tries to use it, or that
it won't. If whatever guess we make is wrong, bad things happen.
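
A rough sketch of what one entry in such a visibility-aware FSM might
carry (hypothetical structure and field names, using the usual Postgres
typedefs; nothing like this exists today):

/*
 * Hypothetical: free space that is usable now, plus space that only
 * becomes usable once a cutoff XID is older than every snapshot --
 * "page X has Y free now, Y' free after XID Z is all-visible".
 */
typedef struct VisiFSMEntry
{
	BlockNumber		blkno;			/* heap block */
	uint8			avail_now;		/* category usable immediately */
	uint8			avail_later;	/* category once cutoff_xid ages out */
	TransactionId	cutoff_xid;		/* XID the RECENTLY_DEAD space waits on */
} VisiFSMEntry;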

> My remarks about "FSM_CATEGORIES-wise precision" were basically
> remarks about the fundamental problem with the free space map. Which
> is really that it's just a map of free space, that gives exactly zero
> thought to various high level things that *obviously* matter. I wasn't
> particularly planning on getting into the specifics of that with you
> now, on this thread.

Fair.

> A brief recap might be useful: other systems with a heap table AM free
> space management structure typically represent the free space
> available on each page using a far more coarse grained counter.
> Usually one with less than 10 distinct increments. The immediate
> problem with FSM_CATEGORIES having such a fine granularity is that it
> increases contention/competition among backends that need to find some
> free space for a new tuple. They'll all diligently try to find the
> page with the least free space that still satisfies their immediate
> needs -- there is no thought for the second-order effects, which are
> really important in practice.

I think that the completely deterministic nature of the computation is
a mistake regardless of anything else. That serves to focus contention
rather than spreading it out, which is dumb, and would still be dumb
with any other number of FSM_CATEGORIES.

> What I really wanted to convey is this: if you're going to go the
> route of ignoring LP_DEAD free space during vacuuming, you're
> conceding that having a high degree of precision about available free
> space isn't actually useful (or wouldn't be useful if it was actually
> possible at all). Which is something that I generally agree with. I'd
> just like it to be clear that you/Melanie are in fact taking one small
> step in that direction. We don't need to discuss possible later steps
> beyond that first step. Not right now.

Yeah. I'm not sure we're actually going to change that right now, but
I agree with the high-level point regardless, which I would summarize
like this: The current system provides more precision about available
free space than we actually need, while failing to provide some other
things that we really do need. We need not agree today on exactly what
those other things are or how best to get them in order to agree that
the current system has significant flaws, and we do agree that it
does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

2024-01-18 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 03:03, Peter Geoghegan  wrote:
>
> On Mon, Jan 15, 2024 at 2:32 PM Matthias van de Meent wrote:
> > Can you pull these planner changes into their own commit(s)?
> > As mentioned upthread, it's a significant change in behavior that
> > should have separate consideration and reference in the commit log. I
> > really don't think it should be buried in the 5th paragraph of an
> > "Enhance nbtree ScalarArrayOp execution" commit. Additionally, the
> > changes of btree are arguably independent of the planner changes, as
> > the btree changes improve performance even if we ignore that it
> > implements strict result ordering.
>
> I'm not going to break out the planner changes, because they're *not*
> independent in any way.

The planner changes depend on the btree changes, that I agree with.
However, I don't think that the btree changes depend on the planner
changes.

> You could say the same thing about practically
> any work that changes the planner. They're "buried" in the 5th
> paragraph of the commit message. If an interested party can't even
> read that far to gain some understanding of a legitimately complicated
> piece of work such as this, I'm okay with that.

I would agree with you if this was about new APIs and features, but
here existing APIs are being repurposed without changing them. A
maintainer of index AMs would not look at the commit title 'Enhance
nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
my existing amsearcharray+amcanorder handling actually supports
non-prefix array keys and returns data in index order".
There are also no changes in amapi.h that would signal any index AM
author that expectations have changed. I really don't think you can
just ignore all that, and I believe this to also be the basis of
Heikki's first comment.

> As I said to Heikki, thinking about this some more is on my todo list.
> I mean the way that this worked had substantial revisions on HEAD
> right before I posted v9. v9 was only to fix the bit rot that that
> caused.

Then I'll be waiting for that TODO item to be resolved.

> > > +++ b/src/backend/access/nbtree/nbtutils.c
> > > +/*
> > > + * _bt_merge_arrays() -- merge together duplicate array keys
> > > + *
> > > + * Both scan keys have array elements that have already been sorted and
> > > + * deduplicated.
> > > + */
> >
> > As I mentioned upthread, I find this function to be very wasteful, as
> > it uses N binary searches to merge join two already sorted arrays,
> > resulting in a O(n log(m)) complexity, whereas a normal merge join
> > should be O(n + m) once the input datasets are sorted.
>
> And as I mentioned upthread, I think that you're making a mountain out
> of a molehill here. This is not a merge join.

We're merging two sorted sets and want to retain only the matching
entries, while retaining the order of the data. AFAIK the best
algorithm available for this is a sort-merge join.
Sure, it isn't a MergeJoin plan node, but that's not what I was trying to argue.
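
For what it's worth, here is a minimal sketch of the O(n + m) in-place
intersection being discussed, simplified to int64 keys; the real code
would compare Datums through the ORDER proc and handle collations:

/*
 * Merge two sorted, deduplicated arrays, keeping only elements present
 * in both.  Matches are written back into 'a', so no extra allocation
 * is needed.  Returns the number of surviving elements.
 */
static int
merge_intersect(int64 *a, int na, const int64 *b, int nb)
{
	int			i = 0,
				j = 0,
				nout = 0;

	while (i < na && j < nb)
	{
		if (a[i] < b[j])
			i++;				/* a[i] has no partner in b */
		else if (a[i] > b[j])
			j++;				/* b[j] has no partner in a */
		else
		{
			a[nout++] = a[i];	/* common element: keep it */
			i++;
			j++;
		}
	}
	return nout;
}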

> Even single digit
> thousand of array elements should be considered huge. Plus this code
> path is only hit with poorly written queries.

Would you accept suggestions?

> > Please fix this, as it shows up in profiling of large array merges.
> > Additionally, as it merges two arrays of unique items into one,
> > storing only matching entries, I feel that it is quite wasteful to do
> > this additional allocation here. Why not reuse the original allocation
> > immediately?
>
> Because that's adding more complexity for a code path that will hardly
> ever be used in practice. For something that happens once, during a
> preprocessing step.

Or many times, when we're in a parameterized loop, as was also pointed
out by Heikki. While I do think it is rare, the existence of this path
that merges these arrays implies a need for merging these arrays,
which thus deserves to be efficient.

> > > +_bt_tuple_before_array_skeys(IndexScanDesc scan, BTReadPageState *pstate,
> > > + IndexTuple tuple, int sktrig, bool 
> > > validtrig)
> >
> > I don't quite understand what the 'validtrig' argument is used for.
> > There is an assertion that it is false under some conditions in this
> > code, but it's not at all clear to me why that would have to be the
> > case - it is called with `true` in one of the three call sites. Could
> > the meaning of this be clarified?
>
> Sure, I'll add a comment.

Thanks!

> > I also feel that this patch includes several optimizations such as
> > this sktrig argument which aren't easy to understand. Could you pull
> > that into a separately reviewable patch?
>
> It probably makes sense to add the extra preprocessing stuff out into
> its own commit, since I tend to agree that that's an optimization that
> can be treated as unrelated (and isn't essential to the main thrust of
> the patch).
>
> However, the sktrig thing isn't really like that. We need to do things
> that way for required inequality scan keys. It doesn't make sense to
>

Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Matthias van de Meent
On Thu, 18 Jan 2024 at 13:41, Montana Low  wrote:
>
> The overall trend in machine learning embedding sizes has been growing 
> rapidly over the last few years from 128 up to 4K dimensions yielding 
> additional value and quality improvements. It's not clear when this trend in 
> growth will ease. The output of the leading text embedding models now
> exceeds the index storage available in IndexTupleData.t_info.
>
> The current index tuple size is stored in 13 bits of IndexTupleData.t_info, 
> which limits the max size of an index tuple to 2^13 = 8192 bytes. Vectors 
> implemented by pgvector currently use a 32 bit float for elements, which 
> limits vector size to 2K dimensions, which is no longer state of the art.
>
> I've attached a patch that increases  IndexTupleData.t_info from 16bits to 
> 32bits allowing for significantly larger index tuple sizes. I would guess 
> this patch is not a complete implementation that allows for migration from 
> previous versions, but it does compile and initdb succeeds. I'd be happy to 
> continue work if the core team is receptive to an update in this area, and 
> I'd appreciate any feedback the community has on the approach.

I'm not sure why this is needed.
Vector data indexing generally requires bespoke index methods, which
are not currently available in the core PostgreSQL repository, and
indexes are not at all required to utilize the IndexTupleData format
for their data tuples (one example of this being BRIN).
The only hard requirement in AMs which use Postgres' relfile format is
that they follow the Page layout and optionally the pd_linp/ItemId
array, which limit the size of Page tuples to 2^15-1 (see
ItemIdData.lp_len) and ~2^16-bytes
(PageHeaderData.pd_pagesize_version).
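
For reference, the line pointer layout that imposes the 2^15-1 length
limit comes from src/include/storage/itemid.h:

typedef struct ItemIdData
{
	unsigned	lp_off:15,		/* offset to tuple (from start of page) */
				lp_flags:2,		/* state of line pointer, see below */
				lp_len:15;		/* byte length of tuple */
} ItemIdData;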

Next, the only non-internal use of IndexTuple is in IndexOnlyScans.
However, here the index may fill the scandesc->xs_hitup with a heap
tuple instead, which has a length stored in uint32, too. So, I don't
quite see why this would be required for all indexes.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: UUID v7

2024-01-18 Thread Aleksander Alekseev
Hi Andrey,

> > Timestamp and TimestampTz are absolutely the same thing.
> My question is not about Postgres data types. I'm asking about examples in 
> the standard.
>
> There's an example 017F22E2-79B0-7CC3-98C4-DC0C0C07398F. It is expected to be 
> generated on "Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00".
> It's explained to be 16455577420ns after 1582-10-15 00:00:00 UTC.
>
> But 16455577420ns after 1582-10-15 00:00:00 UTC was 2022-02-22 19:22:22
> UTC. And that was 2022-02-23 00:22:22 in UTC-05.

Not 100% sure which text you are referring to exactly, but I'm
guessing it's section B.2 of [1]

"""
This example UUIDv7 test vector utilizes a well-known 32 bit Unix
epoch with additional millisecond precision to fill the first 48 bits
[...]
The timestamp is Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00
represented as 0x17F22E279B0 or 1645557742000
"""

If this is the case, I think the example is indeed wrong:

```
=# select extract(epoch from 'Tuesday, February 22, 2022 2:22:22.00 PM
GMT-05:00' :: timestamptz)*1000;
   ?column?
--
 1645521742000.00
(1 row)
```

And the difference between the value in the text and the actual value
is 10 hours as you pointed out.
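
The hex value in the draft does check out against the UTC
interpretation; a quick standalone check (plain C, nothing
Postgres-specific):

#include <stdio.h>
#include <time.h>

int
main(void)
{
	long long	ms = 0x17F22E279B0LL;	/* 1645557742000, per the draft */
	time_t		secs = (time_t) (ms / 1000);
	char		buf[64];

	strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", gmtime(&secs));
	printf("%lld ms -> %s\n", ms, buf);	/* 2022-02-22 19:22:22 UTC */
	return 0;
}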

Also you named the date 1582-10-15 00:00:00 UTC. Maybe you actually
meant 1970-01-01 00:00:00 UTC?

[1]: 
https://www.ietf.org/archive/id/draft-peabody-dispatch-new-uuid-format-04.html

-- 
Best regards,
Aleksander Alekseev




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2024 at 10:43 AM Robert Haas  wrote:
> I think we're agreeing but I want to be sure. If we only set LP_DEAD
> items to LP_UNUSED, that frees no space. But if doing so allows us to
> truncate the line pointer array, then that frees a little bit of
> space. Right?

That's part of it, yes.

> One problem with using this as a justification for the status quo is
> that truncating the line pointer array is a relatively recent
> behavior. It's certainly much newer than the choice to have VACUUM
> touch the FSM in the second pass rather than the first.

True. But the way that PageGetHeapFreeSpace() returns 0 for a page
with 291 LP_DEAD stubs is a much older behavior. When that happens it
is literally true that the page has lots of free space. And yet it's
not free space we can actually use. Not until those LP_DEAD items are
marked LP_UNUSED.

> Another problem is that the amount of space that we're freeing up in
> the second pass is really quite minimal even when it's >0. Any tuple
> that actually contains any data at all is at least 32 bytes, and most
> of them are quite a bit larger. Item pointers are 2 bytes. To save
> enough space to fit even one additional tuple, we'd have to free *at
> least* 16 line pointers. That's going to be really rare.

I basically agree with this. I would still worry about the "291
LP_DEAD stubs makes PageGetHeapFreeSpace return 0" thing specifically,
though. It's sort of a special case.
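
For context, the 291 figure is just MaxHeapTuplesPerPage for the default
8kB block size; from src/include/access/htup_details.h:

/*
 * MaxHeapTuplesPerPage =
 *     (BLCKSZ - SizeOfPageHeaderData) /
 *     (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))
 *   = (8192 - 24) / (24 + 4) = 291
 *
 * Once a page carries that many LP_DEAD stubs, no further line pointer
 * can be allocated, so PageGetHeapFreeSpace() reports 0 no matter how
 * many free bytes remain.
 */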

> And even if it happens, is it even useful to advertise that free
> space? Do we want to cram one more tuple into a page that has a
> history of extremely heavy updates? Could it be that it's smarter to
> just forget about that free space?

I think so, yes.

Another big source of inaccuracies here is that we don't credit
RECENTLY_DEAD tuple space with being free space. Maybe that isn't a
huge problem, but it makes it even harder to believe that precision in
FSM accounting is an intrinsic good.

> > You'd likely prefer a simpler argument for doing this -- an argument
> > that doesn't require abandoning/discrediting the idea that a high
> > degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure
> > that that's possible -- the current design is at least correct on its
> > own terms. And what you propose to do will probably be less correct on
> > those same terms, silly though they are.
>
> I've never really understood why you think that the number of
> FSM_CATEGORIES is the problem. I believe I recall you endorsing a
> system where pages are open or closed, to try to achieve temporal
> locality of data.

My remarks about "FSM_CATEGORIES-wise precision" were basically
remarks about the fundamental problem with the free space map. Which
is really that it's just a map of free space, that gives exactly zero
thought to various high level things that *obviously* matter. I wasn't
particularly planning on getting into the specifics of that with you
now, on this thread.

A brief recap might be useful: other systems with a heap table AM free
space management structure typically represent the free space
available on each page using a far more coarse grained counter.
Usually one with less than 10 distinct increments. The immediate
problem with FSM_CATEGORIES having such a fine granularity is that it
increases contention/competition among backends that need to find some
free space for a new tuple. They'll all diligently try to find the
page with the least free space that still satisfies their immediate
needs -- there is no thought for the second-order effects, which are
really important in practice.
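
A sketch of the kind of coarse-grained bucketing described above
(hypothetical; FSM_CATEGORIES in PostgreSQL is 256, and nothing below is
actual fsmpage.c code):

#define COARSE_BUCKETS	8	/* "fewer than 10 distinct increments" */

static int
coarse_space_category(size_t avail, size_t blcksz)
{
	/* each bucket covers blcksz / COARSE_BUCKETS bytes of free space */
	size_t		cat = (avail * COARSE_BUCKETS) / blcksz;

	return (cat >= COARSE_BUCKETS) ? COARSE_BUCKETS - 1 : (int) cat;
}

With so few buckets, backends stop competing for the single "best fit"
page and spread their insertions across anything in the right bucket.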

> But all of that is just an argument that reducing the number of
> FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that
> it's better. My current belief is that it isn't better, just a vehicle
> to do something else that maybe is better, like squeezing open/closed
> tracking or similar into the existing bit space. My understanding is
> that you think it would be better on its own terms, but I have not yet
> been able to grasp why that would be so.

I'm not really arguing that reducing FSM_CATEGORIES and changing
nothing else would be better on its own (it might be, but that's not
what I meant to convey).

What I really wanted to convey is this: if you're going to go the
route of ignoring LP_DEAD free space during vacuuming, you're
conceding that having a high degree of precision about available free
space isn't actually useful (or wouldn't be useful if it was actually
possible at all). Which is something that I generally agree with. I'd
just like it to be clear that you/Melanie are in fact taking one small
step in that direction. We don't need to discuss possible later steps
beyond that first step. Not right now.

-- 
Peter Geoghegan




Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Tom Lane
I wrote:
> On a micro level, this makes sizeof(IndexTupleData) be not maxaligned,
> which is likely to cause problems on alignment-picky hardware, or else
> result in space wastage if we were careful to MAXALIGN() everywhere.
> (Which we should have been, but I don't care to bet on it.)  A lot of
> people would be sad if their indexes got noticeably bigger when they
> weren't getting anything out of that.

After thinking about that a bit more, there might be a way out that
both avoids bloating index tuples that don't need it, and avoids
the compatibility problem.  How about defining that if the
INDEX_SIZE_MASK bits aren't zero, they are the tuple size as now;
but if they are zero, then the size appears in a separate uint16
field following the existing IndexTupleData fields.  We could perhaps
also rethink how the nulls bitmap storage works in this "v2"
index tuple header layout?  In any case, I'd expect to end up in
a place where (on 64-bit hardware) you pay an extra MAXALIGN quantum
for either an oversize tuple or a nulls bitmap, but only one quantum
when you have both, and nothing above today when the tuple is not
oversize.

This'd complicate tuple construction and inspection a bit, but
it would avoid building an enormous lot of infrastructure to deal
with transitioning to a not-upward-compatible definition.
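
If I've understood the proposal, the "v2" header would look roughly like
this (a sketch of the idea only; the field names are invented here):

/* Hypothetical on-disk layout sketch, not actual PostgreSQL code */
typedef struct IndexTupleV2Data
{
	ItemPointerData t_tid;		/* reference TID, as today */
	unsigned short	t_info;		/* flags; INDEX_SIZE_MASK bits all zero */
	unsigned short	t_len_ext;	/* real length, present only when the
								 * t_info size bits read as zero */
	/* nulls bitmap (if any) and data follow, MAXALIGN'd as today */
} IndexTupleV2Data;

Existing tuples, whose size bits are nonzero, would be decoded exactly
as before, which is what sidesteps the on-disk compatibility problem.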

regards, tom lane




Re: index prefetching

2024-01-18 Thread Tomas Vondra
On 1/17/24 09:45, Konstantin Knizhnik wrote:
> I have integrated your prefetch patch in Neon and it actually works!
> Moreover, I combined it with prefetch of leaf pages for IOS and it also
> seems to work.
> 

Cool! And do you think this is the right design/way to do this?

> Just a small notice: you are reporting `blks_prefetch_rounds` in explain,
> but it is not incremented anywhere.
> Moreover, I do not precisely understand what it means and wonder whether
> such information is useful for analyzing a query execution plan.
> Also, your patch always reports the number of prefetched blocks (and
> rounds) if they are not zero.
> 

Right, this needs fixing.

> I think that adding new information to explain may cause some
> problems because there are a lot of different tools which parse the
> explain report to visualize it, make recommendations to improve
> performance, ... Certainly good practice for such tools is to ignore
> all unknown tags, but I am not sure that everybody follows this
> practice.
> It seems safer, and at the same time convenient for users, to add an
> extra tag to explain to enable/disable prefetch info (as it was done
> in Neon).
> 

I think we want to add this info to explain, but maybe it should be
behind a new flag and disabled by default.

> Here we come back to my custom explain patch;) Actually using it is not
> necessary. You can manually add a "prefetch" option to Postgres core (as
> it is currently done in Neon).
> 

Yeah, I think that's the right solution.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: index prefetching

2024-01-18 Thread Tomas Vondra
On 1/16/24 21:10, Konstantin Knizhnik wrote:
> 
> ...
> 
>> 4. I think that performing prefetch at executor level is really great
>>> idea and so prefetch can be used by all indexes, including custom
>>> indexes. But prefetch will be efficient only if index can provide fast
>>> access to next TID (located at the same page). I am not sure that it is
>>> true for all builtin indexes (GIN, GIST, BRIN,...) and especially for
>>> custom AM. I wonder if we should extend AM API to make index make a
>>> decision weather to perform prefetch of TIDs or not.
>> I'm not against having a flag to enable/disable prefetching, but the
>> question is whether doing prefetching for such indexes can be harmful.
>> I'm not sure about that.
> 
> I tend to agree with you - it is hard to imagine an index implementation
> which doesn't win from prefetching heap pages.
> Maybe only the filtering case you have mentioned. But it seems to me
> that the current B-Tree index scan (not IOS) implementation in Postgres
> doesn't try to use the index tuple to check an extra condition - it will
> fetch the heap tuple in any case.
> 

That's true, but that's why I started working on this:

https://commitfest.postgresql.org/46/4352/

I need to think about how to combine that with the prefetching. The good
thing is that both changes require fetching TIDs, not slots. I think the
condition can be simply added to the prefetch callback.
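
Something along these lines, presumably (the callback name and signature
here are assumptions, not the actual API of either patch):

/*
 * Hypothetical: let the prefetcher skip heap pages for TIDs whose index
 * tuple already fails a pushed-down condition.
 */
typedef bool (*IndexPrefetchFilter) (IndexScanDesc scan,
									 IndexTuple itup,
									 ItemPointer tid,
									 void *filter_arg);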


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Sequence Access Methods, round two

2024-01-18 Thread Matthias van de Meent
On Thu, 18 Jan 2024, 16:06 Peter Eisentraut,  wrote:
>
> On 01.12.23 06:00, Michael Paquier wrote:
> > Please find attached a patch set that aims at implementing sequence
> > access methods, with callbacks following a model close to table and
> > index AMs, with a few cases in mind:
> > - Global sequences (including range-allocation, local caching).
> > - Local custom computations (a-la-snowflake).
>
> That's a lot of code, but the use cases are summarized in two lines?!?
>
> I would like to see a lot more elaboration on what these use cases are (I
> recognize the words, but do we have the same interpretation of them?)
> and how they would be addressed by what you are proposing, and better
> yet an actual implementation of something useful, rather than just a
> dummy test module.

At $prevjob we had a use case for a PRNG to generate small,
non-sequential "random" numbers without the birthday problem occurring
at sqrt(option space), because that would increase the printed length
of the numbers beyond a set limit. The sequence API proposed here
would've been a great alternative to the solution we found, as it
would allow a sequence to be backed by a Linear Congruential
Generator directly, rather than by our own implementation of a
transactional random_sequence table.
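
For the archives, a minimal sketch of the kind of generator meant here
(standalone C; the constants are examples chosen to satisfy the
Hull-Dobell conditions, so the generator has full period and visits
every value in [0, m) exactly once before repeating):

#include <stdint.h>

#define LCG_M	UINT64_C(100000000)	/* 8-digit option space */
#define LCG_A	UINT64_C(21)		/* a-1 divisible by 4 and by m's primes 2, 5 */
#define LCG_C	UINT64_C(1234567)	/* coprime with m */

/* nextval-style step: a permutation of [0, LCG_M), so no birthday collisions */
static uint64_t
lcg_next(uint64_t x)
{
	return (LCG_A * x + LCG_C) % LCG_M;
}

A sequence AM could keep x as the sequence's state and hand out
successive lcg_next(x) values: non-sequential-looking output with
guaranteed uniqueness and a bounded printed length.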

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 10:42 AM Robert Haas  wrote:
> Now, that said, I suspect that we actually could reduce FSM_CATEGORIES
> somewhat without causing any real problems, because many tables are
> going to have tuples that are all about the same size, and even in a
> table where the sizes vary more than is typical, a single tuple can't
> consume more than a quarter of the page,

Actually, I think that's a soft limit, not a hard limit. But the
granularity above that level probably doesn't need to be very high, at
least.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: More new SQL/JSON item methods

2024-01-18 Thread Peter Eisentraut

On 18.01.24 15:25, Jeevan Chalke wrote:
> Peter, I didn't understand why the changes you did in your 0002 patch
> were required here. I did run pgindent, and it didn't complain to
> me. So, I'm just curious to know more about the changes. I have not
> merged those changes into this single patch. However, if needed, they
> can be cleanly applied on top of it.


I just thought it was a bit wasteful with vertical space.  It's not 
essential.





Re: Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Tom Lane
Montana Low  writes:
> I've attached a patch that increases  IndexTupleData.t_info from 16bits to
> 32bits allowing for significantly larger index tuple sizes.

I fear this idea is a non-starter because it'd break on-disk
compatibility.  Certainly, if we were to try to pursue it, there'd
need to be an enormous amount of effort spent on dealing with existing
indexes and transition mechanisms.  I don't think you've made the case
why that would be time well spent.

On a micro level, this makes sizeof(IndexTupleData) be not maxaligned,
which is likely to cause problems on alignment-picky hardware, or else
result in space wastage if we were careful to MAXALIGN() everywhere.
(Which we should have been, but I don't care to bet on it.)  A lot of
people would be sad if their indexes got noticeably bigger when they
weren't getting anything out of that.

regards, tom lane




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 10:09 AM Peter Geoghegan  wrote:
> The problem with your justification for moving things in that
> direction (if any) is that it is occasionally not quite true: there
> are at least some cases where line pointer truncation after making a
> page's LP_DEAD items -> LP_UNUSED will actually matter. Plus
> PageGetHeapFreeSpace() will return 0 if and when
> "PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage &&
> !PageHasFreeLinePointers(page)". Of course, nothing stops you from
> compensating for this by anticipating what will happen later on, and
> assuming that the page already has that much free space.

I think we're agreeing but I want to be sure. If we only set LP_DEAD
items to LP_UNUSED, that frees no space. But if doing so allows us to
truncate the line pointer array, then that frees a little bit of
space. Right?

One problem with using this as a justification for the status quo is
that truncating the line pointer array is a relatively recent
behavior. It's certainly much newer than the choice to have VACUUM
touch the FSM in the second pass rather than the first.

Another problem is that the amount of space that we're freeing up in
the second pass is really quite minimal even when it's >0. Any tuple
that actually contains any data at all is at least 32 bytes, and most
of them are quite a bit larger. Item pointers are 2 bytes. To save
enough space to fit even one additional tuple, we'd have to free *at
least* 16 line pointers. That's going to be really rare.

And even if it happens, is it even useful to advertise that free
space? Do we want to cram one more tuple into a page that has a
history of extremely heavy updates? Could it be that it's smarter to
just forget about that free space? You've written before about the
stupidity of cramming tuples of different generations into the same
page, and that concept seems to apply here. When we heap_page_prune(),
we don't know how much time has elapsed since the page was last
modified - but if we're lucky, it might not be very much. Updating the
FSM at that time gives us some shot of filling up the page with data
created around the same time as the existing page contents. By the
time we vacuum the indexes and come back, that temporal locality is
definitely lost.

> You'd likely prefer a simpler argument for doing this -- an argument
> that doesn't require abandoning/discrediting the idea that a high
> degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure
> that that's possible -- the current design is at least correct on its
> own terms. And what you propose to do will probably be less correct on
> those same terms, silly though they are.

I've never really understood why you think that the number of
FSM_CATEGORIES is the problem. I believe I recall you endorsing a
system where pages are open or closed, to try to achieve temporal
locality of data. I agree that such a system could work better than
what we have now. I think there's a risk that such a system could
create pathological cases where the behavior is much worse than what
we have today, and I think we'd need to consider carefully what such
cases might exist and what mitigation strategies might make sense.
However, I don't see a reason why such a system should intrinsically
want to reduce FSM_CATEGORIES. If we have two open pages and one of
them has enough space for the tuple we're now trying to insert and the
other doesn't, we'd still like to avoid having the FSM hand us the one
that doesn't.

Now, that said, I suspect that we actually could reduce FSM_CATEGORIES
somewhat without causing any real problems, because many tables are
going to have tuples that are all about the same size, and even in a
table where the sizes vary more than is typical, a single tuple can't
consume more than a quarter of the page, so granularity above that
point seems completely useless. So if we needed some bitspace to track
the open/closed status of pages or similar, I suspect we could find
that in the existing FSM byte per page without losing anything. But
all of that is just an argument that reducing the number of
FSM_CATEGORIES is *acceptable*; it doesn't amount to an argument that
it's better. My current belief is that it isn't better, just a vehicle
to do something else that maybe is better, like squeezing open/closed
tracking or similar into the existing bit space. My understanding is
that you think it would be better on its own terms, but I have not yet
been able to grasp why that would be so.
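
To put rough numbers on that (default 8kB pages assumed):

/*
 * Each usable FSM category covers BLCKSZ / 256 = 32 bytes.  With a soft
 * per-tuple ceiling around a quarter page (~2kB), a request can never
 * demand more than roughly category 64, so the upper three quarters of
 * the 0..255 range only distinguishes degrees of emptiness that a
 * single insertion can't exploit.
 */
#define FSM_CAT_STEP		(8192 / 256)				/* 32 bytes per category */
#define QUARTER_PAGE_CAT	((8192 / 4) / FSM_CAT_STEP)	/* category 64 */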

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UUID v7

2024-01-18 Thread Andrey Borodin



> On 18 Jan 2024, at 19:20, Aleksander Alekseev wrote:
> 
> Timestamp and TimestampTz are absolutely the same thing.
My question is not about Postgres data types. I'm asking about examples in the 
standard.

There's an example 017F22E2-79B0-7CC3-98C4-DC0C0C07398F. It is expected to be 
generated on "Tuesday, February 22, 2022 2:22:22.00 PM GMT-05:00".
It's explained to be 16455577420ns after 1582-10-15 00:00:00 UTC.

But 16455577420ns after 1582-10-15 00:00:00 UTC was 2022-02-22 19:22:22
UTC. And that was 2022-02-23 00:22:22 in UTC-05.


Best regards, Andrey Borodin.



Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 9:53 AM Melanie Plageman wrote:
> I believe I've done this in attached v10.

Oh, I see. Good catch.

I've now committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Peter Geoghegan
On Thu, Jan 18, 2024 at 8:52 AM Robert Haas  wrote:
> But I also said one more thing that I'd still like to hear your
> thoughts about, which is: why is it right to update the FSM after the
> second heap pass rather than the first one? I can't help but suspect
> this is an algorithmic holdover from pre-HOT days, when VACUUM's first
> heap pass was read-only and all the work happened in the second pass.
> Now, nearly all of the free space that will ever become free becomes
> free in the first pass, so why not advertise it then, instead of
> waiting?

I don't think that doing everything FSM-related in the first heap pass
is a bad idea -- especially not if it buys you something elsewhere.

The problem with your justification for moving things in that
direction (if any) is that it is occasionally not quite true: there
are at least some cases where line pointer truncation after making a
page's LP_DEAD items -> LP_UNUSED will actually matter. Plus
PageGetHeapFreeSpace() will return 0 if and when
"PageGetMaxOffsetNumber(page) > MaxHeapTuplesPerPage &&
!PageHasFreeLinePointers(page)". Of course, nothing stops you from
compensating for this by anticipating what will happen later on, and
assuming that the page already has that much free space.

It might even be okay to just not try to compensate for anything,
PageGetHeapFreeSpace-wise -- just do all FSM stuff in the first heap
pass, and ignore all this. I happen to believe that a FSM_CATEGORIES
of 256 is way too much granularity to be useful in practice -- I just
don't have any faith in the idea that that kind of granularity is
useful (it's quite the opposite).

A further justification might be what we already do in the heapam.c
REDO routines: the way that we use XLogRecordPageWithFreeSpace already
operates with far less precision than the corresponding code from
vacuumlazy.c. heap_xlog_prune() already has recovery do what you
propose to do during original execution; it doesn't try to avoid
duplicating an anticipated call to XLogRecordPageWithFreeSpace that'll
take place when heap_xlog_vacuum() runs against the same page a bit
later on.

You'd likely prefer a simpler argument for doing this -- an argument
that doesn't require abandoning/discrediting the idea that a high
degree of FSM_CATEGORIES-wise precision is a valuable thing. Not sure
that that's possible -- the current design is at least correct on its
own terms. And what you propose to do will probably be less correct on
those same terms, silly though they are.

--
Peter Geoghegan




Re: Sequence Access Methods, round two

2024-01-18 Thread Peter Eisentraut

On 01.12.23 06:00, Michael Paquier wrote:

> Please find attached a patch set that aims at implementing sequence
> access methods, with callbacks following a model close to table and
> index AMs, with a few cases in mind:
> - Global sequences (including range-allocation, local caching).
> - Local custom computations (a-la-snowflake).


That's a lot of code, but the use cases are summarized in two lines?!?

I would like to see a lot more elaboration on what these use cases are (I 
recognize the words, but do we have the same interpretation of them?) 
and how they would be addressed by what you are proposing, and better 
yet an actual implementation of something useful, rather than just a 
dummy test module.






Re: System username in pg_stat_activity

2024-01-18 Thread Magnus Hagander
On Mon, Jan 15, 2024 at 11:17 AM Bertrand Drouvot wrote:
>
> Hi,
>
> On Fri, Jan 12, 2024 at 05:16:53PM +0100, Magnus Hagander wrote:
> > On Thu, Jan 11, 2024 at 5:55 PM Bertrand Drouvot wrote:
> > >
> > > I'm wondering if it would make sense to populate it for parallel workers 
> > > too.
> > > I think it's doable thanks to d951052, but I'm not sure it's worth it 
> > > (one could
> > > join based on the leader_pid though). OTOH that would be consistent with
> > > how the SYSTEM_USER behaves with parallel workers (it's populated).
> >
> > I guess one could conceptually argue that "authentication happens in
> > the leader". But we do populate it with the other user records, and
> > it'd be weird if this one was excluded.
> >
> > The tricky thing is that pgstat_bestart() is called long before we
> > deserialize the data. But from what I can tell it should be safe to
> > change it per the attached? That should be AFAICT an extremely short
> > window of time longer before we report it, not enough to matter.
>
> Thanks! Yeah, that seems reasonable to me. Also, I think we should remove the
> "MyProcPort" test here then (looking at v3):
>
> +   if (MyProcPort && MyClientConnectionInfo.authn_id)
> +   strlcpy(lbeentry.st_auth_identity, 
> MyClientConnectionInfo.authn_id, NAMEDATALEN);
> +   else
> +   MemSet(&lbeentry.st_auth_identity, 0, 
> sizeof(lbeentry.st_auth_identity));
>
> to get the st_auth_identity propagated to the parallel workers.

Yup, I had done that in v4 which as you noted further down, I forgot to post.


> > > - what about "Contains the same value as the identity part in <xref linkend="system-user" />"?
>
> Not sure, but looks like you missed this comment?

I did. Agree with your comment, and updated now.


> > > +# Users with md5 auth should show both auth method and name in 
> > > pg_stat_activity
> > >
> > > what about "show both auth method and identity"?
> >
> > Good spot, yeah, I changed it over to identity everywhere else so it
> > should be here as well.
>
> Did you forget to share the new revision (aka v4)? I can only see the
> "reorder_parallel_worker_bestart.patch" attached.

I did. Here it is, and also including that suggested docs fix as well
as a rebase on current master.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 210c7c0b02..8fa0e5474e 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23462,7 +23462,7 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
 
   

-
+
  system_user
 
 system_user
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6e74138a69..25eb7a4bf4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -784,6 +784,28 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
+   auth_method text
+  
+  
+   The authentication method used for authenticating the connection, or
+   NULL for background processes.
+  
+ 
+
+ 
+  
+   auth_identity text
+  
+  
+   The identity (if any) that the user presented during the authentication
+   cycle before they were assigned a database role.  Contains the same
+   value as the identity part in <xref linkend="system-user"/>, or NULL
+   for background processes.
+  
+ 
+
  
   
application_name text
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 849a03e4b6..269ab4e586 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -1529,6 +1529,9 @@ ParallelWorkerMain(Datum main_arg)
 	/* Attach to the leader's serializable transaction, if SERIALIZABLE. */
 	AttachSerializableXact(fps->serializable_xact_handle);
 
+	/* Report to the stats system that we are started */
+	pgstat_bestart();
+
 	/*
 	 * We've initialized all of our state now; nothing should change
 	 * hereafter.
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e43e36f5ac..b68c382de1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -873,6 +873,8 @@ CREATE VIEW pg_stat_activity AS
 S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
+S.auth_method,
+S.auth_identity,
 S.application_name,
 S.client_addr,
 S.client_hostname,
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 9bbdc4beb0..6621969154 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -625,9 +625,20 @@ ClientAuthentication(Port *port)
 			status = CheckRADIUSAuth(port);
 			break;
 		case uaCert:
-			/* uaCert will be treated as if clientcert=verify-full (uaTrust) */
+
+			/*
+			 * uaCert will be treate

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-18 Thread jian he
Hi.
patch refactored based on "on_error {stop|ignore}"
doc changes:

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 85881ca0..c30baec1 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -43,7 +43,7 @@ COPY { table_name [ ( column_name [, ...] ) | * }
 FORCE_NOT_NULL { ( column_name [, ...] ) | * }
 FORCE_NULL { ( column_name [, ...] ) | * }
-SAVE_ERROR_TO 'location'
+ON_ERROR 'error_action'
 ENCODING 'encoding_name'
 
  
@@ -375,20 +375,20 @@ COPY { table_name [ ( 
 

-SAVE_ERROR_TO
+ON_ERROR
 
  
-  Specifies to save error information to 
-  location when there is malformed data in the input.
-  Currently, only error (default) and none
+  Specifies which 
+  error_action to perform when there is malformed data in the input.
+  Currently, only stop (default) and ignore
   values are supported.
-  If the error value is specified,
+  If the stop value is specified,
   COPY stops operation at the first error.
-  If the none value is specified,
+  If the ignore value is specified,
   COPY skips malformed data and continues copying data.
   The option is allowed only in COPY FROM.
-  The none value is allowed only when
-  not using binary format.
+  Only stop value is allowed only when
+  using binary format.
  
 

@@ -577,7 +577,7 @@ COPY count
 

 COPY stops operation at the first error when
-SAVE_ERROR_TO is not specified. This
+ON_ERROR is not specified. This
 should not lead to problems in the event of a COPY
 TO, but the target table will already have received
 earlier rows in a COPY FROM. These rows will not
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index c36d7f1d..cc0786c6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -395,39 +395,39 @@ defGetCopyHeaderChoice(DefElem *def, bool is_from)
 }
 
 /*
- * Extract a CopySaveErrorToChoice value from a DefElem.
+ * Extract a CopyOnErrorChoice value from a DefElem.
  */
-static CopySaveErrorToChoice
-defGetCopySaveErrorToChoice(DefElem *def, ParseState *pstate, bool is_from)
+static CopyOnErrorChoice
+defGetCopyOnErrorChoice(DefElem *def, ParseState *pstate, bool is_from)
 {
 	char	   *sval;
 
 	if (!is_from)
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("COPY SAVE_ERROR_TO cannot be used with COPY TO"),
+ errmsg("COPY ON_ERROR cannot be used with COPY TO"),
  parser_errposition(pstate, def->location)));
 
 	/*
 	 * If no parameter value given, assume the default value.
 	 */
 	if (def->arg == NULL)
-		return COPY_SAVE_ERROR_TO_ERROR;
+		return COPY_ON_ERROR_STOP;
 
 	/*
-	 * Allow "error", or "none" values.
+	 * Allow "stop", or "ignore" values.
 	 */
 	sval = defGetString(def);
-	if (pg_strcasecmp(sval, "error") == 0)
-		return COPY_SAVE_ERROR_TO_ERROR;
-	if (pg_strcasecmp(sval, "none") == 0)
-		return COPY_SAVE_ERROR_TO_NONE;
+	if (pg_strcasecmp(sval, "stop") == 0)
+		return COPY_ON_ERROR_STOP;
+	if (pg_strcasecmp(sval, "ignore") == 0)
+		return COPY_ON_ERROR_IGNORE;
 
 	ereport(ERROR,
 			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-			 errmsg("COPY save_error_to \"%s\" not recognized", sval),
+			 errmsg("COPY ON_ERROR \"%s\" not recognized", sval),
 			 parser_errposition(pstate, def->location)));
-	return COPY_SAVE_ERROR_TO_ERROR;	/* keep compiler quiet */
+	return COPY_ON_ERROR_STOP;	/* keep compiler quiet */
 }
 
 /*
@@ -455,7 +455,7 @@ ProcessCopyOptions(ParseState *pstate,
 	bool		format_specified = false;
 	bool		freeze_specified = false;
 	bool		header_specified = false;
-	bool		save_error_to_specified = false;
+	bool		on_error_specified = false;
 	

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Melanie Plageman
On Wed, Jan 17, 2024 at 5:33 PM Melanie Plageman wrote:
>
> This does mean that something is not quite right with 0001 as well as
> 0004. We'd end up checking if we are at 8GB much more often. I should
> probably find a way to replicate the cadence on master.

I believe I've done this in attached v10.

- Melanie
From 23fe74086fadf6ed4c7697e7b35b3ff9e0a5f87a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 16 Jan 2024 17:28:07 -0500
Subject: [PATCH v10 4/4] Combine FSM updates for prune and no prune cases

The goal is to update the FSM once for each page vacuumed. The logic for
ensuring this is the same for lazy_scan_prune() and lazy_scan_noprune().
Combine those FSM updates. While doing this, de-indent the logic for
calling lazy_scan_noprune() and the first invocation of
lazy_scan_new_or_empty(). With these changes, it is easier to see that
lazy_scan_new_or_empty() is called once before invoking
lazy_scan_noprune() and lazy_scan_prune().

Discussion: https://postgr.es/m/CA%2BTgmoaLTvipm%3Dxx4rJLr07m908PCu%3DQH3uCjD1UOn8YaEuO2g%40mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 89 +++-
 1 file changed, 33 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 32631d27c5..703136413f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -838,6 +838,7 @@ lazy_scan_heap(LVRelState *vacrel)
 		Buffer		buf;
 		Page		page;
 		bool		all_visible_according_to_vm;
+		bool		got_cleanup_lock = false;
 
 		if (blkno == next_unskippable_block)
 		{
@@ -940,54 +941,28 @@ lazy_scan_heap(LVRelState *vacrel)
 		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
  vacrel->bstrategy);
 		page = BufferGetPage(buf);
-		if (!ConditionalLockBufferForCleanup(buf))
-		{
-			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
-			/* Check for new or empty pages before lazy_scan_noprune call */
-			if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, true,
-	   vmbuffer))
-			{
-/* Processed as new/empty page (lock and pin released) */
-continue;
-			}
+		got_cleanup_lock = ConditionalLockBufferForCleanup(buf);
 
-			/*
-			 * Collect LP_DEAD items in dead_items array, count tuples,
-			 * determine if rel truncation is safe
-			 */
-			if (lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
-			{
-Size		freespace = 0;
-bool		recordfreespace;
+		if (!got_cleanup_lock)
+			LockBuffer(buf, BUFFER_LOCK_SHARE);
 
-/*
- * We processed the page successfully (without a cleanup
- * lock).
- *
- * Update the FSM, just as we would in the case where
- * lazy_scan_prune() is called. Our goal is to update the
- * freespace map the last time we touch the page. If the
- * relation has no indexes, or if index vacuuming is disabled,
- * there will be no second heap pass; if this particular page
- * has no dead items, the second heap pass will not touch this
- * page. So, in those cases, update the FSM now.
- *
- * After a call to lazy_scan_prune(), we would also try to
- * adjust the page-level all-visible bit and the visibility
- * map, but we skip that step in this path.
- */
-recordfreespace = vacrel->nindexes == 0
-	|| !vacrel->do_index_vacuuming
-	|| !has_lpdead_items;
-if (recordfreespace)
-	freespace = PageGetHeapFreeSpace(page);
-UnlockReleaseBuffer(buf);
-if (recordfreespace)
-	RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
-continue;
-			}
+		/* Check for new or empty pages before lazy_scan_[no]prune call */
+		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, !got_cleanup_lock,
+   vmbuffer))
+		{
+			/* Processed as new/empty page (lock and pin released) */
+			continue;
+		}
 
+		/*
+		 * Collect LP_DEAD items in dead_items array, count tuples, determine
+		 * if rel truncation is safe. If lazy_scan_noprune() returns false, we
+		 * must get a cleanup lock and call lazy_scan_prune().
+		 */
+		if (!got_cleanup_lock &&
+			!lazy_scan_noprune(vacrel, buf, blkno, page, &has_lpdead_items))
+		{
 			/*
 			 * lazy_scan_noprune could not do all required processing.  Wait
 			 * for a cleanup lock, and call lazy_scan_prune in the usual way.
@@ -995,13 +970,7 @@ lazy_scan_heap(LVRelState *vacrel)
 			Assert(vacrel->aggressive);
 			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 			LockBufferForCleanup(buf);
-		}
-
-		/* Check for new or empty pages before lazy_scan_prune call */
-		if (lazy_scan_new_or_empty(vacrel, buf, blkno, page, false, vmbuffer))
-		{
-			/* Processed as new/empty page (lock and pin released) */
-			continue;
+			got_cleanup_lock = true;
 		}
 
 		/*
@@ -1014,9 +983,10 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * tuple headers of remaining items with storage. It also determines
 		 * if truncating this block is safe.
 		 */
-		lazy_scan_prune(vacrel, buf, blkno, page,
-		vmbuffer, all_visible_according_to_vm

Re: cleanup patches for incremental backup

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 4:50 AM Matthias van de Meent wrote:
> Sure, that's fine with me.

OK, committed that way.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UUID v7

2024-01-18 Thread Aleksander Alekseev
Hi,

> Another question: how did you choose between using TimestampTz and
> Timestamp types? I realize that internally it's all the same. Maybe
> Timestamp will be slightly better since the way it is displayed
> doesn't depend on the session settings. Many people I talked to find
> this part of TimestampTz confusing.
>
> timestamptz internally always stores UTC.
> I believe that in SQL, when operating with time in UTC, you should always use 
> timestamptz.
> timestamp is theoretically the same thing. But internally it does not convert
> time to UTC, which will lead to incorrect use.

No.

Timestamp and TimestampTz are absolutely the same thing. The only
difference is how they are shown to the user. TimestampTz uses session
context in order to be displayed in the TZ chosen by the user. Thus
typically it is somewhat more confusing to the users and thus I asked
whether there was a good reason to choose TimestampTz over Timestamp.

-- 
Best regards,
Aleksander Alekseev




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-01-18 Thread Bertrand Drouvot
Hi,

On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.

Thanks for looking at it!

Yeah, and that could lead to no conflict detected anymore (like in the
case [2] reported up-thread).

> FWIW, a test code something like [1], can help detect above race issues, 
> right?

I think so and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!

> Some comments on the patch:
> 
> 1.
>  last_signaled_pid = active_pid;
> +terminated = true;
>  }
> 
> Why is a separate variable needed? Can't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?

Yeah, probably. I thought about it but preferred to add a new variable for this
purpose, for clarity and to avoid race conditions (in case future patches "touch"
last_signaled_pid anyhow). I was thinking that this part of the code is already
not that easy.

> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?

I'm not sure as I never saw it but it should not hurt to also consider this
"potential" case so it's done in v2 attached.

> I'm not sure if it
> can happen at all, but I think we can rely on previous conflict reason
> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

I'm not sure what you mean here. I think we should still keep the "initial" LSN
so that the next check done with it still makes sense. The previous conflict
reason, as you're proposing, also makes sense to me, but for another reason: to
PANIC in case the issue still happens (for cases we did not think about, i.e.
ones not covered by the previous LSNs we now record).

> 3. Is there a way to reproduce this racy issue, may be by adding some
> sleeps or something? If yes, can you share it here, just for the
> records and verify the whatever fix provided actually works?

Alexander was able to reproduce it on a slow machine and the issue was not there
anymore with v1 in place. I think it's tricky to reproduce as it would need the
slot to advance between the 2 checks.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 4665945cba163bcb4beadc6391ee65c755e566d8 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 11 Jan 2024 13:57:53 +
Subject: [PATCH v2] Fix race condition in InvalidatePossiblyObsoleteSlot()

In case of an active slot we proceed in 2 steps:
- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two we release the mutex on the slot, which
means the effective_xmin and effective_catalog_xmin could advance during that time.

Holding the mutex longer is not an option (given what we'd to do while holding it)
so record the previous LSNs instead that were used during the backend termination.
---
 src/backend/replication/slot.c | 37 --
 1 file changed, 31 insertions(+), 6 deletions(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..f136916285 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,11 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	bool		terminated = false;
+	XLogRecPtr	initial_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_catalog_effective_xmin = InvalidXLogRecPtr;
+	XLogRecPtr	initial_restart_lsn = InvalidXLogRecPtr;
+	ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1386,11 +1391,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 		 */
 		if (s->data.invalidated == RS_INVAL_NONE)
 		{
+			if (!terminated)
+			{
+initial_restart_lsn = s->data.restart_lsn;
+initial_effective_xmin = s->effective_xmin;
+initial_catalog_effective_xmin = s->effective_catalog_xmin;
+			}
+
 			switch (cause)
 			{
 case RS_INVAL_WAL_REMOVED:
-	if (s->data.restart_lsn != InvalidXLogRecPtr &&
-		s->data.restart_lsn < oldestLSN)
+	if (initial_restart_lsn != InvalidXLogRecPtr &&
+		initial_restart_lsn < oldestLSN)
 		conflict = cause;
 	break;
 case RS_INVAL_HORIZON:
@@ -1399,12 +1411,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 	/* invalid DB oid signals a shared relation */
 	if (dboid != InvalidOid && dboid != s->data.database)
 		break;
-	if (TransactionIdIsValid(s->effective_xmin) &&
-		TransactionIdPrecedesOrEquals(s->effective_xmin,
+	if (TransactionIdIsValid(initial_eff

Re: POC: GROUP BY optimization

2024-01-18 Thread Andrei Lepikhov

Just the forgotten second attachment.

--
regards,
Andrei Lepikhov
Postgres Professional
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 1095b73dac..b612420547 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -432,6 +432,21 @@ group_keys_reorder_by_pathkeys(List *pathkeys, List **group_pathkeys,
 	return n;
 }
 
+static bool
+duplicated_pathkey_combination(List *infos, List *pathkeys)
+{
+	ListCell   *lc;
+
+	foreach(lc, infos)
+	{
+		PathKeyInfo *info = lfirst_node(PathKeyInfo, lc);
+
+		if (compare_pathkeys(pathkeys, info->pathkeys) == PATHKEYS_EQUAL)
+			return true;
+	}
+	return false;
+}
+
 /*
  * get_useful_group_keys_orderings
  *		Determine which orderings of GROUP BY keys are potentially interesting.
@@ -491,7 +506,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
 
 		if (n > 0 &&
 			(enable_incremental_sort || n == root->num_groupby_pathkeys) &&
-			compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL)
+			!duplicated_pathkey_combination(infos, pathkeys))
 		{
 			info = makeNode(PathKeyInfo);
 			info->pathkeys = pathkeys;
@@ -514,8 +529,9 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path *path)
 											 &clauses,
 											 root->num_groupby_pathkeys);
 
-		if (n > 0 && compare_pathkeys(pathkeys, root->group_pathkeys) != PATHKEYS_EQUAL &&
-			(enable_incremental_sort || n == list_length(root->sort_pathkeys)))
+		if (n > 0 &&
+			(enable_incremental_sort || n == list_length(root->sort_pathkeys)) &&
+			!duplicated_pathkey_combination(infos, pathkeys))
 		{
 			info = makeNode(PathKeyInfo);
 			info->pathkeys = pathkeys;
info->pathkeys = pathkeys;


Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Robert Haas
On Thu, Jan 18, 2024 at 7:56 AM Andy Fan  wrote:
> Do you still have interest in making some progress on this topic?

Some, but it's definitely not my top priority. I wish I could give as
much attention as everyone would like to everyone's patches, but I
can't even come close.

I think that the stack that you set up in START_SPIN_LOCK() is crazy.
That would only be necessary if it were legal to acquire multiple
spinlocks at the same time, which it definitely isn't. Also, doing
memory allocation and deallocation here appears highly undesirable -
even if we did need to support multiple spinlocks, it would be better
to handle this using something like the approach we already use for
lwlocks, where there is a fixed size array and we blow up if it
overflows.
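
To make that concrete, here is a minimal sketch of the fixed-size-array
approach, modeled on lwlock.c's held_lwlocks[]; the names and the overflow
handling here are illustrative, not taken from the patch:

/* Hypothetical sketch, not the patch's code: track held spinlocks in a
 * small fixed-size array, as lwlock.c does for LWLocks, instead of
 * allocating anything at acquire time. */
#define MAX_SIMUL_SPINLOCKS 4	/* deliberately tiny: nesting is illegal */

typedef struct SpinLockTrace
{
	const char *file;			/* acquisition site, for error reports */
	int			line;
} SpinLockTrace;

static SpinLockTrace held_spinlocks[MAX_SIMUL_SPINLOCKS];
static int	num_held_spinlocks = 0;

static inline void
remember_spinlock(const char *file, int line)
{
	if (num_held_spinlocks >= MAX_SIMUL_SPINLOCKS)
		elog(PANIC, "too many spinlocks taken");	/* blow up on overflow */
	held_spinlocks[num_held_spinlocks].file = file;
	held_spinlocks[num_held_spinlocks].line = line;
	num_held_spinlocks++;
}

static inline void
forget_spinlock(void)
{
	Assert(num_held_spinlocks > 0);
	num_held_spinlocks--;
}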

ASSERT_NO_SPIN_LOCK() looks odd, because I would expect it to be
spelled Assert(!AnySpinLockHeld()). But looking deeper, I see that it
doesn't internally Assert() but rather does something else. Maybe the
name needs some workshopping. SpinLockMustNotBeHeldHere()?
VerifyNoSpinLocksHeld()?

I think we should check that no spinlock is held in a few additional
places: the start of SpinLockAcquire(), and the start of errstart().

You added an #include to dynahash.c despite making no other changes to the file.

I don't know whether the choice to treat buffer header locks as
spinlocks is correct. It seems like it at least deserves a comment,
and possibly some discussion on this mailing list about whether that's
the right idea. I'm not sure that we have all the same restrictions
for buffer header locks as we do for spinlocks in general, but I'm
also not sure that we don't.

On a related note, the patch overall has 0 comments. I don't know that
it needs a lot, but 0 isn't many at all.

miscadmin.h doesn't seem like a good place for this. It's a
widely-included header file and these checks should be needed in
relatively few places; also, they're not really related to most of
what's in that file, IIRC. I also wonder why we're using macros
instead of static inline functions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-18 Thread Robert Haas
On Wed, Jan 17, 2024 at 4:31 PM Peter Geoghegan  wrote:
> Actually, I suppose that we couldn't apply it independently of
> nindexes==0. Then we'd call FreeSpaceMapVacuumRange() before our
> second pass over the heap takes place for those LP_DEAD-containing
> heap pages scanned since the last round of index/heap vacuuming took
> place (or since VACUUM began). We need to make sure that the FSM has
> the most recent possible information known to VACUUM, which would
> break if we applied VACUUM_FSM_EVERY_PAGES rules when nindexes > 0.
>
> Even still, the design of VACUUM_FSM_EVERY_PAGES seems questionable to me.

I agree with all of this. I thought I'd said all of this, actually, in
my prior email, but perhaps it wasn't as clear as it needed to be.

But I also said one more thing that I'd still like to hear your
thoughts about, which is: why is it right to update the FSM after the
second heap pass rather than the first one? I can't help but suspect
this is an algorithmic holdover from pre-HOT days, when VACUUM's first
heap pass was read-only and all the work happened in the second pass.
Now, nearly all of the free space that will ever become free becomes
free in the first pass, so why not advertise it then, instead of
waiting?

Admittedly, HOT is not yet 15 years old, so maybe it's too soon to
adapt our VACUUM algorithm for it. *wink*

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Network failure may prevent promotion

2024-01-18 Thread Heikki Linnakangas

On 18/01/2024 10:26, Kyotaro Horiguchi wrote:
> At Sun, 31 Dec 2023 20:07:41 +0900 (JST), Kyotaro Horiguchi  wrote in
>> We've noticed that when walreceiver is waiting for a connection to
>> complete, standby does not immediately respond to promotion
>> requests. In PG14, upon receiving a promotion request, walreceiver
>> terminates instantly, but in PG16, it waits for connection
>> timeout. This behavior is attributed to commit 728f86fec65, where a
>> part of libpqrcv_connect was simply replaced with a call to
>> libpqsrv_connect_params. This behavior can be verified by simply
>> dropping packets from the standby to the primary.
>
> Apologies for the inconvenience on my part, but I need to fix this
> behavior. To continue this discussion, I'm providing a repro script
> here.

Thanks for the script; I can repro this with it.

Given that commit 728f86fec6 that introduced this issue was not strictly 
required, perhaps we should just revert it for v16.


In your patch, there's one more stray reference to 
ProcessWalRcvInterrupts() in the comment above libpqrcv_PQexec. That 
makes me wonder, why didn't commit 728f86fec6 go all the way and also 
replace libpqrcv_PQexec and libpqrcv_PQgetResult with libpqsrv_exec and 
libpqsrv_get_result?


--
Heikki Linnakangas
Neon (https://neon.tech)





Fix incorrect format placeholders in walreceiver.c

2024-01-18 Thread Yongtao Huang
Hi all,

I think the correct placeholder for the variable *startoff* should be *%d*.
Thanks for your time.

Best

Yongtao Huang


0001-Fix-incorrect-format-placeholders-in-walreceiver.c.patch
Description: Binary data


Re: UUID v7

2024-01-18 Thread Andrey Borodin


> On 17 Jan 2024, at 02:19, Jelte Fennema-Nio  wrote:

I want to ask Kyzer or Brad; I hope they will see this message. I'm working on
the patch for time extraction for v1, v6 and v7.

Do I understand correctly that UUIDs contain local time, not UTC time? In the
examples in [0], section "A.6. Example of a UUIDv7 Value" shows that February
22, 2022 2:22:22.00 PM GMT-05:00 results in unix_ts_ms = 0x017F22E279B0, which
looks to me like local time, not UTC.
Is it intentional? Section "5.1. UUID Version 1" states otherwise.

If so, I should swap the signatures of the functions from TimestampTz to
Timestamp. I'm hard-coding examples from this standard into the tests, so I
want to be precise...
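
For what it's worth, here is a self-contained sketch (mine, not the patch's
code) that decodes the 48-bit unix_ts_ms field of the draft's A.6 example
value; 0x017F22E279B0 is 1645557742000 ms since the Unix epoch, i.e.
2022-02-22 19:22:22 UTC, which is 2:22:22 PM at GMT-05:00:

#include <stdint.h>
#include <stdio.h>

/* Sketch only: extract the 48-bit big-endian unix_ts_ms field from the
 * first 6 bytes of a UUIDv7, per the draft's layout. */
static uint64_t
uuidv7_unix_ts_ms(const unsigned char uuid[16])
{
	uint64_t	ms = 0;

	for (int i = 0; i < 6; i++)
		ms = (ms << 8) | uuid[i];
	return ms;
}

int
main(void)
{
	/* 017F22E2-79B0-7CC3-98C4-DC0C0C07398F, the draft's A.6 example */
	const unsigned char uuid[16] = {
		0x01, 0x7F, 0x22, 0xE2, 0x79, 0xB0, 0x7C, 0xC3,
		0x98, 0xC4, 0xDC, 0x0C, 0x0C, 0x07, 0x39, 0x8F
	};

	printf("unix_ts_ms = %llu\n",
		   (unsigned long long) uuidv7_unix_ts_ms(uuid));
	return 0;
}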

If I follow the standard I see this in tests:
+-- extract UUID v1, v6 and v7 timestamp
+SELECT uuid_extract_time('C232AB00-9414-11EC-B3C8-9F6BDECED846') at time zone 'GMT-05';
+         timezone
+---------------------------
+ Wed Feb 23 00:22:22 2022
+(1 row)

Current patch version attached. I've addressed all other requests: function
renames, aliases, multiple functions instead of optional params, cleaner
catalog definitions, and not throwing an error when the [var,ver,time] value
is unknown. What is left: deal with timezones and improve documentation.


Best regards, Andrey Borodin.

[0] 
https://datatracker.ietf.org/doc/html/draft-ietf-uuidrev-rfc4122bis#name-example-of-a-uuidv1-value


v10-0001-Implement-UUID-v7-as-per-IETF-draft.patch
Description: Binary data


Re: the s_lock_stuck on perform_spin_delay

2024-01-18 Thread Andy Fan

Hi Matthias / Robert:

Do you still have interest in making some progress on this topic?

> Robert Haas  writes:
>
>> On Wed, Jan 10, 2024 at 10:17 PM Andy Fan  wrote:
>>> fixed in v2.
>>
>> Timing the spinlock wait seems like a separate patch from the new sanity 
>> checks.
>
> Yes, a separate patch would be better, so removed it from v4.
>
>> I suspect that the new sanity checks should only be armed in
>> assert-enabled builds.
>
> There are 2 changes in v4. a) Make sure all the code is only armed in
> assert-enabled builds; previously there was some counter++ in
> non-assert-enabled builds. b) Record the location of the spin lock so that
> whenever the Assert fails, we know which spin lock it is. In our
> internal testing, that helps a lot.

v5 attached to fix the linking issue on Windows.

-- 
Best Regards
Andy Fan

>From d52d7d9e5e826c4e0c0cf6d0eb5cb72a26a37de5 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 15 Jan 2024 13:18:34 +0800
Subject: [PATCH v5 1/1] Detect more misuse of spin lock automatically

Spin locks are intended for *very* short-term locks, but they can be
misused in many ways, e.g. by acquiring LWLocks or regular locks, or
allocating memory, while holding one. All such cases can increase the
time a spin lock is held. By our convention, if a spin lock cannot be
acquired within a certain time, a crash should happen; see
s_lock_stuck in perform_spin_delay.

CHECK_FOR_INTERRUPTS should be avoided as well while holding a spin
lock. Depending on which signals are left to handle, PG may raise
ERROR/FATAL, which would make the code jump somewhere else that is
unlikely to release the spin lock. Because of this, elog(...) is not
allowed either, since it calls CHECK_FOR_INTERRUPTS.
---
 src/backend/access/table/tableam.c  |  1 +
 src/backend/storage/buffer/bufmgr.c |  1 +
 src/backend/storage/ipc/barrier.c   |  1 +
 src/backend/storage/ipc/shm_toc.c   |  1 +
 src/backend/storage/lmgr/lock.c |  2 ++
 src/backend/storage/lmgr/lwlock.c   |  2 ++
 src/backend/utils/hash/dynahash.c   |  1 +
 src/backend/utils/init/globals.c|  1 +
 src/backend/utils/mmgr/mcxt.c   |  8 +
 src/include/miscadmin.h | 56 +
 src/include/storage/buf_internals.h |  1 +
 src/include/storage/spin.h  | 16 +++--
 src/tools/pgindent/typedefs.list|  2 ++
 13 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a..6c6a65764c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -24,6 +24,7 @@
 #include "access/syncscan.h"
 #include "access/tableam.h"
 #include "access/xact.h"
+#include "miscadmin.h"
 #include "optimizer/plancat.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7d601bef6d..c600a113cf 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5409,6 +5409,7 @@ LockBufHdr(BufferDesc *desc)
 
 	init_local_spin_delay(&delayStatus);
 
+	START_SPIN_LOCK();
 	while (true)
 	{
 		/* set BM_LOCKED flag */
diff --git a/src/backend/storage/ipc/barrier.c b/src/backend/storage/ipc/barrier.c
index 5b52e060ba..63167a04bb 100644
--- a/src/backend/storage/ipc/barrier.c
+++ b/src/backend/storage/ipc/barrier.c
@@ -85,6 +85,7 @@
 
 #include "postgres.h"
 #include "storage/barrier.h"
+#include "miscadmin.h"
 
 static inline bool BarrierDetachImpl(Barrier *barrier, bool arrive);
 
diff --git a/src/backend/storage/ipc/shm_toc.c b/src/backend/storage/ipc/shm_toc.c
index 8db9d25aac..9cdec41054 100644
--- a/src/backend/storage/ipc/shm_toc.c
+++ b/src/backend/storage/ipc/shm_toc.c
@@ -13,6 +13,7 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
 #include "port/atomics.h"
 #include "storage/shm_toc.h"
 #include "storage/spin.h"
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index c70a1adb9a..3b069b233a 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -776,6 +776,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	bool		found_conflict;
 	bool		log_lock = false;
 
+	ASSERT_NO_SPIN_LOCK();
+
 	if (lockmethodid <= 0 || lockmethodid >= lengthof(LockMethods))
 		elog(ERROR, "unrecognized lock method: %d", lockmethodid);
 	lockMethodTable = LockMethods[lockmethodid];
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index b4b989ac56..6ef753f670 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1205,6 +1205,8 @@ LWLockAcquire(LWLock *lock, LWLockMode mode)
 
 	Assert(mode == LW_SHARED || mode == LW_EXCLUSIVE);
 
+	ASSERT_NO_SPIN_LOCK();
+
 	PRINT_LWDEBUG("LWLockAcquire", lock, mode);
 
 #ifdef LWLOCK_STATS
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index a4152080b5..504dae4d6c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c

RE: Synchronizing slots from primary to standby

2024-01-18 Thread Zhijie Hou (Fujitsu)
On Wednesday, January 17, 2024 6:30 PM shveta malik  
wrote:
> PFA v63.

I analyzed the security of the slotsync worker and the replication connection
a bit, and didn't find any issues. Here are the details:

1) data security

First, we are using the role specified in primary_conninfo; the role used here
is required to have the REPLICATION or SUPERUSER privilege[1], which means it
is reasonable for the role to modify and read replication slots on the primary.

On the primary, the slotsync worker only queries the pg_replication_slots view,
which doesn't involve any system or user table access, so I think it's safe.

On the standby server, the slotsync worker will not read/write any user table
either, so we don't have the risk of executing arbitrary code in triggers.

2) privilege check

The SQL queries of the slotsync worker are subject to the usual privilege
checks on the primary. If I revoke the function execution privilege on
pg_get_replication_slots from the replication user, then the slotsync worker
is no longer able to query the pg_replication_slots view. The same is true for
the pg_is_in_recovery function. The slotsync worker keeps reporting an ERROR
after the revoke, which is as expected.

Based on the above, I didn't find any security issues with the slotsync worker.

[1] 
https://www.postgresql.org/docs/16/runtime-config-replication.html#GUC-PRIMARY-CONNINFO

Best Regards,
Hou zj


Re: Built-in CTYPE provider

2024-01-18 Thread Peter Eisentraut

On 12.01.24 03:02, Jeff Davis wrote:
> New version attached. Changes:
>
>   * Named collation object PG_C_UTF8, which seems like a good idea to
> prevent name conflicts with existing collations. The locale name is
> still C.UTF-8, which still makes sense to me because it matches the
> behavior of the libc locale of the same name so closely.


I am catching up on this thread.  The discussions have been very 
complicated, so maybe I didn't get it all.


The patches look pretty sound, but I'm questioning how useful this 
feature is and where you plan to take it.


Earlier in the thread, the aim was summarized as

> If the Postgres default was bytewise sorting+locale-agnostic
> ctype functions directly derived from Unicode data files,
> as opposed to libc/$LANG at initdb time, the main
> annoyance would be that "ORDER BY textcol" would no
> longer be the human-favored sort.

I think that would be a terrible direction to take, because it would 
regress the default sort order from "correct" to "useless".  Aside from 
the overall message this sends about how PostgreSQL cares about locales 
and Unicode and such.


Maybe you don't intend for this to be the default provider?  But then 
who would really use it?  I mean, sure, some people would, but how would 
you even explain, in practice, the particular niche of users or use cases?


Maybe if this new provider were called "minimal", that would describe 
the purpose better.


I could see a use for this builtin provider if it also included the 
default UCA collation (what COLLATE UNICODE does now).  Then it would 
provide a "common" default behavior out of the box, and if you want more 
fine-tuning, you can go to ICU.  There would still be some questions 
about making sure the builtin behavior and the ICU behavior are 
consistent (different Unicode versions, stock UCA vs CLDR, etc.).  But 
for practical purposes, it might work.


There would still be a risk with that approach, since it would 
permanently marginalize ICU functionality, in the sense that only some 
locales would need ICU, and so we might not pay the same amount of 
attention to the ICU functionality.


I would be curious what your overall vision is here?  Is switching the 
default to ICU still your goal?  Or do you want the builtin provider to 
be the default?  Or something else?






039_end_of_wal: error in "xl_tot_len zero" test

2024-01-18 Thread Anton Voloshin

Hello, hackers,

I believe there is a small problem in 039_end_of_wal.pl's 
"xl_tot_len zero" test. It supposes that after an immediate shutdown the 
server, upon startup recovery, should always produce a message matching 
"invalid record length at .*: wanted 24, got 0". However, if the 
circumstances are just right and we happen to hit exactly the edge 
of a WAL page, then the message on startup recovery would instead be 
"invalid magic number  in log segment .*, offset .*". The test does not 
take that into account.


Now, reproducing this is somewhat tricky, because the exact position in WAL 
at test time depends on what exactly initdb did, and that not only 
differs between major versions, but also depends on e.g. username 
length, locales available, and perhaps more. Even though originally 
this problem was found "in the wild" on one particular system on one 
particular code branch, I've written a small helper patch to make 
reproduction on master easier, see 
0001-repro-for-039_end_of_wal-s-problem-with-page-end.patch.


This patch adds a single emit_message of (hopefully) the right size to 
make sure we hit the end of a WAL block right by the time we call 
$node->stop('immediate') in the "xl_tot_len zero" test. With this patch, 
the "xl_tot_len zero" test fails every time because the server writes 
"invalid magic number  in log segment" while the test still only 
expects "invalid record length at .*: wanted 24, got 0". Of course, this 
0001 patch is *not* meant to be committed, but only as an issue 
reproduction helper.


I can think of two possible fixes:

1. Update advance_out_of_record_splitting_zone to also avoid stopping at
   exactly the block end:

 my $page_offset = $end_lsn % $WAL_BLOCK_SIZE;
-while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold)
+while ($page_offset >= $WAL_BLOCK_SIZE - $page_threshold || $page_offset <= $SizeOfXLogShortPHD)
 {
see 0002-fix-xl_tot_len-zero-test-amend-advance_out_of.patch

We need to compare with $SizeOfXLogShortPHD (and not with zero) because 
at that point, even though we haven't actually written out the new WAL 
page yet, its header is already in place in memory and taken into account 
for LSN reporting (see the header-size reference after option 2 below).

2. Alternatively, amend the "xl_tot_len zero" test to expect the "invalid
   magic number  in WAL segment" message as well:

 $node->start;
 ok( $node->log_contains(
+        "invalid magic number  in WAL segment|" .
         "invalid record length at .*: expected at least 24, got 0", $log_size
     ),
see 0003-alt.fix-for-xl_tot_len-zero-test-accept-invalid.patch
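
For reference, the $SizeOfXLogShortPHD constant used in option 1 mirrors the
backend's existing definitions in src/include/access/xlog_internal.h (quoted
here for context, not new code):

/* Every WAL page starts with a short header, except a segment's first
 * page, which gets the long one; MAXALIGN rounds the 20-byte struct up
 * to 24 on 8-byte-alignment platforms, hence the hardcoded 24 above. */
#define SizeOfXLogShortPHD	MAXALIGN(sizeof(XLogPageHeaderData))
#define SizeOfXLogLongPHD	MAXALIGN(sizeof(XLogLongPageHeaderData))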

I think it makes sense to backport whatever the final change would be to 
all branches with 039_end_of_wal (REL_12+).


Any thoughts?

Anton Voloshin
Postgres Professional, The Russian Postgres Company
https://postgrespro.ru

From 5f12139816f6c1bc7d625ba8007aedb8f5d04a71 Mon Sep 17 00:00:00 2001
From: Anton Voloshin 
Date: Thu, 18 Jan 2024 12:45:12 +0300
Subject: [PATCH 1/3] repro for 039_end_of_wal's problem with page
 end

Tags: commitfest_hotfix
---
 src/test/recovery/t/039_end_of_wal.pl | 20 
 1 file changed, 20 insertions(+)

diff --git a/src/test/recovery/t/039_end_of_wal.pl b/src/test/recovery/t/039_end_of_wal.pl
index f9acc83c7d0..ecf9b6089de 100644
--- a/src/test/recovery/t/039_end_of_wal.pl
+++ b/src/test/recovery/t/039_end_of_wal.pl
@@ -36,6 +36,8 @@ my $WAL_SEGMENT_SIZE;
 my $WAL_BLOCK_SIZE;
 my $TLI;
 
+my $SizeOfXLogShortPHD = 24; # rounded up to 8 bytes
+
 # Build path of a WAL segment.
 sub wal_segment_path
 {
@@ -258,9 +260,27 @@ my $prev_lsn;
 note "Single-page end-of-WAL detection";
 ###
 
+my $lsn = get_insert_lsn($node);
+my $lsn_offset = $lsn % $WAL_BLOCK_SIZE;
+my $empty_msg_size = ( ( ($ENV{EMPTY_MSG_SIZE} || 51) + 7) / 8) * 8;
+my $commit_msg_size = ( (34 + 7) / 8) * 8;
+# empty logical msg and commit message take this many bytes of WAL:
+my $empty_msg_overhead = $empty_msg_size + $commit_msg_size;
+# count the overhead twice to account for the two emit_message calls below:
+# we want to hit the page edge after the second call.
+my $target_lsn_offset = $WAL_BLOCK_SIZE * 2 - $empty_msg_overhead * 2;
+my $msg_size = ($target_lsn_offset - $lsn_offset) % $WAL_BLOCK_SIZE;
+# If we happen to switch to the next WAL block mid-message, reduce the message
+# by $SizeOfXLogShortPHD (minimal page header) to hit the same target.
+if ($lsn_offset + $msg_size >= $WAL_BLOCK_SIZE) { $msg_size -= $SizeOfXLogShortPHD; }
+print "QWE0: $lsn WAL_BLOCK_SIZE='$WAL_BLOCK_SIZE', lsn_offset='$lsn_offset' target_lsn_offset='$target_lsn_offset' msg_size='$msg_size'\n";
+emit_message($node, $msg_size);
+printf "QWE1: %s - after corrective msg\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 # xl_tot_len is 0 (a common case, we hit trailing zeroes).
 emit_message($node, 0);
+printf "QWE2: %s - after zero-length message\n", $node->safe_psql('postgres', "SELECT pg_current_wal_lsn()");
 $end_lsn

Increasing IndexTupleData.t_info from uint16 to uint32

2024-01-18 Thread Montana Low
The overall trend in machine learning embedding sizes has been growing
rapidly over the last few years, from 128 up to 4K dimensions, yielding
additional value and quality improvements. It's not clear when this trend
in growth will ease. The output of the leading text embedding models
now exceeds the index storage available in IndexTupleData.t_info.

The current index tuple size is stored in 13 bits of IndexTupleData.t_info,
which limits the max size of an index tuple to 2^13 = 8192 bytes. Vectors
implemented by pgvector currently use a 32-bit float for elements, which
limits vector size to 2K dimensions, which is no longer state of the art.
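
For context, here is the existing t_info layout from src/include/access/itup.h
that such a patch would widen (quoted for reference, not new code):

/* t_info packs three flag bits plus a 13-bit size into a uint16,
 * capping index tuple sizes at 8191 bytes. */
#define INDEX_SIZE_MASK 0x1FFF
#define INDEX_AM_RESERVED_BIT 0x2000	/* reserved for index-AM specific usage */
#define INDEX_VAR_MASK	0x4000
#define INDEX_NULL_MASK 0x8000

#define IndexTupleSize(itup)	((Size) ((itup)->t_info & INDEX_SIZE_MASK))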

I've attached a patch that increases IndexTupleData.t_info from 16 bits to
32 bits, allowing for significantly larger index tuple sizes. I would guess
this patch is not a complete implementation that allows for migration from
previous versions, but it does compile and initdb succeeds. I'd be happy to
continue work if the core team is receptive to an update in this area, and
I'd appreciate any feedback the community has on the approach.

I imagine it might be worth hiding this change behind a compile time
configuration parameter similar to blocksize. I'm sure there are
implications I'm unaware of with this change, but I wanted to start the
discussion around a bit of code to see how much would actually need to
change.

Also, I believe this is my first mailing list post in a decade or 2, so let
me know if I've missed something important. BTW, thanks for all your work
over the decades!


32bit_index_info.patch
Description: Binary data


Make a Bitset which is resettable

2024-01-18 Thread Andy Fan

Hi,

In [1], David and I talked about a requirement where a user just wants
to unset all the bits in a Bitmapset but keep the allocated memory
for later use. That is impossible with the current Bitmapset, so David
suggested a Bitset struct for this purpose. I'm starting this new thread
so that the original thread can focus on its own purpose.

commit 0ee7e4789e58d6820e4c1ff62979910c0b01cdbb (HEAD -> s_stuck_v2)
Author: yizhi.fzh 
Date:   Thu Jan 18 16:52:30 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for a variable number of bits, Bitset is
designed for a fixed number of bits; the fixed length must be specified at
bitset_init time and stays unchanged for the whole lifespan. Because
of this, some operations on a Bitset are simpler than on a Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons.

Also, for performance, the Bitset functions drop some unlikely checks
in favor of Asserts.


[1]
https://www.postgresql.org/message-id/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
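
For illustration, here is a hypothetical usage sketch of the proposed API (the
function names come from the patch below; the fragment itself is mine, not
part of the patch):

/* Hypothetical usage of the proposed Bitset API: capacity is chosen
 * once, bits are cleared repeatedly without freeing or reallocating. */
Bitset	   *bs = bitset_init(1024);	/* capacity fixed at 1024 bits */

bitset_add_member(bs, 42);
Assert(bitset_is_member(bs, 42));

bitset_clear(bs);				/* all bits unset, memory retained */
Assert(!bitset_is_member(bs, 42));

bitset_free(bs);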

Any feedback is welcome.

-- 
Best Regards
Andy Fan

>From 0ee7e4789e58d6820e4c1ff62979910c0b01cdbb Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 18 Jan 2024 16:52:30 +0800
Subject: [PATCH v1 1/1] Introduce a Bitset data struct.

While Bitmapset is designed for a variable number of bits, Bitset is
designed for a fixed number of bits; the fixed length must be specified at
bitset_init time and stays unchanged for the whole lifespan. Because
of this, some operations on a Bitset are simpler than on a Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons.

Also, for performance, the Bitset functions drop some unlikely checks
in favor of Asserts.
---
 src/backend/nodes/bitmapset.c| 180 +--
 src/include/nodes/bitmapset.h|  27 +
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 199 insertions(+), 9 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index f4b61085be..ade77afbae 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1224,23 +1224,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(a == NULL || IsA(a, Bitmapset));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1257,9 +1252,21 @@ bms_next_member(const Bitmapset *a, int prevbit)
 		/* in subsequent words, consider all bits */
 		mask = (~(bitmapword) 0);
 	}
+
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1365,3 +1372,158 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 	  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset; the size is rounded up to whole bitmapwords.
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	Assert(size > 0);
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	Assert(a != NULL);
+	memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	Assert(a != NULL);
+	pfree(a);
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	a->words[wordnum] &= ~((bitmapword) 1 << bitnum);
+}
+
+int
+bitset_is_member(Bitset *a, int x)
+{
+	int			wordnum,
+				bitnum;
+
+	/* used in expression engine */
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	return (a->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;
+}

Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-18 Thread torikoshia

On 2024-01-18 16:59, Alexander Korotkov wrote:
On Thu, Jan 18, 2024 at 4:16 AM torikoshia  
wrote:

On 2024-01-18 10:10, jian he wrote:
> On Thu, Jan 18, 2024 at 8:57 AM Masahiko Sawada 
> wrote:
>> On Thu, Jan 18, 2024 at 6:38 AM Tom Lane  wrote:
>> > Kyotaro-san's suggestion isn't bad, though I might shorten it to
>> > error_action {error|ignore|log} (or perhaps "stop" instead of "error")?
>> > You will need a separate parameter anyway to specify the destination
>> > of "log", unless "none" became an illegal table name when I wasn't
>> > looking.  I don't buy that one parameter that has some special values
>> > while other values could be names will be a good design.  Moreover,
>> > what if we want to support (say) log-to-file along with log-to-table?
>> > Trying to distinguish a file name from a table name without any other
>> > context seems impossible.
>>
>> I've been thinking we can add more values to this option to log errors
>> not only to the server logs but also to the error table (not sure
>> details but I imagined an error table is created for each table on
>> error), without an additional option for the destination name. The
>> values would be like error_action {error|ignore|save-logs|save-table}.
>>
>
> another idea:
> on_error {error|ignore|other_future_option}
> if not specified then by default ERROR.
> You can also specify ERROR or IGNORE for now.
>
> I agree, the parameter "error_action" is better than "location".

I'm not sure whether error_action or on_error is better, but either way
"error_action error" and "on_error error" seem a bit odd to me.
I feel "stop" is better for both cases, as Tom suggested.


OK.  What about this?
on_error {stop|ignore|other_future_option}
where other_future_option might be compound like "file 'copy.log'" or
"table 'copy_log'".


Thanks, also +1 from me.

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Strange Bitmapset manipulation in DiscreteKnapsack()

2024-01-18 Thread Andy Fan


Hi,

David Rowley  writes:

> Given that the original code was written in a very deliberate way
> to avoid reallocations, I now think that we should maintain that.
>
> I've attached a patch which adds bms_replace_members(). It's basically
> like bms_copy() but attempts to reuse the member from another set. I
> considered if the new function should be called bms_copy_inplace(),
> but left it as bms_replace_members() for now.

I find the following code in DiscreteKnapsack() weird.


for (i = 0; i <= max_weight; ++i)
{
values[i] = 0;

** memory allocation here, and the num_items bit is removed later **

sets[i] = bms_make_singleton(num_items);
}


** num_items bit is removed here **
result = bms_del_member(bms_copy(sets[max_weight]), num_items);

I can't access github.com right now so I can't test my idea, but basically
I think we may need some improvement here, like setting 'sets[i] = NULL;' at
the start and removing the bms_del_member() call later.
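
For concreteness, a sketch of that idea (untested, as noted above; it assumes
the rest of DiscreteKnapsack treats a NULL Bitmapset as the empty set, and it
gives up the deliberate pre-allocation mentioned in the quoted reply):

	/* Sketch of the suggestion, not tested: start from NULL (the
	 * canonical empty Bitmapset) instead of a singleton carrying a
	 * dummy num_items bit that must be deleted again at the end. */
	for (i = 0; i <= max_weight; ++i)
	{
		values[i] = 0;
		sets[i] = NULL;			/* empty set; no allocation here */
	}

	/* ... knapsack iterations using bms_add_member() etc. ... */

	result = bms_copy(sets[max_weight]);	/* no bms_del_member() needed */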

-- 
Best Regards
Andy Fan





Re: Report planning memory in EXPLAIN ANALYZE

2024-01-18 Thread Alvaro Herrera
On 2024-Jan-18, Ashutosh Bapat wrote:

> The EXPLAIN output produces something like below
>          explain_filter
>  -----------------------------------------------------------
>   Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
>   Planning:
>     Memory: used=N bytes, allocated=N bytes
>  (3 rows)
> 
> but function explain_filter(), defined in explain.sql, removes the line
> containing "Planning:" and we end up with this output:
>          explain_filter
>  -----------------------------------------------------------
>   Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
>     Memory: used=N bytes, allocated=N bytes
>  (2 rows)
> 
> Hence this weird difference.

Ah, okay, that makes sense.  (I actually think this is a bit insane, and
I would hope that we can revert commit eabba4a3eb71 eventually, but I
don't think that needs to hold up your proposed patch.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)




Re: remaining sql/json patches

2024-01-18 Thread Alvaro Herrera
I've been eyeballing the coverage report generated after applying all
patches (but I only checked the code added by the 0008 patch).  AFAICS
the coverage is pretty good.  Some uncovered paths:

commands/explain.c (Hmm, I think this is a preexisting bug actually)

3893          18 :		case T_TableFuncScan:
3894          18 :			Assert(rte->rtekind == RTE_TABLEFUNC);
3895          18 :			if (rte->tablefunc)
3896           0 :				if (rte->tablefunc->functype == TFT_XMLTABLE)
3897           0 :					objectname = "xmltable";
3898              :				else	/* Must be TFT_JSON_TABLE */
3899           0 :					objectname = "json_table";
3900              :			else
3901          18 :				objectname = NULL;
3902          18 :			objecttag = "Table Function Name";
3903          18 :			break;

parser/gram.y:

16940             : json_table_plan_cross:
16941             :			json_table_plan_primary CROSS json_table_plan_primary
16942          39 :				{ $$ = makeJsonTableJoinedPlan(JSTPJ_CROSS, $1, $3, @1); }
16943             :			| json_table_plan_cross CROSS json_table_plan_primary
16944           0 :				{ $$ = makeJsonTableJoinedPlan(JSTPJ_CROSS, $1, $3, @1); }
Not really sure how critical this one is TBH.


utils/adt/jsonpath_exec.c:

3492              : /* Recursively reset scan and its child nodes */
3493              : static void
3494         120 : JsonTableRescanRecursive(JsonTablePlanState * state)
3495              : {
3496         120 :		if (state->type == JSON_TABLE_JOIN_STATE)
3497              :		{
3498           0 :			JsonTableJoinState *join = (JsonTableJoinState *) state;
3499              : 
3500           0 :			JsonTableRescanRecursive(join->left);
3501           0 :			JsonTableRescanRecursive(join->right);
3502           0 :			join->advanceRight = false;
3503              :		}

I think this one had better be covered.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"The saddest aspect of life right now is that science gathers knowledge faster
 than society gathers wisdom."  (Isaac Asimov)




Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

2024-01-18 Thread Bharath Rupireddy
On Mon, Jan 15, 2024 at 1:18 PM Bertrand Drouvot
 wrote:
>
> Hi hackers,
>
> While working on [1], we discovered (thanks Alexander for the testing) that a
> conflicting active logical slot on a standby could be "terminated" without
> leading to an "obsolete" message (see [2]).
>
> Indeed, in case of an active slot we proceed in 2 steps in
> InvalidatePossiblyObsoleteSlot():
>
> - terminate the backend holding the slot
> - report the slot as obsolete
>
> This is racy because between the two we release the mutex on the slot, which
> means that the slot's effective_xmin and effective_catalog_xmin could advance
> during that time (leading to exit the loop).
>
> I think that holding the mutex longer is not an option (given what we'd have
> to do while holding it), so the attached proposal is to instead record the
> effective_xmin and effective_catalog_xmin that were used during the backend
> termination.
>
> [1]: 
> https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
> [2]: 
> https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com
>
> Looking forward to your feedback,

IIUC, the issue is that we terminate the process holding the
replication slot, and the conflict cause that we recorded before
terminating the process changes in the next iteration due to the
advancement in effective_xmin and/or effective_catalog_xmin.

FWIW, test code something like [1] can help detect such race issues, right?

Some comments on the patch:

1.
 last_signaled_pid = active_pid;
+terminated = true;
 }

Why is a separate variable needed? Isn't last_signaled_pid != 0 enough
to determine that a process was terminated earlier?

2. If my understanding of the racy behavior is right, can the same
issue happen due to the advancement in restart_lsn? I'm not sure if it
can happen at all, but I think we can rely on previous conflict reason
instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

3. Is there a way to reproduce this racy issue, maybe by adding some
sleeps or something? If yes, can you share it here, just for the
record, and to verify that whatever fix is provided actually works?

[1]
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..d020b038bc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;
 
 	for (;;)
 	{
@@ -1417,6 +1418,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			}
 		}
 
+		/*
+		 * Check if the conflict cause recorded previously before we
+		 * terminated the process has changed now for any reason.
+		 */
+		if (conflict_prev != RS_INVAL_NONE &&
+			last_signaled_pid != 0 &&
+			conflict_prev != conflict)
+			elog(PANIC, "conflict cause recorded before terminating process %d has been changed; previous cause: %d, current cause: %d",
+				 last_signaled_pid,
+				 conflict_prev,
+				 conflict);
+
 		/* if there's no conflict, we're done */
 		if (conflict == RS_INVAL_NONE)
 		{
@@ -1499,6 +1512,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			(void) kill(active_pid, SIGTERM);
 
 			last_signaled_pid = active_pid;
+			conflict_prev = conflict;
 		}
 
 		/* Wait until the slot is released. */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Report planning memory in EXPLAIN ANALYZE

2024-01-18 Thread Ashutosh Bapat
On Thu, Jan 18, 2024 at 4:42 PM Alvaro Herrera  wrote:
>
> On 2024-Jan-18, Ashutosh Bapat wrote:
>
> Hmm ... TBH I don't like the "showed_planning" thing very much, but if
> we need to conditionalize the printing of "Planning:" on whether we
> print either of buffers or memory, maybe there's no way around something
> like what you propose.

right.

>
> However, I don't understand this output change, and I think it indicates
> a bug in the logic there:
>
> > diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
> > index 86bfdfd29e..55694505a7 100644
> > --- a/src/test/regress/expected/explain.out
> > +++ b/src/test/regress/expected/explain.out
> > @@ -331,15 +331,15 @@ select explain_filter('explain (memory) select * from int8_tbl i8');
> >   explain_filter
> >  -----------------------------------------------------------
> >   Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
> > - Planner Memory: used=N bytes allocated=N bytes
> > +   Memory: used=N bytes, allocated=N bytes
> >  (2 rows)
>
> Does this really make sense?

The EXPLAIN output produces something like below
         explain_filter
 -----------------------------------------------------------
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
  Planning:
    Memory: used=N bytes, allocated=N bytes
 (3 rows)

but function explain_filter(), defined in explain.sql, removes the line
containing "Planning:" and we end up with this output:
         explain_filter
 -----------------------------------------------------------
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
    Memory: used=N bytes, allocated=N bytes
 (2 rows)

Hence this weird difference.

-- 
Best Wishes,
Ashutosh Bapat




Re: Synchronizing slots from primary to standby

2024-01-18 Thread Amit Kapila
On Wed, Jan 17, 2024 at 4:00 PM shveta malik  wrote:
>
> PFA v63.
>

1.
+	/* User created slot with the same name exists, raise ERROR. */
+	if (!synced)
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("exiting from slot synchronization on receiving"
+					   " the failover slot \"%s\" from the primary server",
+					   remote_slot->name),
+				errdetail("A user-created slot with the same name already"
+						  " exists on the standby."));

I think the primary error message here should contain the reason for the
failure. Something like: "exiting from slot synchronization because
same name slot already exists on standby"; then we can add more details
in errdetail.

2.
+synchronize_one_slot(WalReceiverConn *wrconn, RemoteSlot *remote_slot)
{
...
+ LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+ xmin_horizon = GetOldestSafeDecodingTransactionId(true);
+ SpinLockAcquire(&slot->mutex);
+ slot->data.catalog_xmin = xmin_horizon;
+ SpinLockRelease(&slot->mutex);
...
}

Why is slot->effective_catalog_xmin not updated here? It is required by
a later call to ReplicationSlotsComputeRequiredXmin(). I see that the
prior version v60-0002 had the corresponding change but it is missing
in the latest version. Any reason?

3.
+ * Return true either if the slot is marked as RS_PERSISTENT (sync-ready) or
+ * is synced periodically (if it was already sync-ready). Return false
+ * otherwise.
+ */
+static bool
+update_and_persist_slot(RemoteSlot *remote_slot)

The second part of the above comment ("or is synced periodically (if it
was already sync-ready)") is not clear to me. Does it intend to
describe the case where we try to update the already-created temp slot
from the last call? If so, that is not very clear, because
"periodically" sounds like it could refer to the repeated sync of a
sync-ready slot.

4.
+update_and_persist_slot(RemoteSlot *remote_slot)
{
...
+ (void) local_slot_update(remote_slot);
...
}

Can we write a comment to state the reason why we don't care about the
return value here?

-- 
With Regards,
Amit Kapila.




Re: Reduce useless changes before reassembly during logical replication

2024-01-18 Thread Bharath Rupireddy
On Thu, Jan 18, 2024 at 2:47 PM Amit Kapila  wrote:
>
> On Thu, Jan 18, 2024 at 12:12 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Jan 17, 2024 at 11:45 AM li jie  wrote:
> > >
> > > Hi hackers,
> > >
> > > During logical replication, if there is a large write transaction, some
> > > spill files will be written to disk, depending on the setting of
> > > logical_decoding_work_mem.
> > >
> > > This behavior can effectively avoid OOM, but if the transaction
> > > generates a lot of change before commit, a large number of files may
> > > fill the disk. For example, you can update a TB-level table.
> > >
> > > However, I found an inelegant phenomenon. If the modified large table is not
> > > published, its changes will also be written with a large number of spill files.
> > > Look at an example below:
> >
> > Thanks. I agree that decoding and queuing the changes of unpublished
> > tables' data into reorder buffer is an unnecessary task for walsender.
> > It takes processing efforts (CPU overhead), consumes disk space and
> > uses memory configured via logical_decoding_work_mem for a replication
> > connection inefficiently.
> >
>
> This is all true but note that in successful cases (where the table is
> published) all the work done by FilterByTable (accessing caches,
> transaction-related stuff) can add noticeable overhead, as we do
> that later anyway in pgoutput_change().

Right. The overhead for published tables needs to be studied. A possible
way is to mark the checks performed in
FilterByTable/filter_by_table_cb and skip the same checks in
pgoutput_change. I'm not sure if this works without any issues, though.
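
To sketch the shape of that idea (purely hypothetical names and signature;
the actual patch's callback may well differ): the reorder buffer would ask
the output plugin whether a relation's changes can be dropped before they
are queued or spilled.

/* Hypothetical sketch only, not the patch's code. */
static bool
FilterByTable(LogicalDecodingContext *ctx, Oid relid)
{
	/* no filter callback registered: keep every change */
	if (ctx->callbacks.filter_by_table_cb == NULL)
		return false;

	/* true means "not published here, drop the change early" */
	return ctx->callbacks.filter_by_table_cb(ctx, relid);
}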

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Report planning memory in EXPLAIN ANALYZE

2024-01-18 Thread Alvaro Herrera
On 2024-Jan-18, Ashutosh Bapat wrote:

Hmm ... TBH I don't like the "showed_planning" thing very much, but if
we need to conditionalize the printing of "Planning:" on whether we
print either of buffers or memory, maybe there's no way around something
like what you propose.

However, I don't understand this output change, and I think it indicates
a bug in the logic there:

> diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
> index 86bfdfd29e..55694505a7 100644
> --- a/src/test/regress/expected/explain.out
> +++ b/src/test/regress/expected/explain.out
> @@ -331,15 +331,15 @@ select explain_filter('explain (memory) select * from int8_tbl i8');
>   explain_filter
>  -----------------------------------------------------------
>   Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
> - Planner Memory: used=N bytes allocated=N bytes
> +   Memory: used=N bytes, allocated=N bytes
>  (2 rows)

Does this really make sense?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Linux transformó mi computadora, de una `máquina para hacer cosas',
en un aparato realmente entretenido, sobre el cual cada día aprendo
algo nuevo" (Jaime Salinas)



