Re: Optimize mul_var() for var1ndigits >= 8

2024-09-04 Thread Dean Rasheed
On Tue, 3 Sept 2024 at 21:31, Tom Lane  wrote:
>
> Dean Rasheed  writes:
> > Ah, OK. I've pushed a fix.
>
> There is an open CF entry pointing at this thread [1].
> Shouldn't it be marked committed now?
>

Oops, yes I missed that CF entry. I've closed it now.

Joel, are you still planning to work on the Karatsuba multiplication
patch? If not, we should close that CF entry too.

Regards,
Dean




Re: Use XLOG_CONTROL_FILE macro everywhere?

2024-09-04 Thread Kyotaro Horiguchi
At Tue, 3 Sep 2024 12:02:26 +0300, "Anton A. Melnikov" wrote in
> In v2 removed XLOG_CONTROL_FILE from args and used it directly in the
> string.
> IMHO this makes the code more readable and will output the correct
> text if the macro changes.

Yeah, I had this in mind. Looks good to me.

> 3)
..
> Maybe include/access/xlogdefs.h would be a good place for this?
> In v2 moved some definitions from xlog_internal to xlogdefs
> and removed including xlog_internal.h from the postmaster.c.
> Please, take a look on it.

The change can help avoid disrupting existing users of the
macro. However, the file is documented as follows:

>  * Postgres write-ahead log manager record pointer and
>  * timeline number definitions

We could modify the file definition, but I'm not sure if that's the
best approach. Instead, I'd like to propose separating the file and
path-related definitions from xlog_internal.h, as shown in the
attached first patch. This change would allow some modules to include
files without unnecessary details.

The second file is your patch, adjusted based on the first patch.

I’d appreciate hearing from others on whether they find the first
patch worthwhile. If it’s not considered worthwhile, then I believe
having postmaster include xlog_internal.h would be the best approach.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 22b71830ff601717d01dc66be95bafb930b1a0ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Sep 2024 16:15:16 +0900
Subject: [PATCH 1/2] Split file-related parts from xlog_internal.h

Some modules that require XLOG file manipulation had to include
xlog_internal.h, which contains excessively detailed internal
information. This commit separates the file and file path-related
components into xlogfilepaths.h, thereby avoiding the exposure of
unrelated details of the XLOG module.
---
 src/backend/access/transam/timeline.c |   1 -
 src/backend/postmaster/pgarch.c   |   1 -
 src/backend/utils/adt/genfile.c   |   1 -
 src/bin/initdb/initdb.c   |   1 +
 src/bin/pg_archivecleanup/pg_archivecleanup.c |   2 +-
 src/bin/pg_basebackup/pg_basebackup.c |   4 +-
 src/bin/pg_basebackup/pg_receivewal.c |   1 +
 src/bin/pg_basebackup/pg_recvlogical.c|   1 -
 src/bin/pg_basebackup/receivelog.c|   2 +-
 src/bin/pg_basebackup/streamutil.c|   2 +-
 src/bin/pg_controldata/pg_controldata.c   |   1 -
 src/bin/pg_rewind/pg_rewind.c |   2 +-
 src/include/access/xlog.h |  13 +-
 src/include/access/xlog_internal.h| 178 +-
 src/include/access/xlogfilepaths.h| 219 ++
 15 files changed, 230 insertions(+), 199 deletions(-)
 create mode 100644 src/include/access/xlogfilepaths.h

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 146751ae37..18df8fd326 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -38,7 +38,6 @@
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
 #include "access/xlogarchive.h"
-#include "access/xlogdefs.h"
 #include "pgstat.h"
 #include "storage/fd.h"
 
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index 02f91431f5..4971cd76ff 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -30,7 +30,6 @@
 #include 
 
 #include "access/xlog.h"
-#include "access/xlog_internal.h"
 #include "archive/archive_module.h"
 #include "archive/shell_archive.h"
 #include "lib/binaryheap.h"
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index 24b95c32b7..17be1756e2 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -21,7 +21,6 @@
 #include 
 
 #include "access/htup_details.h"
-#include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_tablespace_d.h"
 #include "catalog/pg_type.h"
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index f00718a015..95ec8d9c92 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -65,6 +65,7 @@
 #endif
 
 #include "access/xlog_internal.h"
+#include "lib/stringinfo.h"
 #include "catalog/pg_authid_d.h"
 #include "catalog/pg_class_d.h" /* pgrminclude ignore */
 #include "catalog/pg_collation_d.h"
diff --git a/src/bin/pg_archivecleanup/pg_archivecleanup.c b/src/bin/pg_archivecleanup/pg_archivecleanup.c
index 5a124385b7..f6b33de32a 100644
--- a/src/bin/pg_archivecleanup/pg_archivecleanup.c
+++ b/src/bin/pg_archivecleanup/pg_archivecleanup.c
@@ -15,7 +15,7 @@
 #include 
 #include 
 
-#include "access/xlog_internal.h"
+#include "access/xlogfilepaths.h"
 #include "common/logging.h"
 #include "getopt_long.h"
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index e41a6cfbda..f86af782dc 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src

Re: Typos in the code and README

2024-09-04 Thread Daniel Gustafsson
> On 4 Sep 2024, at 03:25, Michael Paquier  wrote:
> 
> On Tue, Sep 03, 2024 at 12:00:13PM +0200, Daniel Gustafsson wrote:
>> I see your v17 typo fixes, and raise you a few more.  Commit 31a98934d169
>> from just now contains 2 (out of 3) sets of typos introduced in v17 so they
>> should follow along when you push the ones mentioned here.
> 
> Is that really mandatory?  I tend to worry about back branches only
> when this stuff is user-visible, like in the docs or error messages.
> This opinion varies for each individual, of course.  That's just my
> lazy opinion.

Not mandatory at all, but since you were prepping a typo backpatch anyway, I
figured these could join it to put a small dent in the risks for future
backports.

--
Daniel Gustafsson





RecoveryTargetAction is left out in xlog_internal.h

2024-09-04 Thread Kyotaro Horiguchi
Hello,

While reviewing a patch, I noticed that enum RecoveryTargetAction is
still in xlog_internal.h, even though it seems like it should be in
xlogrecovery.h. Commit 70e81861fa separated out xlogrecovery.c/h and
moved several enums related to recovery targets to
xlogrecovery.h. However, it appears that enum RecoveryTargetAction was
inadvertently left behind in that commit.

Please find the attached patch, which addresses this oversight.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 4c938f473164d75761001b31fa85e5dc215fbd9a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Sep 2024 17:12:45 +0900
Subject: [PATCH] Move enum RecoveryTargetAction to xlogrecovery.h

Commit 70e81861fa split out xlogrecovery.c/h and moved some enums
related to recovery targets to xlogrecovery.h. However, it seems that
the enum RecoveryTargetAction was inadvertently left out by that
commit. This commit moves it to xlogrecovery.h for consistency.
---
 src/include/access/xlog_internal.h | 10 --
 src/include/access/xlogrecovery.h  | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e5cdba0584..9ba01ec20e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -316,16 +316,6 @@ typedef struct XLogRecData
 	uint32		len;			/* length of rmgr data to include */
 } XLogRecData;
 
-/*
- * Recovery target action.
- */
-typedef enum
-{
-	RECOVERY_TARGET_ACTION_PAUSE,
-	RECOVERY_TARGET_ACTION_PROMOTE,
-	RECOVERY_TARGET_ACTION_SHUTDOWN,
-}			RecoveryTargetAction;
-
 struct LogicalDecodingContext;
 struct XLogRecordBuffer;
 
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index c423464e8b..c09f84765b 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -40,6 +40,16 @@ typedef enum
 	RECOVERY_TARGET_TIMELINE_NUMERIC,
 } RecoveryTargetTimeLineGoal;
 
+/*
+ * Recovery target action.
+ */
+typedef enum
+{
+	RECOVERY_TARGET_ACTION_PAUSE,
+	RECOVERY_TARGET_ACTION_PROMOTE,
+	RECOVERY_TARGET_ACTION_SHUTDOWN,
+}			RecoveryTargetAction;
+
 /* Recovery pause states */
 typedef enum RecoveryPauseState
 {
-- 
2.43.5



Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Aleksander Alekseev
Hi Shveta,

> While discussing Logical Replication's Conflict Detection and
> Resolution (CDR) design in [1] , it came to  our notice that the
> commit LSN and timestamp may not correlate perfectly i.e. commits may
> happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
> because, during the commit process, the timestamp (xactStopTimestamp)
> is captured slightly earlier than when space is reserved in the WAL.
>  [...]
> There was a suggestion in [3] to acquire the timestamp while reserving
> the space (because that happens in LSN order). The clock would need to
> be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
> main reason it's being done outside the critical section is that
> gettimeofday() may be quite expensive. There's a concept of hybrid
> clock, combining "time" and logical counter, which might be useful
> independently of CDR.
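
For illustration, a minimal sketch of such a hybrid clock, combining
wall-clock microseconds with a logical counter so that issued values are
strictly monotonic even if the system clock stalls or steps backwards
(hypothetical names, not from any patch; a real version would have to run
under the same lock that hands out LSNs):

#include <stdint.h>
#include <time.h>

static uint64_t hlc_last;	/* last issued (microseconds << 16) | counter */

static uint64_t
hlc_next(void)
{
	struct timespec ts;
	uint64_t	physical;

	clock_gettime(CLOCK_REALTIME, &ts);
	physical = ((uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000) << 16;

	/* take the wall clock if it advanced, otherwise bump the counter */
	hlc_last = (physical > hlc_last) ? physical : hlc_last + 1;
	return hlc_last;
}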

I don't think you can rely on a system clock for conflict resolution.
In a corner case a DBA can move the clock forward or backward between
recordings of Ts1 and Ts2. On top of that there is no guarantee that
2+ servers have synchronised clocks. It seems to me that what you are
proposing will just hide the problem instead of solving it in the
general case.

This is the reason why the {latest|earliest}_timestamp_wins strategies you
are proposing to use for CDR are poor strategies. In practice they
work as random_key_wins, which is not extremely useful (what it does is
basically _deleting_ random data, not solving conflicts). On top of
that strategies like this don't take into account checks and
constraints the database may have, including high-level constraints
that may not be explicitly defined in the DBMS but may exist in the
application logic.

Unfortunately I can't reference a particular article or white paper on
the subject but I know that "last write wins" was widely criticized
back in the 2010s when people were building distributed systems on
commodity hardware. In this time period I worked on several projects
as a backend software engineer and I can assure you that LWW is not
something you want.

IMO the right approach to the problem would be defining procedures for
conflict resolution that not only semi-randomly choose between two
tuples but also implement user-defined logic. Similar to INSERT
INTO ... ON CONFLICT ... semantics, or to approaches from
long-lived and well-explored distributed systems, e.g. Riak.
Alternatively / additionally we could support CRDTs in Postgres.

-- 
Best regards,
Aleksander Alekseev




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Bertrand Drouvot
Hi,

On Thu, Aug 29, 2024 at 04:04:05PM +0200, Guillaume Lelarge wrote:
> Hello,
> 
> This patch was a bit discussed on [1], and with more details on [2]. It
> introduces four new columns in pg_stat_all_tables:
> 
> * parallel_seq_scan
> * last_parallel_seq_scan
> * parallel_idx_scan
> * last_parallel_idx_scan
> 
> and two new columns in pg_stat_all_indexes:
> 
> * parallel_idx_scan
> * last_parallel_idx_scan
> 
> As Benoit said yesterday, the intent is to help administrators evaluate the
> usage of parallel workers in their databases and help configuring
> parallelization usage.

Thanks for the patch. I think that's a good idea to provide more instrumentation
in this area. So, +1 regarding this patch.

A few random comments:

1 ===

+ 
+  
+   parallel_seq_scan bigint
+  
+  
+   Number of parallel sequential scans initiated on this table
+  
+ 

I wonder if we should also update the seq_scan documentation to indicate
that it includes parallel_seq_scan.

Same kind of comment for last_seq_scan, idx_scan and last_idx_scan.

2 ===

@@ -410,6 +410,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool 
keep_startblock)
 */
if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
pgstat_count_heap_scan(scan->rs_base.rs_rd);
+  if (scan->rs_base.rs_parallel != NULL)
+   pgstat_count_parallel_heap_scan(scan->rs_base.rs_rd);

Indentation seems broken.

Shouldn't the parallel counter rely on the
"scan->rs_base.rs_flags & SO_TYPE_SEQSCAN" test too?

What about getting rid of pgstat_count_parallel_heap_scan and adding an extra
boolean parameter to pgstat_count_heap_scan to indicate whether
counts.parallelnumscans should be incremented too?

Something like:

pgstat_count_heap_scan(scan->rs_base.rs_rd, scan->rs_base.rs_parallel != NULL)

3 ===

Same comment for pgstat_count_index_scan (add an extra boolean parameter) and
get rid of pgstat_count_parallel_index_scan().

I think that 2 === and 3 === would help to avoid missing increments should we
add those calls to other places in the future.
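
To illustrate 2 === and 3 ===, here is a rough sketch of what the combined
macro could look like (hypothetical: it mirrors the shape of the existing
pgstat_count_heap_scan() macro, and parallelnumscans is the counter added
by this patch):

#define pgstat_count_heap_scan(rel, parallel)					\
	do {														\
		if (pgstat_should_count_relation(rel))					\
		{														\
			(rel)->pgstat_info->counts.numscans++;				\
			if (parallel)										\
				(rel)->pgstat_info->counts.parallelnumscans++;	\
		}														\
	} while (0)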

4 ===

+   if (lstats->counts.numscans || lstats->counts.parallelnumscans)

Is it possible to have (lstats->counts.parallelnumscans) without having
(lstats->counts.numscans)?

> First time I had to add new columns to a statistics catalog. I'm actually
> not sure that we were right to change pg_proc.dat manually.

I think that's the right way to do it.

I don't see a CF entry for this patch. Would you mind creating one so that
we don't lose track of it?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: list of acknowledgments for PG17

2024-09-04 Thread Peter Eisentraut

On 03.09.24 19:59, Matthias van de Meent wrote:
I've done a small check of the list (search for first names in the 
document), and found the following curiosities:


I see various cases where what seem to be names with Chinese origin are 
styled differently between this list and the feature list. Specifically, 
"Acknowledged", vs "Feature List":


"Mingli Zhang", vs "Zhang Mingli"
"Zhijie Hou", vs "Hou Zhijie" (cc-ed)
"Zongliang Quan", vs " Quan Zongliang"

I don't know if one order is preferred over the other, but it seems 
inconsistent. Hou, do you have more info on this?


The list of acknowledgments consistently uses given name followed by 
family name.  I suspect the release notes are not as rigid about that.



Next, I noticed some other inconsistent names:

"Hubert Lubaczewski", vs "Hubert Depesz Lubaczewski"
"Paul Jungwirth", vs "Paul A. Jungwirth"


In some cases like that, I default to using the name variant that was 
used in past release notes, so it's consistent over time.






Re: POC: make mxidoff 64 bits

2024-09-04 Thread Maxim Orlov
On Tue, 3 Sept 2024 at 16:32, Alexander Korotkov wrote:

> I don't think you need to maintain CATALOG_VERSION_NO change in your
> patch for the exact reason you have mentioned: patch will get conflict
> each time CATALOG_VERSION_NO is advanced.  It's responsibility of
> committer to advance CATALOG_VERSION_NO when needed.
>

OK, I got it. My intention here was to help with testing the patch: if
someone wants to have a look at the patch, they won't need to make changes
in the code. In the next iteration, I'll remove the CATALOG_VERSION_NO
change.

-- 
Best regards,
Maxim Orlov.


Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-04 Thread jian he
On Wed, Jul 10, 2024 at 5:36 PM Tatsuo Ishii  wrote:
>
>
> Attached are the v2 patches. As suggested by David, I split them
> into multiple patches so that each patch implements the feature for
> each node. You need to apply the patches in the order of patch number
> (if you want to apply all of them, "git apply v2-*.patch" should
> work).
>
> v2-0001-Refactor-show_material_info.patch:
> This refactors show_material_info(). The guts are moved to new
> show_storage_info() so that it can be shared by not only Materialized
> node.
>
> v2-0002-Add-memory-disk-usage-for-CTE-Scan-nodes-in-EXPLA.patch:
> This adds memory/disk usage for CTE Scan nodes in EXPLAIN (ANALYZE) command.
>
> v2-0003-Add-memory-disk-usage-for-Table-Function-Scan-nod.patch:
> This adds memory/disk usage for Table Function Scan nodes in EXPLAIN 
> (ANALYZE) command.
>
> v2-0004-Add-memory-disk-usage-for-Recursive-Union-nodes-i.patch:
> This adds memory/disk usage for Recursive Union nodes in EXPLAIN
> (ANALYZE) command. Also show_storage_info() is changed so that it
> accepts int64 storage_used, char *storage_type arguments. They are
> used if the target node uses multiple tuplestores, in case a simple
> call to tuplestore_space_used() does not work. Such executor nodes
> need to collect storage_used while running the node. This type of node
> includes Recursive Union and Window Aggregate.
>
> v2-0005-Add-memory-disk-usage-for-Window-Aggregate-nodes-.patch: This
> adds memory/disk usage for Window Aggregate nodes in EXPLAIN (ANALYZE)
> command. Note that if David's proposal
> https://www.postgresql.org/message-id/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com
> is committed, this will need to be adjusted.
>
> For a demonstration, how storage/memory usage is shown in EXPLAIN
> (notice "Storage: Memory Maximum Storage: 120kB" etc.). The script
> used is attached (test.sql.txt). The SQLs are shamelessly copied from
> David's example and the regression test (some of them were modified by
> me).
>

hi. I can roughly understand it.

I have one minor issue with the comment.

typedef struct RecursiveUnionState
{
	PlanState	ps;				/* its first field is NodeTag */
	bool		recursing;
	bool		intermediate_empty;
	Tuplestorestate *working_table;
	Tuplestorestate *intermediate_table;
	int64		storageSize;	/* max storage size Tuplestore */
	char	   *storageType;	/* the storage type above */
}

"/* the storage type above */"
is kind of ambiguous, since there is more than one Tuplestorestate.

i think it roughly means: the storage type of working_table
while the max storage of working_table.



typedef struct WindowAggState
{
	ScanState	ss;				/* its first field is NodeTag */

	/* these fields are filled in by ExecInitExpr: */
	List	   *funcs;			/* all WindowFunc nodes in targetlist */
	int			numfuncs;		/* total number of window functions */
	int			numaggs;		/* number that are plain aggregates */

	WindowStatePerFunc perfunc; /* per-window-function information */
	WindowStatePerAgg peragg;	/* per-plain-aggregate information */
	ExprState  *partEqfunction; /* equality funcs for partition columns */
	ExprState  *ordEqfunction;	/* equality funcs for ordering columns */
	Tuplestorestate *buffer;	/* stores rows of current partition */
	int64		storageSize;	/* max storage size in buffer */
	char	   *storageType;	/* the storage type above */
}

" /* the storage type above */"
I think it roughly means:
" the storage type of WindowAggState->buffer while the max storage of
WindowAggState->buffer".
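
Perhaps something along these lines would be clearer (a wording suggestion
only, not from the patch):

	int64		storageSize;	/* max storage size used by buffer */
	char	   *storageType;	/* storage type (memory or disk) of buffer
								 * when storageSize was recorded */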




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-04 Thread shveta malik
On Wed, Sep 4, 2024 at 9:17 AM shveta malik  wrote:
>
> On Tue, Sep 3, 2024 at 3:01 PM shveta malik  wrote:
> >
> >


1)
It is related to one of my previous comments (pt 3 in [1]) where I
stated that inactive_since should not keep on changing once a slot is
invalidated.
Below is one side effect if inactive_since keeps on changing:

postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
pg_current_wal_lsn());
ERROR:  can no longer get changes from replication slot "mysubnew1_1"
DETAIL:  The slot became invalid because it was inactive since
2024-09-04 10:03:56.68053+05:30, which is more than 10 seconds ago.
HINT:  You might need to increase "replication_slot_inactive_timeout.".

postgres=# select now();
               now
---------------------------------
 2024-09-04 10:04:00.26564+05:30

'DETAIL' gives wrong information; we are not past 10 seconds. This is
because inactive_since got updated even in the ERROR scenario.


2)
One more issue with this message: once I set
replication_slot_inactive_timeout to a bigger value, it becomes more
misleading. This is because invalidation was done in the past using the
previous value, while the message starts showing the new value:

ALTER SYSTEM SET replication_slot_inactive_timeout TO '36h';

--see 129600 secs in DETAIL and the current time.
postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
pg_current_wal_lsn());
ERROR:  can no longer get changes from replication slot "mysubnew1_1"
DETAIL:  The slot became invalid because it was inactive since
2024-09-04 10:06:38.980939+05:30, which is more than 129600 seconds
ago.
postgres=# select now();
               now
----------------------------------
 2024-09-04 10:07:35.201894+05:30

I feel we should change this message itself.

~

When invalidation is due to wal_removed, we get a way simpler message:

newdb1=# SELECT * FROM pg_replication_slot_advance('mysubnew1_2',
pg_current_wal_lsn());
ERROR:  replication slot "mysubnew1_2" cannot be advanced
DETAIL:  This slot has never previously reserved WAL, or it has been
invalidated.

This message does not mention 'max_slot_wal_keep_size'. We should have
a similar message for our case. Thoughts?

[1]:  
https://www.postgresql.org/message-id/CAJpy0uC8Dg-0JS3NRUwVUemgz5Ar2v3_EQQFXyAigWSEQ8U47Q%40mail.gmail.com

thanks
Shveta




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-04 Thread Peter Eisentraut

On 03.09.24 22:56, Jacob Champion wrote:

The parse_strval field could use a better explanation.

I actually don't understand the need for this field.  AFAICT, this is
just used to record whether strval is valid.

No, it's meant to track the value of the need_escapes argument to the
constructor. I've renamed it and moved the assignment to hopefully
make that a little more obvious. WDYT?


Yes, this is clearer.

This patch (v28-0001) looks good to me now.





Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread Jakub Wartak
Hi Tomas!

On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra  wrote:
>
> On 9/3/24 17:06, Robert Haas wrote:
> > On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra  wrote:
> >> The one argument to not tie this to max_locks_per_transaction is the
> >> vastly different "per element" memory requirements. If you add one entry
> >> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
> >> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
> >> difference, and if the locks fit into the shared lock table, but
> >> you'd like to allow more fast-path locks, having to increase
> >> max_locks_per_transaction is not great - pretty wasteful.
> >>
> >> OTOH I'd really hate to just add another GUC and hope the users will
> >> magically know how to set it correctly. That's pretty unlikely, IMO. I
> >> myself wouldn't know what a good value is, I think.
> >>
> >> But say we add a GUC and set it to -1 by default, in which case it just
> >> inherits the max_locks_per_transaction value. And then also provide some
> >> basic metric about this fast-path cache, so that people can tune this?
> >
> > All things being equal, I would prefer not to add another GUC for
> > this, but we might need it.
> >
>
> Agreed.
>
> [..]
>
> So I think I'm OK with just tying this to max_locks_per_transaction.

If that matters, then the SLRU configurability effort added 7 GUCs
(with 3 scaling up based on shared_buffers) just to give high-end
users some relief, so one new GUC here shouldn't be such a big deal. We
could add a note to the LWLock/lock_manager wait event docs
recommending known-to-be-good values from this $thread (or ask the
user to benchmark it himself).

> >> I think just knowing the "hit ratio" would be enough, i.e. counters for
> >> how often it fits into the fast-path array, and how often we had to
> >> promote it to the shared lock table would be enough, no?
> >
> > Yeah, probably. I mean, that won't tell you how big it needs to be,
> > but it will tell you whether it's big enough.
> >
>
> True, but that applies to all "cache hit ratio" metrics (like for our
> shared buffers). It'd be great to have something better, enough to tell
> you how large the cache needs to be. But we don't :-(

My $0.02: the originating case that triggered those patches actually
started with LWLock/lock_manager waits being the top #1 wait event. The
operator can cross-check (join) that with a group by on
pg_locks.fastpath (= 'f'), count(*). So, IMHO we have good
observability in this case (a rare thing to say!)
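
For example, a cross-check of roughly that shape (illustrative only):

SELECT fastpath, mode, count(*)
FROM pg_locks
WHERE locktype = 'relation'
GROUP BY fastpath, mode
ORDER BY count(*) DESC;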

> > I wonder if we should be looking at further improvements in the lock
> > manager of some kind. [..]
>
> Perhaps. I agree we'll probably need something more radical soon, not
> just changes that aim to fix some rare exceptional case (which may be
> annoying, but not particularly harmful for the complete workload).
>
> For example, if we did what you propose, that might help when very few
> transactions need a lot of locks. I don't mind saving memory in that
> case, ofc. but is it a problem if those rare cases are a bit slower?
> Shouldn't we focus more on cases where many locks are common? Because
> people are simply going to use partitioning, a lot of indexes, etc?
>
> So yeah, I agree we probably need a more fundamental rethink. I don't
> think we can just keep optimizing the current approach, there's a limit
> of fast it can be.

Please help me understand: are you both discussing potential far-future
improvements rather than this one? My question is really: is the
patchset good enough, or are you considering some other new effort
instead?

BTW some other random questions:
Q1. I've been looking into
https://github.com/tvondra/pg-lock-scalability-results and those results
shouldn't be used anymore for further discussion, as they contained
earlier patches (including
0003-Add-a-memory-pool-with-adaptive-rebalancing.patch) and were
replaced by the benchmark data in this $thread, right?
Q2. Earlier attempts did contain a mempool patch to get those nice
numbers (or was that jemalloc or glibc tuning?). So were those recent
results in [1] still collected with 0003, or have you switched
completely to glibc/jemalloc tuning?

-J.

[1] - 
https://www.postgresql.org/message-id/b8c43eda-0c3f-4cb4-809b-841fa5c40ada%40vondra.me




Re: Add callbacks for fixed-numbered stats flush in pgstats

2024-09-04 Thread Kyotaro Horiguchi
At Wed, 4 Sep 2024 15:12:37 +0900, Michael Paquier wrote in
> On Wed, Sep 04, 2024 at 05:28:56AM +, Bertrand Drouvot wrote:
> > On Wed, Sep 04, 2024 at 02:05:46PM +0900, Kyotaro Horiguchi wrote:
> >> The generalization sounds good to me, and hiding the private flags in
> >> private places also seems good.
> > 
> > +1 on the idea.
> 
> Thanks for the feedback.
> 
> >> Regarding pgstat_io_flush_cb, I think it no longer needs to check the
> >> have_iostats variable, and that check should be moved to the new
> >> pgstat_flush_io(). The same applies to pgstat_wal_flush_cb() (and
> >> pgstat_flush_wal()). As for pgstat_slru_flush_cb, it simply doesn't
> >> need the check anymore.
> > 
> > Agree. Another option could be to add only one callback (the
> > flush_fixed_cb one) and get rid of the have_fixed_pending_cb one. Then add
> > an extra parameter to the flush_fixed_cb that would only check if there
> > are pending stats (without flushing them). I think those 2 callbacks are
> > highly related, that's why I think we could "merge" them, thoughts?
> 
> I would still value the shortcut that we can use if there is no
> activity to avoid the clock check with GetCurrentTimestamp(), though,
> which is why I'd rather stick with two callbacks as that can lead to a
> much cheaper path.

Yeah, I was wrong on this point.

> Anyway, I am not sure to follow your point about removing the boolean
> checks in the flush callbacks.  It's surely always a better to never
> call LWLockAcquire() or LWLockConditionalAcquire() if we don't have
> to, and we would go through the whole flush dance as long as we detect
> some stats activity in at least one stats kind.  I mean, that's cheap
> enough to keep around.

I doubt that overprotection is always better, but in this case, it's
not overprotection at all. The flush callbacks are called
unconditionally once we decide to flush anything. Sorry for the noise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add callback in pgstats for backend initialization

2024-09-04 Thread Kyotaro Horiguchi
At Wed, 4 Sep 2024 15:04:09 +0900, Michael Paquier wrote in
> On Wed, Sep 04, 2024 at 02:15:43PM +0900, Kyotaro Horiguchi wrote:
> > The name "init_backend" makes it sound like the function initializes
> > the backend. backend_init might be a better choice, but I'm not sure.
> 
> We (kind of) tend to prefer $verb_$thing-or-action_cb for the name of
> the callbacks, which is why I chose this naming.  If you feel strongly
> about "backend_init_cb", that's also fine by me; you are the original
> author of this code area with Andres.

More accurately, I believe this applies when the sentence follows a
verb-object structure. In this case, the function’s meaning is “pgstat
initialization on backend,” which seems somewhat different from the
policy you mentioned. However, it could also be interpreted as
“initialize backend status related to pgstat.” Either way, I’m okay
with the current name if you are, based on the discussion above.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: json_query conditional wrapper bug

2024-09-04 Thread Peter Eisentraut

On 28.08.24 11:21, Peter Eisentraut wrote:

These are ok:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' without wrapper);
 json_query
------------
 42

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with unconditional wrapper);
 json_query
------------
 [42]

But this appears to be wrong:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional wrapper);
 json_query
------------
 [42]

This should return an unwrapped 42.


If I make the code change illustrated in the attached patch, then I get 
the correct result here.  And various regression test results change, 
which, to me, all look more correct after this patch.  I don't know what 
the code I removed was supposed to accomplish, but it seems to be wrong 
somehow.  In the current implementation, the WITH CONDITIONAL WRAPPER 
clause doesn't appear to work correctly in any case I could identify.
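
For illustration, the behavior one would expect from the patch's
"wrap = count > 1" logic (examples of my own, not from the thread):

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional wrapper);
-- expected: 42 (a single scalar, so no wrapping)

select json_query('[1, 2, 3]'::jsonb, 'lax $[*]' with conditional wrapper);
-- expected: [1, 2, 3] (three items, so wrapped)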
From ea8a36584abf6cea0a05ac43ad8d101012b14313 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Sep 2024 12:09:35 +0200
Subject: [PATCH] WIP: Fix JSON_QUERY WITH CONDITIONAL WRAPPER

---
 src/backend/utils/adt/jsonpath_exec.c   |  5 +
 src/test/regress/sql/sqljson_queryfuncs.sql | 10 +-
 2 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index e3ee0093d4d..a55a34685f0 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -3957,10 +3957,7 @@ JsonPathQuery(Datum jb, JsonPath *jp, JsonWrapper wrapper, bool *empty,
else if (wrapper == JSW_UNCONDITIONAL)
wrap = true;
else if (wrapper == JSW_CONDITIONAL)
-   wrap = count > 1 ||
-   IsAJsonbScalar(singleton) ||
-   (singleton->type == jbvBinary &&
-JsonContainerIsScalar(singleton->val.binary.data));
+   wrap = count > 1;
else
{
elog(ERROR, "unrecognized json wrapper %d", (int) wrapper);
diff --git a/src/test/regress/sql/sqljson_queryfuncs.sql b/src/test/regress/sql/sqljson_queryfuncs.sql
index 21ff7787a28..a7342416a9c 100644
--- a/src/test/regress/sql/sqljson_queryfuncs.sql
+++ b/src/test/regress/sql/sqljson_queryfuncs.sql
@@ -146,11 +146,11 @@ CREATE DOMAIN rgb AS rainbow CHECK (VALUE IN ('red', 'green', 'blue'));
 SELECT JSON_VALUE(NULL::jsonb, '$');
 
 SELECT
-   JSON_QUERY(js, '$'),
-   JSON_QUERY(js, '$' WITHOUT WRAPPER),
-   JSON_QUERY(js, '$' WITH CONDITIONAL WRAPPER),
-   JSON_QUERY(js, '$' WITH UNCONDITIONAL ARRAY WRAPPER),
-   JSON_QUERY(js, '$' WITH ARRAY WRAPPER)
+   JSON_QUERY(js, '$') AS "unspec",
+   JSON_QUERY(js, '$' WITHOUT WRAPPER) AS "without",
+   JSON_QUERY(js, '$' WITH CONDITIONAL WRAPPER) AS "with cond",
+   JSON_QUERY(js, '$' WITH UNCONDITIONAL ARRAY WRAPPER) AS "with uncond",
+   JSON_QUERY(js, '$' WITH ARRAY WRAPPER) AS "with"
 FROM
(VALUES
(jsonb 'null'),
-- 
2.46.0



Re: Virtual generated columns

2024-09-04 Thread Dean Rasheed
On Wed, 4 Sept 2024 at 09:40, Peter Eisentraut  wrote:
>
> On 21.08.24 12:51, Dean Rasheed wrote:
> >>
> > Well what I was thinking was that (in fireRIRrules()'s final loop over
> > relations in the rtable), if the relation had any virtual generated
> > columns, you'd build a targetlist containing a TLE for each one,
> > containing the generated expression. Then you could just call
> > ReplaceVarsFromTargetList() to replace any Vars in the query with the
> > corresponding generated expressions.
>
> Here is an implementation of this.  It's much nicer!  It also appears to
> fix all the additional test cases that have been presented.  (I haven't
> integrated them into the patch set yet.)
>
> I left the 0001 patch alone for now and put the new rewriting
> implementation into 0002.  (Unfortunately, the diff is kind of useless
> for visual inspection.)  Let me know if this matches what you had in
> mind, please.  Also, is this the right place in fireRIRrules()?

Yes, that's what I had in mind except that it has to be called from
the second loop in fireRIRrules(), after any RLS policies have been
added, because it's possible for a RLS policy expression to refer to
virtual generated columns. It's OK to do it in the same loop that
expands RLS policies, because such policies can only refer to columns
of the same relation, so once the RLS policies have been expanded for
a given relation, nothing else should get added to the query that can
refer to columns of that relation, at that query level, so at that
point it should be safe to expand virtual generated columns.
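
For example (using the virtual generated column syntax proposed in this
thread; illustrative only), this is the kind of policy that forces that
ordering:

CREATE TABLE t (
    a int,
    b int GENERATED ALWAYS AS (a * 2) VIRTUAL
);
ALTER TABLE t ENABLE ROW LEVEL SECURITY;
-- The policy qual references the virtual column b, so the policy must be
-- added to the query before b is replaced by its generating expression.
CREATE POLICY p ON t USING (b < 100);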

Regards,
Dean




Re: Use streaming read API in ANALYZE

2024-09-04 Thread Thomas Munro
Thanks for the explanation.  I think we should revert it.  IMHO it was
a nice clean example of a streaming transformation, but unfortunately
it transformed an API that nobody liked in the first place, and broke
some weird and wonderful workarounds.  Let's try again in 18.




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread Tomas Vondra
On 9/4/24 11:29, Jakub Wartak wrote:
> Hi Tomas!
> 
> On Tue, Sep 3, 2024 at 6:20 PM Tomas Vondra  wrote:
>>
>> On 9/3/24 17:06, Robert Haas wrote:
>>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra  wrote:
>>>> The one argument to not tie this to max_locks_per_transaction is the
>>>> vastly different "per element" memory requirements. If you add one entry
>>>> to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
>>>> OTOH one fast-path entry is ~5B, give or take. That's a pretty big
>>>> difference, and if the locks fit into the shared lock table, but
>>>> you'd like to allow more fast-path locks, having to increase
>>>> max_locks_per_transaction is not great - pretty wasteful.
>>>>
>>>> OTOH I'd really hate to just add another GUC and hope the users will
>>>> magically know how to set it correctly. That's pretty unlikely, IMO. I
>>>> myself wouldn't know what a good value is, I think.
>>>>
>>>> But say we add a GUC and set it to -1 by default, in which case it just
>>>> inherits the max_locks_per_transaction value. And then also provide some
>>>> basic metric about this fast-path cache, so that people can tune this?
>>>
>>> All things being equal, I would prefer not to add another GUC for
>>> this, but we might need it.
>>>
>>
>> Agreed.
>>
>> [..]
>>
>> So I think I'm OK with just tying this to max_locks_per_transaction.
> 
> If that matters then the SLRU configurability effort added 7 GUCs
> (with 3 scaling up based on shared_buffers) just to give high-end
> users some relief, so here 1 new shouldn't be that such a deal. We
> could add to the LWLock/lock_manager wait event docs to recommend just
> using known-to-be-good certain values from this $thread (or ask the
> user to benchmark it himself).
> 

TBH I'm skeptical we'll be able to tune those GUCs. Maybe it was the
right thing for the SLRU thread, I don't know - I haven't been following
that very closely. But my impression is that we often add a GUC when
we're not quite sure how to pick a good value. So we just shift the
responsibility to someone else, who however also doesn't know.

I'd very much prefer not to do that here. Of course, it's challenging
because we can't easily resize these arrays, so even if we had some nice
heuristics to calculate the "optimal" number of fast-path slots, what
would we do with it ...

>>>> I think just knowing the "hit ratio" would be enough, i.e. counters for
>>>> how often it fits into the fast-path array, and how often we had to
>>>> promote it to the shared lock table would be enough, no?
>>>
>>> Yeah, probably. I mean, that won't tell you how big it needs to be,
>>> but it will tell you whether it's big enough.
>>>
>>
>> True, but that applies to all "cache hit ratio" metrics (like for our
>> shared buffers). It'd be great to have something better, enough to tell
>> you how large the cache needs to be. But we don't :-(
> 
> My $0.02 cents: the originating case that triggered those patches,
> actually started with LWLock/lock_manager waits being the top#1. The
> operator can cross check (join) that with a group by pg_locks.fastpath
> (='f'), count(*). So, IMHO we have good observability in this case
> (rare thing to say!)
> 

That's a good point. So if you had to give some instructions to users
what to measure / monitor, and how to adjust the GUC based on that, what
would your instructions be?

>>> I wonder if we should be looking at further improvements in the lock
>>> manager of some kind. [..]
>>
>> Perhaps. I agree we'll probably need something more radical soon, not
>> just changes that aim to fix some rare exceptional case (which may be
>> annoying, but not particularly harmful for the complete workload).
>>
>> For example, if we did what you propose, that might help when very few
>> transactions need a lot of locks. I don't mind saving memory in that
>> case, ofc. but is it a problem if those rare cases are a bit slower?
>> Shouldn't we focus more on cases where many locks are common? Because
>> people are simply going to use partitioning, a lot of indexes, etc?
>>
>> So yeah, I agree we probably need a more fundamental rethink. I don't
>> think we can just keep optimizing the current approach, there's a limit
>> of fast it can be.
> 
> Please help me understand: are you both discussing potential far-future
> improvements rather than this one? My question is really: is the
> patchset good enough, or are you considering some other new effort
> instead?
> 

I think it was mostly a brainstorming about alternative / additional
improvements in locking. The proposed patch does not change the locking
in any fundamental way, it merely optimizes one piece - we still acquire
exactly the same set of locks, exactly the same way.

AFAICS there's an agreement the current approach has limits, and with
the growing number of partitions we're hitting them already. That may
need rethinking the fundamental approach, but I think that should not
block improvements to the current approach.

Not

Re: First draft of PG 17 release notes

2024-09-04 Thread jian he
hi.

Allow partitions to be merged using ALTER TABLE ... MERGE PARTITIONS
(Dmitry Koval)
Allow partitions to be split using ALTER TABLE ... SPLIT PARTITION
(Dmitry Koval)

These two items also got reverted? See
https://git.postgresql.org/cgit/postgresql.git/commit/?id=3890d90c1508125729ed20038d90513694fc3a7b




Re: Improving the latch handling between logical replication launcher and worker processes.

2024-09-04 Thread vignesh C
On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi  wrote:
>
> At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C  wrote in
> > The attached v2 version patch has the changes for the same.
>
> Sorry for jumping in at this point. I've just reviewed the latest
> patch (v2), and the frequent Own/Disown-Latch operations caught my
> attention. Additionally, handling multiple concurrently operating
> trigger sources with nested latch waits seems bug-prone, which I’d
> prefer to avoid from both a readability and safety perspective.
>
> With that in mind, I’d like to suggest an alternative approach. I may
> not be fully aware of all the previous discussions, so apologies if
> this idea has already been considered and dismissed.
>
> Currently, WaitForReplicationWorkerAttach() and
> logicalrep_worker_stop_internal() wait on a latch after verifying the
> desired state. This ensures that even if there are spurious or missed
> wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
> enters a latch wait without checking the desired state
> first. Consequently, if another process sets the latch to wake up the
> main loop while the former two functions are waiting, that wakeup
> could be missed. If my understanding is correct, the problem lies in
> ApplyLauncherMain() not checking the expected state before beginning
> to wait on the latch. There is no issue with waiting if the state
> hasn't been satisfied yet.
>
> So, I propose that ApplyLauncherMain() should check the condition that
> triggers a main loop wakeup before calling WaitLatch(). To do this, we
> could add a flag in LogicalRepCtxStruct to signal that the main loop
> has immediate tasks to handle. ApplyLauncherWakeup() would set this
> flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
> this flag before calling WaitLatch() and skip the WaitLatch() call if
> the flag is set.
>
> I think this approach could solve the issue without adding
> complexity. What do you think?

I agree that this approach is more simple than the other approach. How
about something like the attached patch to handle the same.

Regards,
Vignesh
From 3efa4ea7dff937f60d16664ec4e3a79f798a72ab Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Wed, 4 Sep 2024 16:39:07 +0530
Subject: [PATCH v3] Resolve a problem where detection of concurrent
 subscription creation was skipped.

The current implementation uses a single latch for multiple purposes:
a) attaching worker processes, b) handling worker process exits, and
c) creating subscriptions. This overlap causes issues in concurrent scenarios,
such as when the launcher starts a new apply worker and is waiting for it to
attach, while simultaneously a new subscription creation request arrives.

This commit addresses the issue by adding a new launcher_reload_sub flag
to the LogicalRepCtxStruct. Although the latch will be reset once the
worker is attached, the launcher can still detect concurrent operations
using this flag. The flag will be activated during new subscription
creation or modification, as well as when an apply worker exits. The
launcher will then check this flag in its main loop, allowing it to
reload the subscription list and initiate the necessary apply workers.
---
 src/backend/replication/logical/launcher.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index e5fdca8bbf..92f1253b39 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -62,6 +62,9 @@ typedef struct LogicalRepCtxStruct
 	dsa_handle	last_start_dsa;
 	dshash_table_handle last_start_dsh;
 
+	/* Flag to reload the subscription list and start worker if required */
+	bool launcher_reload_sub;
+
 	/* Background workers. */
 	LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER];
 } LogicalRepCtxStruct;
@@ -1118,7 +1121,10 @@ static void
 ApplyLauncherWakeup(void)
 {
 	if (LogicalRepCtx->launcher_pid != 0)
+	{
+		LogicalRepCtx->launcher_reload_sub = true;
 		kill(LogicalRepCtx->launcher_pid, SIGUSR1);
+	}
 }
 
 /*
@@ -1164,6 +1170,8 @@ ApplyLauncherMain(Datum main_arg)
 	   ALLOCSET_DEFAULT_SIZES);
 		oldctx = MemoryContextSwitchTo(subctx);
 
+		LogicalRepCtx->launcher_reload_sub = false;
+
 		/* Start any missing workers for enabled subscriptions. */
 		sublist = get_subscription_list();
 		foreach(lc, sublist)
@@ -1220,6 +1228,9 @@ ApplyLauncherMain(Datum main_arg)
 		/* Clean the temporary memory. */
 		MemoryContextDelete(subctx);
 
+		if (LogicalRepCtx->launcher_reload_sub)
+			continue;
+
 		/* Wait for more work. */
 		rc = WaitLatch(MyLatch,
 	   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-- 
2.34.1



Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 2:05 PM Aleksander Alekseev wrote:
>
> > While discussing Logical Replication's Conflict Detection and
> > Resolution (CDR) design in [1] , it came to  our notice that the
> > commit LSN and timestamp may not correlate perfectly i.e. commits may
> > happen with LSN1 < LSN2 but with Ts1 > Ts2. This issue may arise
> > because, during the commit process, the timestamp (xactStopTimestamp)
> > is captured slightly earlier than when space is reserved in the WAL.
> >  [...]
> > There was a suggestion in [3] to acquire the timestamp while reserving
> > the space (because that happens in LSN order). The clock would need to
> > be monotonic (easy enough with CLOCK_MONOTONIC), but also cheap. The
> > main reason it's being done outside the critical section is that
> > gettimeofday() may be quite expensive. There's a concept of hybrid
> > clock, combining "time" and logical counter, which might be useful
> > independently of CDR.
>
> I don't think you can rely on a system clock for conflict resolution.
> In a corner case a DBA can move the clock forward or backward between
> recordings of Ts1 and Ts2. On top of that there is no guarantee that
> 2+ servers have synchronised clocks. It seems to me that what you are
> proposing will just hide the problem instead of solving it in the
> general case.
>

It is possible that we can't rely on the system clock for conflict
resolution but that is not the specific point of this thread. As
mentioned in the subject of this thread, the problem is "Commit
Timestamp and LSN Inversion issue". The LSN value and timestamp for a
commit are not generated atomically, so two different transactions can
have them in different order.

Your point as far as I can understand is that in the first place, it
is not a good idea to have a strategy like "last_write_wins" which
relies on the system clock. So, even if LSN->Timestamp ordering has
any problem, it won't matter to us. Now, we can discuss whether
"last_write_wins" is a poor strategy or not but if possible, for the
sake of the point of this thread, let's assume that users using the
resolution feature ("last_write_wins") ensure that clocks are synced
or they won't enable this feature and then see if we can think of any
problem w.r.t the current code.

-- 
With Regards,
Amit Kapila.




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Guillaume Lelarge
Hi,

On Wed, Sep 4, 2024 at 10:47, Bertrand Drouvot wrote:

> Hi,
>
> On Thu, Aug 29, 2024 at 04:04:05PM +0200, Guillaume Lelarge wrote:
> > Hello,
> >
> > This patch was a bit discussed on [1], and with more details on [2]. It
> > introduces four new columns in pg_stat_all_tables:
> >
> > * parallel_seq_scan
> > * last_parallel_seq_scan
> > * parallel_idx_scan
> > * last_parallel_idx_scan
> >
> > and two new columns in pg_stat_all_indexes:
> >
> > * parallel_idx_scan
> > * last_parallel_idx_scan
> >
> > As Benoit said yesterday, the intent is to help administrators evaluate
> the
> > usage of parallel workers in their databases and help configuring
> > parallelization usage.
>
> Thanks for the patch. I think that's a good idea to provide more
> instrumentation
> in this area. So, +1 regarding this patch.
>
>
Thanks.


> A few random comments:
>
> 1 ===
>
> + 
> +  
> +   parallel_seq_scan bigint
> +  
> +  
> +   Number of parallel sequential scans initiated on this table
> +  
> + 
>
> I wonder if we should also update the seq_scan documentation to indicate
> that it includes parallel_seq_scan.
>
> Same kind of comment for last_seq_scan, idx_scan and last_idx_scan.
>
>
Yeah, not sure why I didn't do it at first. I was wondering the same thing.
The patch attached does this.


> 2 ===
>
> @@ -410,6 +410,8 @@ initscan(HeapScanDesc scan, ScanKey key, bool
> keep_startblock)
>  */
> if (scan->rs_base.rs_flags & SO_TYPE_SEQSCAN)
> pgstat_count_heap_scan(scan->rs_base.rs_rd);
> +  if (scan->rs_base.rs_parallel != NULL)
> +   pgstat_count_parallel_heap_scan(scan->rs_base.rs_rd);
>
> Indentation seems broken.
>
>
My bad, sorry. Fixed in the attached patch.


> Shouldn't the parallel counter rely on the
> "scan->rs_base.rs_flags & SO_TYPE_SEQSCAN" test too?
>
>
You're right. Fixed in the attached patch.


> What about getting rid of pgstat_count_parallel_heap_scan and adding an
> extra boolean parameter to pgstat_count_heap_scan to indicate whether
> counts.parallelnumscans should be incremented too?
>
> Something like:
>
> pgstat_count_heap_scan(scan->rs_base.rs_rd, scan->rs_base.rs_parallel !=
> NULL)
>
> 3 ===
>
> Same comment for pgstat_count_index_scan (add an extra boolean parameter)
> and get rid of pgstat_count_parallel_index_scan().
>
> I think that 2 === and 3 === would help to avoid missing increments should
> we add those calls to other places in the future.
>
>
Oh OK, understood.  Done for both.

4 ===
>
> +   if (lstats->counts.numscans || lstats->counts.parallelnumscans)
>
> Is it possible to have (lstats->counts.parallelnumscans) without having
> (lstats->counts.numscans)?
>
>
Nope, parallel scans are included in seq/index scans, as far as I can tell.
I could remove the parallelnumscans testing but it would be less obvious to
read.


> > First time I had to add new columns to a statistics catalog. I'm actually
> > not sure that we were right to change pg_proc.dat manually.
>
> I think that's the right way to do it.
>
>
OK, new patch attached.


> I don't see a CF entry for this patch. Would you mind creating one so that
> we don't lose track of it?
>
>
I don't mind adding it, though I don't know if I should add it to the
September or November commit fest. Which one should I choose?

Thanks.

Regards.


-- 
Guillaume.
From 6a202b7bd44cf33be13a8f7e0a8dc7077604c3c0 Mon Sep 17 00:00:00 2001
From: Guillaume Lelarge 
Date: Wed, 28 Aug 2024 21:35:30 +0200
Subject: [PATCH v2] Add parallel columns for pg_stat_all_tables,indexes

pg_stat_all_tables gets 4 new columns: parallel_seq_scan,
last_parallel_seq_scan, parallel_idx_scan, last_parallel_idx_scan.

pg_stat_all_indexes gets 2 new columns: parallel_idx_scan,
last_parallel_idx_scan.
---
 doc/src/sgml/monitoring.sgml | 69 ++--
 src/backend/access/brin/brin.c   |  2 +-
 src/backend/access/gin/ginscan.c |  2 +-
 src/backend/access/gist/gistget.c|  4 +-
 src/backend/access/hash/hashsearch.c |  2 +-
 src/backend/access/heap/heapam.c |  2 +-
 src/backend/access/nbtree/nbtsearch.c|  2 +-
 src/backend/access/spgist/spgscan.c  |  2 +-
 src/backend/catalog/system_views.sql |  6 ++
 src/backend/utils/activity/pgstat_relation.c |  7 +-
 src/backend/utils/adt/pgstatfuncs.c  |  6 ++
 src/include/catalog/pg_proc.dat  |  8 +++
 src/include/pgstat.h | 23 +--
 13 files changed, 113 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..6886094095 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3773,7 +3773,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
seq_scan bigint
   
   
-   Number of sequential scans initiated on this table
+   Number of sequential scans (including parall

Re: Optimize mul_var() for var1ndigits >= 8

2024-09-04 Thread Joel Jacobson
On Wed, Sep 4, 2024, at 09:22, Dean Rasheed wrote:
> On Tue, 3 Sept 2024 at 21:31, Tom Lane  wrote:
>>
>> Dean Rasheed  writes:
>> > Ah, OK. I've pushed a fix.
>>
>> There is an open CF entry pointing at this thread [1].
>> Shouldn't it be marked committed now?
>>
>
> Oops, yes I missed that CF entry. I've closed it now.
>
> Joel, are you still planning to work on the Karatsuba multiplication
> patch? If not, we should close that CF entry too.

No, I think it's probably not worth it given that we have now optimised
mul_var() in other ways. I will maybe have a look at it again in the future.
Patch withdrawn for now.

Thanks for the really good guidance and help on numeric; it was much fun and
I've learned a lot.

/Joel




Re: thread-safety: strerror_r()

2024-09-04 Thread Peter Eisentraut

On 02.09.24 21:56, Tom Lane wrote:

Peter Eisentraut  writes:

I think we can apply these patches now to check this off the list of
not-thread-safe functions to check.


+1 for the first patch.  I'm less happy with

-   static char errbuf[36];
+   static char errbuf[128];

As a minor point, shouldn't this be

+   static char errbuf[PG_STRERROR_R_BUFLEN];

But the bigger issue is that the use of a static buffer makes
this not thread-safe, so having it use strerror_r to fill that
buffer is just putting lipstick on a pig.  If we really want
to make this thread-ready, we need to adopt the approach used
in libpq's fe-secure-openssl.c, where callers have to free the
buffer later.  Or maybe we could just palloc the result, and
trust that it's not in a long-lived context?
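
A sketch of that palloc approach (illustrative only; pg_strerror_r() and
PG_STRERROR_R_BUFLEN are the existing wrappers in src/port):

static char *
strerror_palloc(int errnum)
{
	char		buf[PG_STRERROR_R_BUFLEN];

	/* copy the message out of the stack buffer into the current context */
	return pstrdup(pg_strerror_r(errnum, buf, sizeof(buf)));
}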


Ok, I have committed the first patch.  We can think about the best 
solution for the second issue when we get to the business end of all 
this.  The current situation doesn't really prevent making any progress 
on that.






Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Bertrand Drouvot
Hi,

On Wed, Sep 04, 2024 at 02:51:51PM +0200, Guillaume Lelarge wrote:
> On Wed, Sep 4, 2024 at 10:47, Bertrand Drouvot wrote:
> 
> > I don't see a CF entry for this patch. Would you mind creating one so that
> > we don't lose track of it?
> >
> >
> I don't mind adding it, though I don't know if I should add it to the
> September or November commit fest. Which one should I choose?

Thanks! That should be the November one (as the September one already started).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Aleksander Alekseev
Hi Amit,

> > I don't think you can rely on a system clock for conflict resolution.
> > In a corner case a DBA can move the clock forward or backward between
> > recordings of Ts1 and Ts2. On top of that there is no guarantee that
> > 2+ servers have synchronised clocks. It seems to me that what you are
> > proposing will just hide the problem instead of solving it in the
> > general case.
> >
>
> It is possible that we can't rely on the system clock for conflict
> resolution but that is not the specific point of this thread. As
> mentioned in the subject of this thread, the problem is "Commit
> Timestamp and LSN Inversion issue". The LSN value and timestamp for a
> commit are not generated atomically, so two different transactions can
> have them in different order.

Hm. Then I'm having difficulties understanding why this is a
problem, and why it was necessary to mention CDR in this context in the
first place.

OK, let's forget about CDR completely. Who is affected by the current
behavior and why would it be beneficial changing it?

-- 
Best regards,
Aleksander Alekseev




Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-04 Thread Ashutosh Bapat
On Wed, Sep 4, 2024 at 11:34 AM David G. Johnston wrote:

>
>
> Agree on sticking with “The time…”
>
> Thus I suggest either:
>
> The time when the slot became inactive.

+1. Definitely removes confusion caused by "since" and keeps The time
as subject.




-- 
Best Wishes,
Ashutosh Bapat




Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-09-04 Thread Daniel Gustafsson
> On 3 Sep 2024, at 14:18, Daniel Gustafsson  wrote:

> Attached is a v4 rebase over the recent OpenSSL 1.0.2 removal which made this
> patch no longer apply.  I've just started to dig into it so have no
> comments on it right now, but wanted to get a cleaned up version into the
> CFBot.

CFBot building green for this, I just have a few small questions/comments:

+   my_bio_index |= BIO_TYPE_SOURCE_SINK;

According to the OpenSSL docs we should set BIO_TYPE_DESCRIPTOR as well,
since this BIO is socket based, but it's not entirely clear to me why.  Is
there a specific reason it was removed?
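
For reference, setting both flags would look roughly like this (an
illustrative sketch against the OpenSSL 1.1.0+ API, not the patch):

static BIO_METHOD *
make_socket_bio_method(void)
{
	int			idx = BIO_get_new_index();

	if (idx == -1)
		return NULL;
	/* a socket BIO is both a source/sink and descriptor based */
	return BIO_meth_new(idx | BIO_TYPE_SOURCE_SINK | BIO_TYPE_DESCRIPTOR,
						"illustrative socket BIO");
}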


+   bio_method = port_bio_method();
if (bio_method == NULL)
{
SSLerr(SSL_F_SSL_SET_FD, ERR_R_BUF_LIB);

SSL_F_SSL_SET_FD is no longer the correct function context for this error
reporting.  In OpenSSL 3.x it means nothing since SSLerr throws away the
function when calling ERR_raise_data, but we still support 1.1.0+.  I think the
correct error would be BIOerr(BIO_F_BIO_METH_NEW..) but I wonder if we should
just remove it since BIO_meth_new and BIO_new already set errors for us to
consume?  It doesn't seem to make sense to add more errors on the queue here?
The same goes for the frontend part.

The attached v5 is a fresh rebase with my comments from above as 0002 to
illustrate.

--
Daniel Gustafsson



v5-0001-Avoid-mixing-custom-and-OpenSSL-BIO-functions.patch
Description: Binary data


v5-0002-Review-comments.patch
Description: Binary data



Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Guillaume Lelarge
Le mer. 4 sept. 2024 à 14:58, Bertrand Drouvot 
a écrit :

> Hi,
>
> On Wed, Sep 04, 2024 at 02:51:51PM +0200, Guillaume Lelarge wrote:
> > Le mer. 4 sept. 2024 à 10:47, Bertrand Drouvot <
> bertranddrouvot...@gmail.com>
> > a écrit :
> >
> > > I don't see a CF entry for this patch. Would you mind creating one so
> that
> > > we don't lost track of it?
> > >
> > >
> > I don't mind adding it, though I don't know if I should add it to the
> > September or November commit fest. Which one should I choose?
>
> Thanks! That should be the November one (as the September one already
> started).
>
>
I should have gone to the commit fest website; it says the same. I had the
recollection that it started on the 15th. Anyway, added to the November
commit fest (https://commitfest.postgresql.org/50/5238/).


-- 
Guillaume.


Re: Improving the latch handling between logical replication launcher and worker processes.

2024-09-04 Thread Heikki Linnakangas

On 04/09/2024 14:24, vignesh C wrote:
> On Wed, 4 Sept 2024 at 08:32, Kyotaro Horiguchi  wrote:
>> I think this approach could solve the issue without adding
>> complexity. What do you think?
>
> I agree that this approach is more simple than the other approach. How
> about something like the attached patch to handle the same.


I haven't looked at these new patches from the last few days, but please 
also note the work at 
https://www.postgresql.org/message-id/476672e7-62f1-4cab-a822-f3a8e949dd3f%40iki.fi. 
If those "interrupts" patches are committed, this is pretty 
straightforward to fix by using a separate interrupt bit for this, as 
the patch on that thread does.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Index AM API cleanup

2024-09-04 Thread Alexandra Wang
On Thu, Aug 22, 2024 at 11:28 AM Mark Dilger
 wrote:
> > On Aug 22, 2024, at 1:36 AM, Alexandra Wang  
> > wrote:
> > "make installcheck" for treeb is causing issues on my end. I can
> > investigate further if it’s not a problem for others.
>
> The test module index AMs are not intended for use in any installed database, 
> so 'make installcheck' is unnecessary.  A mere 'make check' should suffice.  
> However, if you want to run it, you can install the modules, edit 
> postgresql.conf to add 'treeb' to shared_preload_libraries, restart the 
> server, and run 'make installcheck'.   This is necessary for 'treeb' because 
> it requests shared memory, and that needs to be done at startup.

Thanks, Mark. This works, and I can run treeb now.

> The v18 patch set includes the changes your patches suggest, though I 
> modified the approach a bit.  Specifically, rather than standardizing on 
> '1.0.0' for the module versions, as your patches do, I went with '1.0', as is 
> standard in other modules in neighboring directories.  The '1.0.0' numbering 
> was something I had been using in versions 1..16 of this patch, and I only 
> partially converted to '1.0' before posting v17, so sorry about that.  The 
> v18 patch also has some whitespace fixes.
>
> To address your comments about the noise in the test failures, v18 modifies 
> the clone_tests.pl script to do a little more work translating the expected 
> output to expect the module's AM name ("xash", "xtree", "treeb", or whatnot) 
> beyond what that script did in v17.

Thanks! I see fewer failures now.

There are a few occurrences of the following error when I run "make
check" in the xtree module:

+ERROR:  bogus RowCompare index qualification

I needed to specify `amroutine->amcancrosscompare = true;` in xtree.c
to eliminate them, as below:

diff --git a/src/test/modules/xtree/access/xtree.c
b/src/test/modules/xtree/access/xtree.c
index bd472edb04..960966801d 100644
--- a/src/test/modules/xtree/access/xtree.c
+++ b/src/test/modules/xtree/access/xtree.c
@@ -33,6 +33,7 @@ xtree_indexam_handler(PG_FUNCTION_ARGS)
amroutine->amoptsprocnum = BTOPTIONS_PROC;
amroutine->amcanorder = true;
amroutine->amcanhash = false;
+   amroutine->amcancrosscompare = true;
amroutine->amcanorderbyop = false;
amroutine->amcanbackward = true;
amroutine->amcanunique = true;

After adding that, the regression.diffs for the xtree and treeb
modules are identical in content, with only plan diffs remaining. I
think this change should be either part of
v18-0008-Generalize-hash-and-ordering-support-in-amapi.patch or a
separate commit, similar to
v18-0009-Adjust-treeb-to-use-amcanhash-and-amcancrosscomp.patch.

Best,
Alex




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Bertrand Drouvot
Hi,

On Wed, Sep 04, 2024 at 02:51:51PM +0200, Guillaume Lelarge wrote:
> Hi,
> 
> Le mer. 4 sept. 2024 à 10:47, Bertrand Drouvot 
> a écrit :
> 
> > What about to get rid of the pgstat_count_parallel_heap_scan and add an
> > extra
> > boolean parameter to pgstat_count_heap_scan to indicate if
> > counts.parallelnumscans
> > should be incremented too?
> >
> > Something like:
> >
> > pgstat_count_heap_scan(scan->rs_base.rs_rd, scan->rs_base.rs_parallel !=
> > NULL)
> >
> > 3 ===
> >
> > Same comment for pgstat_count_index_scan (add an extra boolean parameter)
> > and
> > get rid of pgstat_count_parallel_index_scan()).
> >
> > I think that 2 === and 3 === would help to avoid missing increments should
> > we
> > add those calls to other places in the future.
> >
> >
> Oh OK, understood.  Done for both.

Thanks for v2!

1 ===

-#define pgstat_count_heap_scan(rel)
+#define pgstat_count_heap_scan(rel, parallel)
do {
-   if (pgstat_should_count_relation(rel))
-   (rel)->pgstat_info->counts.numscans++;
+if (pgstat_should_count_relation(rel)) {
+   if (!parallel)
+   (rel)->pgstat_info->counts.numscans++;
+   else
+   (rel)->pgstat_info->counts.parallelnumscans++;
+   }

I think counts.numscans has to be incremented in all the cases (so even if
"parallel" is true).

Same comment for pgstat_count_index_scan().
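
To illustrate, a minimal sketch of what I mean (not tested; numscans is
counted unconditionally, parallelnumscans on top of it):

#define pgstat_count_heap_scan(rel, parallel)                      \
	do {                                                           \
		if (pgstat_should_count_relation(rel))                     \
		{                                                          \
			(rel)->pgstat_info->counts.numscans++;                 \
			if (parallel)                                          \
				(rel)->pgstat_info->counts.parallelnumscans++;     \
		}                                                          \
	} while (0)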

> 4 ===
> >
> > +   if (lstats->counts.numscans || lstats->counts.parallelnumscans)
> >
> > Is it possible to have (lstats->counts.parallelnumscans) without having
> > (lstats->counts.numscans) ?
> >
> >
> Nope, parallel scans are included in seq/index scans, as far as I can tell.
> I could remove the parallelnumscans testing but it would be less obvious to
> read.

2 ===

What about adding a comment instead of this extra check?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread Matthias van de Meent
On Tue, 3 Sept 2024 at 18:20, Tomas Vondra  wrote:
> FWIW the actual cost is somewhat higher, because we seem to need ~400B
> for every lock (not just the 150B for the LOCK struct).

We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
used[^1]. Does that align with your usage numbers, or are they
significantly larger?

> At least based on a quick experiment. (Seems a bit high, right?).

Yeah, that does seem high, thanks for nerd-sniping me.

The 152 bytes of LOCK are mostly due to a combination of two
MAX_LOCKMODES-sized int[]s that are used to keep track of the number
of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
keep using int[]s, but that doesn't mean we can't improve this size:

ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
NoLock, which is never used or counted in these shared structures, and
the max lock mode supported by any of the supported lock methods is
AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
reducing size of the LOCK struct by 16 bytes.

If some struct- and field packing is OK, then we could further reduce
the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
type from int to int16 (we only use the first MaxLockMode (8) + 1
bits), and then storing the grant/waitMask fields (now 4 bytes total)
in the padding that's present at the end of the waitProcs struct. This
would depend on dclist not writing in its padding space, but I
couldn't find any user that did so, and most critically dclist_init
doesn't scribble in the padding with memset.

If these are both implemented, it would save 24 bytes, reducing the
struct to 128 bytes. :) [^2]
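
For illustration, the trimmed struct could look roughly like this (a
sketch only -- the exact field layout and the padding trick for the
masks would need verifying against lock.h):

typedef int16 LOCKMASK;			/* was: int */
#define MAX_LOCKMODES		8	/* was: 10 */

typedef struct LOCK
{
	/* hash key */
	LOCKTAG		tag;			/* unique identifier of lockable object */

	/* data */
	dlist_head	procLocks;		/* list of PROCLOCK objects assoc. with lock */
	dclist_head	waitProcs;		/* waiting PGPROCs; grantMask and waitMask
								 * (2 bytes each) would live in its trailing
								 * padding, per the above */
	int			requested[MAX_LOCKMODES];	/* counts of requested locks */
	int			nRequested;		/* total of requested[] array */
	int			granted[MAX_LOCKMODES];		/* counts of granted locks */
	int			nGranted;		/* total of granted[] array */
} LOCK;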

I also checked PROCLOCK: If it is worth further packing the struct, we
should probably look at whether it's worth replacing the PGPROC* typed
fields with ProcNumber -based ones, potentially in both PROCLOCK and
PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
these fields being replaced with ProcNumber would allow a reduction in
size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
smallest I could get it to without ignoring type alignments.

Further shmem savings can be achieved by reducing the "10% safety
margin" added at the end of LockShmemSize, as I'm fairly sure the
memory used in shared hashmaps doesn't exceed the estimated amount,
and if it did then we should probably fix that part, rather than
requesting that (up to) 10% overhead here.

Altogether that'd save 40 bytes/lock entry on size, and ~35
bytes/lock on "safety margin", for a saving of (up to) 19% of our
current allocation. I'm not sure whether these tricks would help
performance or even hurt it, apart from smaller structs usually
fitting better in CPU caches.


Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[^1] NLOCKENTS() benefits from being a power of 2, or slightly below
one, as it's rounded up to a power of 2 when dynahash decides its
number of buckets to allocate.
[^2] Sadly this 2-cachelines alignment is lost due to dynahash's
HASHELEMENT prefix of elements. :(




Re: Refactoring postmaster's code to cleanup after child exit

2024-09-04 Thread Andres Freund
Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> While rebasing this today, I spotted another instance of that mistake
> mentioned in the XXX comment above. I called "CountChildren(B_BACKEND)"
> instead of "CountChildren(1 << B_BACKEND)". Some ideas on how to make that
> less error-prone:
> 
> 1. Add a separate typedef for the bitmasks, and macros/functions to work
> with it. Something like:
> 
> typedef struct {
>   uint32  mask;
> } BackendTypeMask;
> 
> static const BackendTypeMask BTMASK_ALL = { 0x };
> static const BackendTypeMask BTMASK_NONE = { 0 };
> 
> static inline BackendTypeMask
> BTMASK_ADD(BackendTypeMask mask, BackendType t)
> {
>   mask.mask |= 1 << t;
>   return mask;
> }
> 
> static inline BackendTypeMask
> BTMASK_DEL(BackendTypeMask mask, BackendType t)
> {
>   mask.mask &= ~(1 << t);
>   return mask;
> }
> 
> Now the compiler will complain if you try to pass a BackendType for the
> mask. We could do this just for BackendType, or we could put this in
> src/include/lib/ with a more generic name, like "bitmask_u32".

I don't like the second suggestion - that just ends up creating a similar
problem in the future because flag values for one thing can be passed to
something else.
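
The BackendType-specific version would at least make call sites read
clearly, something like (hypothetical):

	/* hypothetical call site with the proposed BackendTypeMask API */
	CountChildren(BTMASK_ADD(BTMASK_NONE, B_BACKEND));

	/* and the "all child types" case stays simple */
	CountChildren(BTMASK_ALL);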



> +Running the tests
> +=
> +
> +NOTE: You must have given the --enable-tap-tests argument to configure.
> +
> +Run
> +make check
> +or
> +make installcheck
> +You can use "make installcheck" if you previously did "make install".
> +In that case, the code in the installation tree is tested.  With
> +"make check", a temporary installation tree is built from the current
> +sources and then tested.
> +
> +Either way, this test initializes, starts, and stops a test Postgres
> +cluster.
> +
> +See src/test/perl/README for more info about running these tests.

Is it really useful to have such instructions all over the tree?


> From 93b9e9b6e072f63af9009e0d66ab6d0d62ea8c15 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Aug 2024 10:55:11 +0300
> Subject: [PATCH v4 2/8] Add test for dead-end backends
> 
> The code path for launching a dead-end backend because we're out of
> slots was not covered by any tests, so add one. (Some tests did hit
> the case of launching a dead-end backend because the server is still
> starting up, though, so the gap in our test coverage wasn't as big as
> it sounds.)
> ---
>  src/test/perl/PostgreSQL/Test/Cluster.pm  | 39 +++
>  .../postmaster/t/001_connection_limits.pl | 17 +++-
>  2 files changed, 55 insertions(+), 1 deletion(-)

Why does this need to use "raw" connections?  Can't you just create a bunch of
connections with BackgroundPsql?




> From 88287a2db95e584018f1c7fa9e992feb7d179ce8 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Mon, 12 Aug 2024 10:58:35 +0300
> Subject: [PATCH v4 3/8] Use an shmem_exit callback to remove backend from
>  PMChildFlags on exit
> 
> This seems nicer than having to duplicate the logic between
> InitProcess() and ProcKill() for which child processes have a
> PMChildFlags slot.
> 
> Move the MarkPostmasterChildActive() call earlier in InitProcess(),
> out of the section protected by the spinlock.

> ---
>  src/backend/storage/ipc/pmsignal.c | 10 ++--
>  src/backend/storage/lmgr/proc.c| 38 ++
>  src/include/storage/pmsignal.h |  1 -
>  3 files changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/src/backend/storage/ipc/pmsignal.c 
> b/src/backend/storage/ipc/pmsignal.c
> index 27844b46a2..cb99e77476 100644
> --- a/src/backend/storage/ipc/pmsignal.c
> +++ b/src/backend/storage/ipc/pmsignal.c
> @@ -24,6 +24,7 @@
>  #include "miscadmin.h"
>  #include "postmaster/postmaster.h"
>  #include "replication/walsender.h"
> +#include "storage/ipc.h"
>  #include "storage/pmsignal.h"
>  #include "storage/shmem.h"
>  #include "utils/memutils.h"
> @@ -121,6 +122,8 @@ postmaster_death_handler(SIGNAL_ARGS)
>  
>  #endif   /* 
> USE_POSTMASTER_DEATH_SIGNAL */
>  
> +static void MarkPostmasterChildInactive(int code, Datum arg);
> +
>  /*
>   * PMSignalShmemSize
>   *   Compute space needed for pmsignal.c's shared memory
> @@ -328,6 +331,9 @@ MarkPostmasterChildActive(void)
>   slot--;
>   Assert(PMSignalState->PMChildFlags[slot] == PM_CHILD_ASSIGNED);
>   PMSignalState->PMChildFlags[slot] = PM_CHILD_ACTIVE;
> +
> + /* Arrange to clean up at exit. */
> + on_shmem_exit(MarkPostmasterChildInactive, 0);
>  }
>  
>  /*
> @@ -352,8 +358,8 @@ MarkPostmasterChildWalSender(void)
>   * MarkPostmasterChildInactive - mark a postmaster child as done using
>   * shared memory.  This is called in the child process.
>   */
> -void
> -MarkPostmasterChildInactive(void)
> +static void
> +MarkPostmasterChildInactive(int code, Datum arg)
>  {
>   int slot = MyPMChildSlot;
>  
> diff --git a/src/backend/storage/lmg

Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Guillaume Lelarge
Hi,

Le mer. 4 sept. 2024 à 16:18, Bertrand Drouvot 
a écrit :

> Hi,
>
> On Wed, Sep 04, 2024 at 02:51:51PM +0200, Guillaume Lelarge wrote:
> > Hi,
> >
> > Le mer. 4 sept. 2024 à 10:47, Bertrand Drouvot <
> bertranddrouvot...@gmail.com>
> > a écrit :
> >
> > > What about to get rid of the pgstat_count_parallel_heap_scan and add an
> > > extra
> > > boolean parameter to pgstat_count_heap_scan to indicate if
> > > counts.parallelnumscans
> > > should be incremented too?
> > >
> > > Something like:
> > >
> > > pgstat_count_heap_scan(scan->rs_base.rs_rd, scan->rs_base.rs_parallel
> !=
> > > NULL)
> > >
> > > 3 ===
> > >
> > > Same comment for pgstat_count_index_scan (add an extra boolean
> parameter)
> > > and
> > > get rid of pgstat_count_parallel_index_scan()).
> > >
> > > I think that 2 === and 3 === would help to avoid missing increments
> should
> > > we
> > > add those calls to other places in the future.
> > >
> > >
> > Oh OK, understood.  Done for both.
>
> Thanks for v2!
>
> 1 ===
>
> -#define pgstat_count_heap_scan(rel)
> +#define pgstat_count_heap_scan(rel, parallel)
> do {
> -   if (pgstat_should_count_relation(rel))
> -   (rel)->pgstat_info->counts.numscans++;
> +if (pgstat_should_count_relation(rel)) {
> +   if (!parallel)
> +   (rel)->pgstat_info->counts.numscans++;
> +   else
> +
>  (rel)->pgstat_info->counts.parallelnumscans++;
> +   }
>
> I think counts.numscans has to be incremented in all the cases (so even if
> "parallel" is true).
>
> Same comment for pgstat_count_index_scan().
>
>
You're right, and I've been too quick. Fixed in v3.


> > 4 ===
> > >
> > > +   if (lstats->counts.numscans || lstats->counts.parallelnumscans)
> > >
> > > Is it possible to have (lstats->counts.parallelnumscans) without
> having
> > > (lstats->counts.numscans) ?
> > >
> > >
> > Nope, parallel scans are included in seq/index scans, as far as I can
> tell.
> > I could remove the parallelnumscans testing but it would be less obvious
> to
> > read.
>
> 2 ===
>
> What about adding a comment instead of this extra check?
>
>
Done too in v3.


-- 
Guillaume.
From 97a95650cd220c1b88ab6f3d36149b8860bead1d Mon Sep 17 00:00:00 2001
From: Guillaume Lelarge 
Date: Wed, 28 Aug 2024 21:35:30 +0200
Subject: [PATCH v3] Add parallel columns for pg_stat_all_tables,indexes

pg_stat_all_tables gets 4 new columns: parallel_seq_scan,
last_parallel_seq_scan, parallel_idx_scan, last_parallel_idx_scan.

pg_stat_all_indexes gets 2 new columns: parallel_idx_scan,
last_parallel_idx_scan.
---
 doc/src/sgml/monitoring.sgml | 69 ++--
 src/backend/access/brin/brin.c   |  2 +-
 src/backend/access/gin/ginscan.c |  2 +-
 src/backend/access/gist/gistget.c|  4 +-
 src/backend/access/hash/hashsearch.c |  2 +-
 src/backend/access/heap/heapam.c |  2 +-
 src/backend/access/nbtree/nbtsearch.c|  2 +-
 src/backend/access/spgist/spgscan.c  |  2 +-
 src/backend/catalog/system_views.sql |  6 ++
 src/backend/utils/activity/pgstat_relation.c | 10 ++-
 src/backend/utils/adt/pgstatfuncs.c  |  6 ++
 src/include/catalog/pg_proc.dat  |  8 +++
 src/include/pgstat.h | 17 +++--
 13 files changed, 113 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..6886094095 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3773,7 +3773,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
seq_scan bigint
   
   
-   Number of sequential scans initiated on this table
+   Number of sequential scans (including parallel ones) initiated on this table
   
  
 
@@ -3782,7 +3782,26 @@ description | Waiting for a newly initialized WAL file to reach durable storage
last_seq_scan timestamp with time zone
   
   
-   The time of the last sequential scan on this table, based on the
+   The time of the last sequential scan (including parallel ones) on this table, based on the
+   most recent transaction stop time
+  
+ 
+
+ 
+  
+   parallel_seq_scan bigint
+  
+  
+   Number of parallel sequential scans initiated on this table
+  
+ 
+
+ 
+  
+   last_parallel_seq_scan timestamp with time zone
+  
+  
+   The time of the last parallel sequential scan on this table, based on the
most recent transaction stop time
   
  
@@ -3801,7 +3820,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
idx_scan bigint
   
   
-   Number of index scans initiated on this table
+   Number of index scans (including parallel ones) initiated on this table
   
  
 
@@ -3810,7 +3829,26 @@ description

Re: Optimize WindowAgg's use of tuplestores

2024-09-04 Thread David Rowley
On Mon, 19 Aug 2024 at 22:01, David Rowley  wrote:
> To try and move this forward again, I adjusted the patch to use a
> static function with pg_noinline rather than unlikely.  I don't think
> this will make much difference code generation wise, but I did think
> it was an improvement in code cleanliness. Patches attached.
>
> I did a round of benchmarking on an AMD Zen4 7945hx and on an Apple
> M2. I also graphed the results you sent so they're easier to compare
> with mine.
>
> 0001 is effectively the unlikely() patch for calculating the frame offsets.
> 0002 is the tuplestore_reset() patch

I was experimenting with this again.  The 0002 patch added a
next_partition field to the WindowAggState struct and caused the
struct to become slightly bigger.  I've now included a 0003 patch
which shifts some fields around in that struct so as to keep it the
same size as it is on master. Benchmarking with that removes that very
tiny performance regression.  Please see the attached CSV file for the
results. The percentage row compares master to all patches. I also
tested this on an AMD 3990x machine along with fresh results from the
AMD 7945hx laptop. Both of those machines come out faster on all tests
when comparing master to all 3 patches.  With the Apple M2, there does
not seem to be much change in performance with the tests containing
fewer rows per partition, some are faster, some are slower, all within
typical noise fluctuations.

Given the performance now seems improved in all cases, I plan on
pushing this change as a single commit.

David


v4-0001-Speedup-WindowAgg-code-by-moving-uncommon-code-ou.patch
Description: Binary data


v4-0002-Optimize-WindowAgg-s-use-of-tuplestores.patch
Description: Binary data


v4-0003-Experiment-with-WindowAggState-fields.patch
Description: Binary data
AMD 7945HX,,,
version,100,10,1,1000,100,10,1
master,300.7,201.4,182.2,180.1,179.7,180.8,185.9
v4-0001,295.6,189.4,176.4,172.2,172.2,173.4,180.9
v4-0001+0002,222.3,186.6,185.3,177.9,177.4,177.8,183.9
v4-0002,224.5,192.5,183.7,186.3,180.7,182.8,188
v4-0001+0002+0003,217.6,181.4,177.2,173.9,173.6,174.5,184.2
,138.20%,111.03%,102.80%,103.55%,103.50%,103.59%,100.96%
Apple M2,,,
version,100,10,1,1000,100,10,1
master,269.2,169.7,152.6,147.5,147.2,147.7,148.9
v4-0001,269.1,170.1,154.5,149,147.7,148.9,150.7
v4-0001+0002,193.1,164.6,154.3,149,148.2,149,149.9
v4-0002,192,165,157.1,151.6,150.9,150.9,152.8
v4-0001+0002+0003,187.7,162.1,153.2,148.2,147,147.5,150.4
,143.40%,104.68%,99.59%,99.48%,100.14%,100.13%,98.95%
,,,
AMD 3990x,,,
version,100,10,1,1000,100,10,1
master,570.8,354.1,327.4,320,320.7,322.2,343.7
v4-0001,561.6,353.4,327.4,320.6,321.1,323.3,343
v4-0001+0002,401.2,339.1,327.6,322.8,323,324.1,344.5
v4-0002,409.1,341.3,330.5,324.5,325.1,327.7,351
v4-0001+0002+0003,403.3,336,324.1,319.7,320.6,320.6,342.5
,141.55%,105.39%,101.02%,100.09%,100.02%,100.49%,100.37%


Re: using extended statistics to improve join estimates

2024-09-04 Thread Andrei Lepikhov

On 3/9/2024 14:58, Andrei Lepikhov wrote:
> On 17/6/2024 18:10, Tomas Vondra wrote:
>> x = $1 AND y = $2 AND ...
> As I see it, the current patch doesn't resolve this issue.

Let me explain my previous argument with an example (see the attachment).

The query is designed to be executed with a parameterised NL join:

EXPLAIN (ANALYZE, TIMING OFF)
SELECT * FROM test t1 NATURAL JOIN test1 t2 WHERE t2.x1 < 1;

After applying the topmost patch from the patchset we can see two 
different estimations (explain tuned a little bit) before and after 
extended statistics:


-- before:

 Nested Loop  (rows=1) (actual rows=1 loops=1)
   ->  Seq Scan on test1 t2  (rows=100) (actual rows=100 loops=1)
 Filter: (x1 < 1)
   ->  Memoize  (rows=1) (actual rows=100 loops=100)
 Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
 ->  Index Scan using test_x1_x2_x3_x4_idx on test t1
 (rows=1 width=404) (actual rows=100 loops=1)
   Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
   (x3 = t2.x3) AND (x4 = t2.x4))

-- after:

 Nested Loop  (rows=1) (actual rows=1 loops=1)
   ->  Seq Scan on test1 t2  (rows=100) (actual rows=100 loops=1)
 Filter: (x1 < 1)
   ->  Memoize  (rows=1) (actual rows=100 loops=100)
 Cache Key: t2.x1, t2.x2, t2.x3, t2.x4
 ->  Index Scan using test_x1_x2_x3_x4_idx on test t1  (rows=1)
 (actual rows=100 loops=1)
   Index Cond: ((x1 = t2.x1) AND (x2 = t2.x2) AND
   (x3 = t2.x3) AND (x4 = t2.x4))

You can see that the index condition was treated as a join clause and
the PNL was estimated correctly using an MCV on both sides.

But the scan estimation is incorrect.
Moreover, sometimes we don't have an MCV at all, so the next step for
this patch should be implementing a bare estimate using only ndistinct
on each side.


What to do with the scan filter? Not sure so far, but it looks like
logic similar to var_eq_non_const() could be used here.



--
regards, Andrei Lepikhov


example.sql
Description: application/sql


Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread David Rowley
On Wed, 4 Sept 2024 at 03:06, Robert Haas  wrote:
>
> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra  wrote:
> > But say we add a GUC and set it to -1 by default, in which case it just
> > inherits the max_locks_per_transaction value. And then also provide some
> > basic metric about this fast-path cache, so that people can tune this?
>
> All things being equal, I would prefer not to add another GUC for
> this, but we might need it.

I think driving the array size from max_locks_per_transaction is a
good idea (rounded up to the next multiple of 16?). If someone comes
along one day and shows us a compelling case where some backend needs
more than its fair share of locks and performance is bad because of
that, then maybe we can consider adding a GUC then. Certainly, it's
much easier to add a GUC later if someone convinces us that it's a
good idea than it is to add it now and try to take it away in the
future if we realise it's not useful enough to keep.
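
A sketch of that sizing rule (hypothetical variable name):

	/* derive per-backend fast-path capacity from the GUC, rounded up
	 * to the next multiple of 16 */
	fp_lock_slots = ((max_locks_per_transaction + 15) / 16) * 16;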

David




Re: Parallel workers stats in pg_stat_database

2024-09-04 Thread Benoit Lobréau
On 9/4/24 08:46, Bertrand Drouvot wrote:
> What about moving the tests to places where it's "guaranteed" to get
> parallel workers involved? For example, a "parallel_maint_workers" only test
> could be done in vacuum_parallel.sql.


Thank you! I was too focused on the stats part and missed the obvious.
It's indeed better with this file.

Which led me to discover that the place I chose to gather my stats
(parallel_vacuum_end) is wrong: it only catches workers allocated for
parallel_vacuum_cleanup_all_indexes() and not
parallel_vacuum_bulkdel_all_indexes().


Back to the drawing board...

--
Benoit Lobréau
Consultant
http://dalibo.com




Re: race condition in pg_class

2024-09-04 Thread Nitin Motiani
On Wed, Sep 4, 2024 at 2:53 AM Noah Misch  wrote:
>
>
> > So this also pulls the invalidation to the genam.c
> > layer. Am I understanding this correctly?
>
> Compared to the v9 patch, the "call both" alternative would just move the
> systable_endscan() call to a new systable_inplace_update_end().  It wouldn't
> move anything across the genam:heapam boundary.
> systable_inplace_update_finish() would remain a thin wrapper around a heapam
> function.
>

Thanks for the clarification.

> > > > > - Hoist the CacheInvalidateHeapTuple() up to the genam.c layer.  While
> > > > >   tolerable now, this gets less attractive after the inplace160 patch 
> > > > > from
> > > > >   https://postgr.es/m/flat/20240523000548.58.nmi...@google.com
> > > > >
> > > > I skimmed through the inplace160 patch. It wasn't clear to me why this
> > > > becomes less attractive with the patch. I see there is a new
> > > > CacheInvalidateHeapTupleInPlace but that looks like it would be called
> > > > while holding the lock. And then there is an
> > > > AcceptInvalidationMessages which can perhaps be moved to the genam.c
> > > > layer too. Is the concern that one invalidation call will be in the
> > > > heapam layer and the other will be in the genam layer?
> > >
> > > That, or a critical section would start in heapam.c, then end in genam.c.
> > > Current call tree at inplace160 v4:
>
> > How about this alternative?
> >
> >  genam.c:systable_inplace_update_finish
> >PreInplace_Inval
> >START_CRIT_SECTION
> >heapam.c:heap_inplace_update
> >BUFFER_LOCK_UNLOCK
> >AtInplace_Inval
> >END_CRIT_SECTION
> >UnlockTuple
> >AcceptInvalidationMessages
>
> > Looking at inplace160, it seems that the start of the critical section
> > is right after PreInplace_Inval. So why not pull START_CRIT_SECTION
> > and END_CRIT_SECTION out to the genam.c layer?
>
> heap_inplace_update() has an elog(ERROR) that needs to happen outside any
> critical section.  Since the condition for that elog deals with tuple header
> internals, it belongs at the heapam layer more than the systable layer.
>

Understood. How about this alternative then? The tuple length check
and the elog(ERROR) get their own function, something like
heap_inplace_update_validate or
heap_inplace_update_validate_tuple_length. In that case, it would
look like this:

  genam.c:systable_inplace_update_finish
heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
PreInplace_Inval
START_CRIT_SECTION
heapam.c:heap_inplace_update
BUFFER_LOCK_UNLOCK
AtInplace_Inval
END_CRIT_SECTION
UnlockTuple
AcceptInvalidationMessages
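
The split-out pre-check would then have a shape like this (hypothetical
name and simplified condition; the real check compares tuple header
internals, as in heap_inplace_update()):

static void
heap_inplace_update_precheck(HeapTuple oldtup, HeapTuple newtup)
{
	/* must run outside any critical section */
	if (oldtup->t_len != newtup->t_len)
		elog(ERROR, "wrong tuple length");
}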

This is starting to get complicated though so I don't have any issues
with just renaming the heap_inplace_update to
heap_inplace_update_and_unlock.

> > Alternatively since
> > heap_inplace_update is commented as a subroutine of
> > systable_inplace_update_finish, should everything just be moved to the
> > genam.c layer? Although it looks like you already considered and
> > rejected this approach.
>
> Calling XLogInsert(RM_HEAP_ID) in genam.c would be a worse modularity
> violation than the one that led to the changes between v8 and v9.  I think
> even calling CacheInvalidateHeapTuple() in genam.c would be a worse modularity
> violation than the one attributed to v8.  Modularity would have the
> heap_inplace function resemble heap_update() handling of invals.
>

Understood. Thanks.

> > If the above alternatives are not possible, it's probably fine to go
> > ahead with the current patch with the function renamed to
> > heap_inplace_update_and_unlock (or something similar) as mentioned
> > earlier?
>
> I like that name.  The next version will use it.
>

So either we go with this or try the above approach of having a
separate function _validate/_precheck/_validate_tuple_length. I don't
have a strong opinion on either of these approaches.

> > > > I see that all the places we check resultRelInfo->ri_needLockTagTuple,
> > > > we can just call
> > > > IsInplaceUpdateRelation(resultRelInfo->ri_RelationDesc). Is there a
> > > > big advantage of storing a separate bool field? Also there is another
> > >
> > > No, not a big advantage.  I felt it was more in line with the typical 
> > > style of
> > > src/backend/executor.
>
> > just find it a smell that there are two fields which are not
> > independent and they go out of sync. And that's why my preference is
> > to not have a dependent field unless there is a specific advantage.
>
> Got it.  This check happens for every tuple of every UPDATE, so performance
> may be a factor.  Some designs and their merits:
>

Thanks. If performance is a factor, it makes sense to keep it.

>  a. ri_needLockTagTuple
> Performance: best: check one value for nonzero
> Drawback: one more value lifecycle to understand
> Drawback: users of ResultRelInfo w/o InitResultRelInfo() could miss this
>
>  b. call IsInplaceUpdateRelation
> Performance: worst: two extern function calls, then com

Re: Logging parallel worker draught

2024-09-04 Thread Benoit Lobréau
I found out in [1] that I am not correctly tracking the workers for 
vacuum commands. I trap workers used by 
parallel_vacuum_cleanup_all_indexes but not 
parallel_vacuum_bulkdel_all_indexes.


Back to the drawing board.

[1] 
https://www.postgresql.org/message-id/flat/783bc7f7-659a-42fa-99dd-ee0565644...@dalibo.com


--
Benoit Lobréau
Consultant
http://dalibo.com




Re: POC, WIP: OR-clause support for indexes

2024-09-04 Thread Alena Rybakina

Hi!

On 03.09.2024 12:52, Andrei Lepikhov wrote:
>> If OR constants have different types, then they belong to different
>> groups, and I think that's unfair. I think that conversion to a
>> single type should be used here - while I’m working on this, I’ll
>> send the code in the next letter.
> IMO, that means additional overhead, isn't it? It is an improvement
> and I suggest to discuss it in a separate thread if current feature
> will be applied.

I think we will have only a slight overhead: in essence it will go
additionally through the transformed groups, and not through the entire
list of OR expressions. I agree with your suggestion to discuss it in a
separate thread.

>> And I noticed that there were some tests missing on this, so I added
>> this.
>>
>> I've updated the patch file to include my and Jian's suggestions, as
>> well as the diff file if there's no objection.
> I doubt if you really need additional index on the tenk1 table. What
> is the case you can't reproduce with current indexes, playing, let's
> say, with casting to numeric and integer data types?
>
> See in attachment minor fixes to the v38 version of the patch set.

I rewrote the tests with integer types. Thanks for your suggestion. If
you don't mind, I've updated the diff file you attached earlier to
include the tests.
diff --git a/src/backend/optimizer/path/indxpath.c 
b/src/backend/optimizer/path/indxpath.c
index cde635d9cb6..d54462f0fce 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3345,10 +3345,8 @@ match_orclause_to_indexcol(PlannerInfo *root,
get_typlenbyvalalign(consttype, &typlen, &typbyval, &typalign);
 
elems = (Datum *) palloc(sizeof(Datum) * list_length(consts));
-   foreach(lc, consts)
+   foreach_node(Const, value, consts)
{
-   Const *value = (Const *) lfirst(lc);
-
Assert(!value->constisnull && value->constvalue);
 
elems[i++] = value->constvalue;
diff --git a/src/test/regress/expected/create_index.out 
b/src/test/regress/expected/create_index.out
index 5623f4b3123..103e135de9f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1890,52 +1890,45 @@ SELECT * FROM tenk1
Index Cond: ((thousand = 42) AND (tenthous = ANY 
('{1,3,42}'::integer[])))
 (8 rows)
 
-create index stringu1_idx on tenk1 (stringu1);
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (stringu1::text = 'MA' OR stringu1::text = 
'TU'::name OR stringu1 = 'OB'::name);
-  QUERY PLAN   


+  WHERE thousand = 42 AND (tenthous = 1::int2 OR tenthous::int2 = 3::int8 OR 
tenthous = 42::int8);
+ QUERY PLAN
  
+-
  Bitmap Heap Scan on tenk1
Recheck Cond: (thousand = 42)
-   Filter: (((stringu1)::text = 'MA'::text) OR ((stringu1)::text = 
'TU'::name) OR (stringu1 = 'OB'::name))
+   Filter: ((tenthous = '1'::smallint) OR ((tenthous)::smallint = '3'::bigint) 
OR (tenthous = '42'::bigint))
->  Bitmap Index Scan on tenk1_thous_tenthous
  Index Cond: (thousand = 42)
 (5 rows)
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (stringu1 = 'MA' OR stringu1 = 'TU' OR 
stringu1 = 'OB');
-QUERY PLAN 


+  WHERE thousand = 42 AND (tenthous = 1::int2 OR tenthous::int2 = 3::int8 OR 
tenthous::int2 = 42::int8);
+   QUERY PLAN  
  
+-
  Bitmap Heap Scan on tenk1
-   Recheck Cond: ((thousand = 42) AND ((stringu1 = 'MA'::name) OR 
(stringu1 = 'TU'::name) OR (stringu1 = 'OB'::name)))
-   ->  BitmapAnd
- ->  Bitmap Index Scan on tenk1_thous_tenthous
-   Index Cond: (thousand = 42)
- ->  Bitmap Index Scan on stringu1_idx
-   Index Cond: (stringu1 = ANY ('{MA,TU,OB}'::name[]))
-(7 rows)
+   Recheck Cond: (thousand = 42)
+   Filter: ((tenthous = '1'::smallint) OR ((tenthous)::smallint = '3'::bigint) 
OR ((tenthous)::smallint = '42'::bigint))
+   ->  Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (t

Re: Typos in the code and README

2024-09-04 Thread David Rowley
On Wed, 4 Sept 2024 at 20:24, Daniel Gustafsson  wrote:
> Not mandatory at all, but since you were prepping a typo backpatch anyways I
> figured these could join to put a small dent in reducing risks for future
> backports.

I think this is pretty good logic.  I think fixing comment typos in
ancient code and backpatching to all supported versions isn't a good use
of time, but fixing a typo in "recent" code and backpatching to where
that code was added seems useful. Newer code is more likely to need
bug fixes in the future, so going to a bit more effort to make
backpatching those bug fixes easier seems worth the effort.  I just
don't know what "recent" should be defined as. I'd say if it's in a
version we've not released yet, that's probably recent. By the time .1
is out, there's less chance of bugs in new code. Anyway, I doubt hard
guidelines are warranted here, but maybe some hints about best
practices in https://wiki.postgresql.org/wiki/Committing_checklist ?

David




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread Tomas Vondra
On 9/4/24 16:25, Matthias van de Meent wrote:
> On Tue, 3 Sept 2024 at 18:20, Tomas Vondra  wrote:
>> FWIW the actual cost is somewhat higher, because we seem to need ~400B
>> for every lock (not just the 150B for the LOCK struct).
> 
> We do indeed allocate two PROCLOCKs for every LOCK, and allocate those
> inside dynahash tables. That amounts to (152+2*64+3*16=) 328 bytes in
> dynahash elements, and (3 * 8-16) = 24-48 bytes for the dynahash
> buckets/segments, resulting in 352-376 bytes * NLOCKENTS() being
> used[^1]. Does that align with your usage numbers, or are they
> significantly larger?
> 

I see more like ~470B per lock. If I patch CalculateShmemSize to log the
shmem allocated, I get this:

  max_connections=100 max_locks_per_transaction=1000 => 194264001
  max_connections=100 max_locks_per_transaction=2000 => 241756967

and (((241756967-194264001)/100/1000)) = 474

Could be alignment of structs or something, not sure.

>> At least based on a quick experiment. (Seems a bit high, right?).
> 
> Yeah, that does seem high, thanks for nerd-sniping me.
> 
> The 152 bytes of LOCK are mostly due to a combination of two
> MAX_LOCKMODES-sized int[]s that are used to keep track of the number
> of requested/granted locks of each level. As MAX_LOCKMODES = 10, these
> arrays use a total of 2*4*10=80 bytes, with the remaining 72 spent on
> tracking. MAX_BACKENDS sadly doesn't fit in int16, so we'll have to
> keep using int[]s, but that doesn't mean we can't improve this size:
> 
> ISTM that MAX_LOCKMODES is 2 larger than it has to be: LOCKMODE=0 is
> NoLock, which is never used or counted in these shared structures, and
> the max lock mode supported by any of the supported lock methods is
> AccessExclusiveLock (8). We can thus reduce MAX_LOCKMODES to 8,
> reducing size of the LOCK struct by 16 bytes.
> 
> If some struct- and field packing is OK, then we could further reduce
> the size of LOCK by an additional 8 bytes by resizing the LOCKMASK
> type from int to int16 (we only use the first MaxLockMode (8) + 1
> bits), and then storing the grant/waitMask fields (now 4 bytes total)
> in the padding that's present at the end of the waitProcs struct. This
> would depend on dclist not writing in its padding space, but I
> couldn't find any user that did so, and most critically dclist_init
> doesn't scribble in the padding with memset.
> 
> If these are both implemented, it would save 24 bytes, reducing the
> struct to 128 bytes. :) [^2]
> 
> I also checked PROCLOCK: If it is worth further packing the struct, we
> should probably look at whether it's worth replacing the PGPROC* typed
> fields with ProcNumber -based ones, potentially in both PROCLOCK and
> PROCLOCKTAG. When combined with int16-typed LOCKMASKs, either one of
> these fields being replaced with ProcNumber would allow a reduction in
> size by one MAXALIGN quantum, reducing the struct to 56 bytes, the
> smallest I could get it to without ignoring type alignments.
> 
> Further shmem savings can be achieved by reducing the "10% safety
> margin" added at the end of LockShmemSize, as I'm fairly sure the
> memory used in shared hashmaps doesn't exceed the estimated amount,
> and if it did then we should probably fix that part, rather than
> requesting that (up to) 10% overhead here.
> 
> Altogether that'd save 40 bytes/lock entry on size, and ~35
> bytes/lock on "safety margin", for a saving of (up to) 19% of our
> current allocation. I'm not sure whether these tricks would help
> performance or even hurt it, apart from smaller structs usually
> fitting better in CPU caches.
> 

Not sure either, but it seems worth exploring. If you do an experimental
patch for the LOCK size reduction, I can get some numbers.

I'm not sure about the safety margins. 10% sure seems like quite a bit
of memory (it might not have been in the past, but as instances grow,
that has probably changed).


regards

-- 
Tomas Vondra




Re: Use streaming read API in ANALYZE

2024-09-04 Thread Robert Haas
On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro  wrote:
> Thanks for the explanation.  I think we should revert it.  IMHO it was
> a nice clean example of a streaming transformation, but unfortunately
> it transformed an API that nobody liked in the first place, and broke
> some weird and wonderful workarounds.  Let's try again in 18.

The problem I have with this is that we just released RC1. I suppose
if we have to make this change it's better to do it sooner than later,
but are we sure we want to whack this around this close to final
release?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-04 Thread Tomas Vondra




On 9/4/24 17:12, David Rowley wrote:
> On Wed, 4 Sept 2024 at 03:06, Robert Haas  wrote:
>>
>> On Mon, Sep 2, 2024 at 1:46 PM Tomas Vondra  wrote:
>>> But say we add a GUC and set it to -1 by default, in which case it just
>>> inherits the max_locks_per_transaction value. And then also provide some
>>> basic metric about this fast-path cache, so that people can tune this?
>>
>> All things being equal, I would prefer not to add another GUC for
>> this, but we might need it.
> 
> I think driving the array size from max_locks_per_transaction is a
> good idea (rounded up to the next multiple of 16?).

Maybe, although I was thinking we'd just use the regular doubling, to
get nice "round" numbers. It will likely overshoot a little bit (unless
people set the GUC to exactly 2^N), but I don't think that's a problem.
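
I.e. something like this (a sketch, assuming the existing
pg_nextpower2_32() helper):

	/* hypothetical sizing: round the fast-path array up to the next
	 * power of two, never below the current 16 slots */
	fp_lock_slots = Max(16, pg_nextpower2_32(max_locks_per_transaction));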

> If someone comes along one day and shows us a compelling case where
> some backend needs more than its fair share of locks and performance
> is bad because of that, then maybe we can consider adding a GUC then.
> Certainly, it's much easier to add a GUC later if someone convinces
> us that it's a good idea than it is to add it now and try to take it
> away in the future if we realise it's not useful enough to keep.
> 

Yeah, I agree with this.



regards

-- 
Tomas Vondra




Re: POC, WIP: OR-clause support for indexes

2024-09-04 Thread Alena Rybakina

On 04.09.2024 18:31, Alena Rybakina wrote:
> I rewrote the tests with integer types. Thanks for your suggestion. If
> you don't mind, I've updated the diff file you attached earlier to
> include the tests.
Sorry, I've just noticed that one of your changes to the regression
test wasn't included. I fixed it here.

diff --git a/src/backend/optimizer/path/indxpath.c
b/src/backend/optimizer/path/indxpath.c
index cde635d9cb6..d54462f0fce 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3345,10 +3345,8 @@ match_orclause_to_indexcol(PlannerInfo *root,
get_typlenbyvalalign(consttype, &typlen, &typbyval, &typalign);
 
elems = (Datum *) palloc(sizeof(Datum) * list_length(consts));
-   foreach(lc, consts)
+   foreach_node(Const, value, consts)
{
-   Const *value = (Const *) lfirst(lc);
-
Assert(!value->constisnull && value->constvalue);
 
elems[i++] = value->constvalue;
diff --git a/src/test/regress/expected/create_index.out 
b/src/test/regress/expected/create_index.out
index 5623f4b3123..dc6925e83f9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1877,7 +1877,7 @@ SELECT * FROM tenk1
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42 or 
tenthous is null);
+  WHERE thousand = 42 AND (tenthous = 1 OR tenthous = 3 OR tenthous = 42 OR 
tenthous IS NULL);
 QUERY PLAN 

 
---
  Bitmap Heap Scan on tenk1
@@ -1890,52 +1890,45 @@ SELECT * FROM tenk1
Index Cond: ((thousand = 42) AND (tenthous = ANY 
('{1,3,42}'::integer[])))
 (8 rows)
 
-create index stringu1_idx on tenk1 (stringu1);
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (stringu1::text = 'MA' OR stringu1::text = 
'TU'::name OR stringu1 = 'OB'::name);
-  QUERY PLAN   


+  WHERE thousand = 42 AND (tenthous = 1::int2 OR tenthous::int2 = 3::int8 OR 
tenthous = 42::int8);
+ QUERY PLAN
  
+-
  Bitmap Heap Scan on tenk1
Recheck Cond: (thousand = 42)
-   Filter: (((stringu1)::text = 'MA'::text) OR ((stringu1)::text = 
'TU'::name) OR (stringu1 = 'OB'::name))
+   Filter: ((tenthous = '1'::smallint) OR ((tenthous)::smallint = '3'::bigint) 
OR (tenthous = '42'::bigint))
->  Bitmap Index Scan on tenk1_thous_tenthous
  Index Cond: (thousand = 42)
 (5 rows)
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (stringu1 = 'MA' OR stringu1 = 'TU' OR 
stringu1 = 'OB');
-QUERY PLAN 


+  WHERE thousand = 42 AND (tenthous = 1::int2 OR tenthous::int2 = 3::int8 OR 
tenthous::int2 = 42::int8);
+   QUERY PLAN  
  
+-
  Bitmap Heap Scan on tenk1
-   Recheck Cond: ((thousand = 42) AND ((stringu1 = 'MA'::name) OR 
(stringu1 = 'TU'::name) OR (stringu1 = 'OB'::name)))
-   ->  BitmapAnd
- ->  Bitmap Index Scan on tenk1_thous_tenthous
-   Index Cond: (thousand = 42)
- ->  Bitmap Index Scan on stringu1_idx
-   Index Cond: (stringu1 = ANY ('{MA,TU,OB}'::name[]))
-(7 rows)
+   Recheck Cond: (thousand = 42)
+   Filter: ((tenthous = '1'::smallint) OR ((tenthous)::smallint = '3'::bigint) 
OR ((tenthous)::smallint = '42'::bigint))
+   ->  Bitmap Index Scan on tenk1_thous_tenthous
+ Index Cond: (thousand = 42)
+(5 rows)
 
 EXPLAIN (COSTS OFF)
 SELECT * FROM tenk1
-  WHERE thousand = 42 AND (stringu1 = 'MA' OR stringu1 = 'TU'::text OR 
stringu1 = 'OB'::text);
- QUERY PLAN
  

GetRelationPath() vs critical sections

2024-09-04 Thread Andres Freund
Hi,

In the AIO patchset there are cases where we have to LOG failures inside a
critical section. This is necessary because e.g. a buffer read might complete
while we are waiting for a WAL write inside a critical section.

We can't just defer the log message, as the IO might end up being
waited-on/completed-by another backend than the backend that issued the IO, so
we'd defer logging issues until an effectively arbitrary later time.

In general emitting a LOG inside a critical section isn't a huge issue - we
made sure that elog.c has a reserve of memory to be able to log without
crashing.

However, the current messages for buffer IO issues use relpath*() (ending up in
a call to GetRelationPath()). Which in turn uses psprintf() to generate the
path. Which in turn violates the no-memory-allocations-in-critical-sections
rule, as the containing memory context will typically not have
->allowInCritSection == true.

It's not obvious to me what the best way to deal with this is.

One idea I had was to add an errrelpath() that switches to
edata->assoc_context before calling relpath(), but that would end up leaking
memory, as FreeErrorDataContents() wouldn't know about the allocation.

Obviously we could add a version of GetRelationPath() that just prints into a
caller-provided buffer - but that's somewhat awkward API-wise.
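
Roughly this shape (hypothetical name; argument list mirroring
GetRelationPath()):

/* hypothetical: fills a caller-provided buffer, performs no
 * allocations, and is therefore safe inside a critical section */
void
GetRelationPathInto(char *buf, size_t buflen,
					Oid dbOid, Oid spcOid, RelFileNumber relNumber,
					int procNumber, ForkNumber forkNumber);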

A third approach would be to have a dedicated memory context for this kind of
thing that's reset after logging the message - but that comes awkwardly close
to duplicating ErrorContext.


I wonder if we're lacking a bit of infrastructure here...

Greetings,

Andres Freund




Re: GUC names in messages

2024-09-04 Thread Alexander Lakhin

Hello Peter,

[ sorry for the kind of off-topic ]

17.05.2024 14:57, Peter Eisentraut wrote:
> I committed your 0001 and 0002 now, with some small fixes.
>
> There has also been quite a bit of new code, of course, since you posted
> your patches, so we'll probably find a few more things that could use
> adjustment.
>
> I'd be happy to consider the rest of your patch set after beta1 and/or
> for PG18.


While translating messages, I've encountered a weird message, updated by
17974ec25:
    printf(_("(in \"wal_sync_method\" preference order, except fdatasync is Linux's default)\n"));

Does "except ..." make sense here, or is it just a copy-and-paste from the docs?:
    The default is the first method in the above list that is supported
    by the platform, except that fdatasync is the default on
    Linux and FreeBSD.

Best regards,
Alexander




Re: Test 041_checkpoint_at_promote.pl faild in installcheck due to missing injection_points

2024-09-04 Thread Maxim Orlov
On Wed, 4 Sept 2024 at 03:40, Michael Paquier  wrote:

> Any thoughts about the attached?  This makes installcheck work here
> with and without the configure switch.
>
>
Works for me with a configure build. A meson build, obviously, still needs
the extra "meson compile install-test-files" step, as expected. From your
patch, I see that you used a safe_psql call to check for the availability
of the injection_points extension. Are there some hidden, non-obvious
reasons for that? It's much simpler to check the output of pg_config, as
Álvaro suggested above, isn't it? And we don't need an active node for
this. Or am I missing something?

And one other thing I must mention. I don't know why, but my pg_config
from the meson build shows an empty configure output, despite the fact
that I pass the same options in both cases.

autotools:
$ ./pg_config --configure
'--enable-debug' '--enable-tap-tests' '--enable-depend' 

meson:
$ ./pg_config --configure

-- 
Best regards,
Maxim Orlov.


Avoid dead code (contrib/pg_visibility/pg_visibility.c)

2024-09-04 Thread Ranier Vilela
Hi.

Per coverity.

I think that commit c582b75 left an oversight.

The report is:
CID 1559993: (#1 of 1): Logically dead code (DEADCODE)

Trivial patch attached.

best regards,
Ranier Vilela


0001-avoid-dead-code-pg_visibility.patch
Description: Binary data


Avoid possible dereference null pointer (src/bin/pg_dump/pg_dump.c)

2024-09-04 Thread Ranier Vilela
Hi.

Per Coverity.

I think that commit 6ebeeae left an oversight.

The report is:
CID 1559991: (#1 of 1): Dereference null return value (NULL_RETURNS)

The function *findTypeByOid* can return NULL.
It is necessary to check the function's return,
as is already done in other parts of the source.

patch attached.

Best regards,
Ranier Vilela


0001-avoid-possible-dereference-null-pointer-pg_dump.patch
Description: Binary data


Re: Avoid possible dereference null pointer (src/bin/pg_dump/pg_dump.c)

2024-09-04 Thread Nathan Bossart
On Wed, Sep 04, 2024 at 02:10:28PM -0300, Ranier Vilela wrote:
> I think that commit 6ebeeae left an oversight.
> 
> The report is:
> CID 1559991: (#1 of 1): Dereference null return value (NULL_RETURNS)
> 
> The function *findTypeByOid* can return NULL.
> It is necessary to check the function's return,
> as is already done in other parts of the source.
> 
> patch attached.

Yeah, that looks like a problem to me.  I've cc'd Daniel here.

-- 
nathan




Re: AIO v2.0

2024-09-04 Thread 陈宗志
I hope there can be a high-level design document that includes a
description, high-level architecture, and low-level design.
This way, others can also participate in reviewing the code.
For example, which paths were modified in the AIO module? Is it the
path for writing WAL logs, or the path for flushing pages, etc.?

Also, I recommend keeping this patch as small as possible.
For example, the first step could be to introduce libaio only, without
considering io_uring, as that would make it too complex.




Fix possible resource leaks (src/backend/replication/logical/conflict.c)

2024-09-04 Thread Ranier Vilela
Hi.

Per Coverity.

Commit 9758174 added the source file
src/backend/replication/logical/conflict.c.
The function *errdetail_apply_conflict* reports potential conflicts,
but it does not take care to avoid possible resource leaks.
However, the leaked size can be considerable, since the path can be
reached with log level LOG.
The function *ReportSlotInvalidation* has a similar purpose but, by
contrast, is careful not to leak.

IMO, these potential leaks need to be fixed.

Patch attached.

best regards,
Ranier Vilela


0001-fix-possible-resources-leaks-conflict.patch
Description: Binary data


Re: gamma() and lgamma() functions

2024-09-04 Thread Dean Rasheed
On Fri, 23 Aug 2024 at 13:40, Peter Eisentraut  wrote:
>
> What are examples of where this would be useful in a database context?

gamma() and lgamma() are the kinds of functions that are generally
useful for a variety of tasks like statistical analysis and
combinatorial computations, and having them in the database allows
those sort of computations to be performed without the need to export
the data to an external tool. We discussed adding them in a thread
last year [1], and there has been at least a little prior interest
[2].

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGKJAcB8Q5qziKTTSnkA4Mnv_6f%2B7-_XUgbh9jFjSdEFQg%40mail.gmail.com
[2] 
https://stackoverflow.com/questions/58884066/how-can-i-run-the-equivalent-of-excels-gammaln-function-in-postgres

Of course, there's a somewhat fuzzy line between what is generally
useful enough, and what is too specialised for core Postgres, but I
would argue that these qualify, since they are part of C99, and
commonly appear in other general-purpose math libraries like the
Python math module.

> > The error-handling for these functions seems to be a little trickier
> > than most, so that might need further discussion.
>
> Yeah, this is quite something.
>
> I'm not sure why you are doing the testing for special values (NaN etc.)
> yourself when the C library function already does it.  For example, if I
> remove the isnan(arg1) check in your dlgamma(), then it still behaves
> the same in all tests.

It's useful to do that so that we don't need to assume that every
platform conforms to the POSIX standard, and because it simplifies the
later overflow checks. This is consistent with the approach taken in
other functions, such as dexp(), dsin(), dcos(), etc.
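
A cut-down sketch of that pattern (illustrative only; the full version
with the overflow handling is in the attached patch):

Datum
dlgamma(PG_FUNCTION_ARGS)
{
	float8		arg1 = PG_GETARG_FLOAT8(0);
	float8		result;

	/* handle NaN explicitly rather than assuming POSIX behaviour */
	if (isnan(arg1))
		PG_RETURN_FLOAT8(arg1);

	errno = 0;
	result = lgamma(arg1);

	/* overflow and special-case checks go here, as discussed below */

	PG_RETURN_FLOAT8(result);
}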

> The overflow checks after the C library call are written
> differently for the two functions.  dgamma() does not check errno for
> ERANGE for example.  It might also be good if dgamma() checked errno for
> EDOM, because other the result of gamma(-1) being "overflow" seems a bit
> wrong.

They're intentionally different because the functions themselves are
different. In this case:

select gamma(-1);
ERROR:  value out of range: overflow

it is correct to throw an error, because gamma(-1) is undefined (it
goes to -Inf as x goes to -1 from above, and +Inf as x goes to -1 from
below, so there is no well-defined limit).

I've updated the patch to give a more specific error message for
negative integer inputs, as opposed to other overflow cases.

Relying on errno being ERANGE or EDOM doesn't seem possible though,
because the docs indicate that, while its behaviour is one thing
today, per POSIX, that will change in the future.

By contrast, lgamma() does not raise an error for such inputs:

select lgamma(-1);
  lgamma
--
 Infinity

This is correct because lgamma() is the log of the absolute value of
the gamma function, so the limit is +Inf as x goes to -1 from both
sides.

> I'm also not clear why you are turning a genuine result overflow into
> infinity in lgamma().

OK, I've changed that to only return infinity if the input is
infinite, zero, or a negative integer. Otherwise, it now throws an
overflow error.

> Btw., I'm reading that lgamma() in the C library is not necessarily
> thread-safe.  Is that a possible problem?

It's not quite clear what to do about that. Some platforms may define
the lgamma_r() function, but not all. Do we have a standard pattern
for dealing with functions for which there is no thread-safe
alternative?

Regards,
Dean
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 461fc3f..b2a3368
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1399,6 +1399,27 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FRO
   

 
+ gamma
+
+gamma ( double precision )
+double precision
+   
+   
+Gamma function
+   
+   
+gamma(0.5)
+1.772453850905516
+   
+   
+gamma(6)
+120
+   
+  
+
+  
+   
+
  gcd
 
 gcd ( numeric_type, numeric_type )
@@ -1436,6 +1457,23 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FRO

   
 
+  
+   
+
+ lgamma
+
+lgamma ( double precision )
+double precision
+   
+   
+Natural logarithm of the absolute value of the gamma function
+   
+   
+lgamma(1000)
+5905.220423209181
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
new file mode 100644
index f709c21..4694ea2
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2787,6 +2787,97 @@ derfc(PG_FUNCTION_ARGS)
 }
 
 
+/* == GAMMA FUNCTIONS == */
+
+
+/*
+ *		dgamma			- returns the gamma function of arg1
+ */
+Datum
+dgamma(PG_FUNCTION_ARGS)
+{
+	float8		arg1 = PG_GETARG_FLOAT8(0);
+	float8		result;
+
+	/*
+	 * Per the POSIX spec, 

Re: Fix possible resource leaks (src/backend/replication/logical/conflict.c)

2024-09-04 Thread Tom Lane
Ranier Vilela  writes:
> Per Coverity.

Coverity is never to be trusted about "leaks" in the backend,
because it has no idea about short- versus long-lived memory
contexts.

> The function *errdetail_apply_conflict* reports potential conflicts.
> But do not care about possible resource leaks.
> However, the leaked size can be considerable, since it can have logs with
> the LOG level.
> The function *ReportSlotInvalidation* has similar utility, but on the
> contrary, be careful not to leak.
> IMO, these potential leaks need to be fixed.

This is nonsense.  If there is a problem here, then we also have
leaks to worry about in the immediate caller ReportApplyConflict,
not to mention its callers.  The correct solution is to be sure that
the whole thing happens in a short-lived context, which I believe
is true --- looks like it should be running in ApplyMessageContext,
which will be reset after each replication message.

(I'm inclined to suspect that that pfree in ReportSlotInvalidation
is equally useless, but I didn't track down its call sites.)

regards, tom lane




Avoid overflowed array index (src/backend/utils/activity/pgstat.c)

2024-09-04 Thread Ranier Vilela
Hi.

Per Coverity.

Commit 7949d95 left an oversight.

The report is:
CID 1559468: (#1 of 1): Overflowed array index read (INTEGER_OVERFLOW)

I think that Coverity is right.
In the function *pgstat_read_statsfile*, it is necessary to check the
most restrictive case first.

Otherwise, if PgStat_Kind is greater than 11, a negative index may occur.

Patch attached.

best regards,
Ranier Vilela


0001-avoid-overflowed-array-index-pgstat.patch
Description: Binary data


Re: gamma() and lgamma() functions

2024-09-04 Thread Tom Lane
Dean Rasheed  writes:
> On Fri, 23 Aug 2024 at 13:40, Peter Eisentraut  wrote:
>> What are examples of where this would be useful in a database context?

> Of course, there's a somewhat fuzzy line between what is generally
> useful enough, and what is too specialised for core Postgres, but I
> would argue that these qualify, since they are part of C99, and
> commonly appear in other general-purpose math libraries like the
> Python math module.

Yeah, I think any math function that's part of C99 or POSIX is
arguably of general interest.

>> I'm not sure why you are doing the testing for special values (NaN etc.)
>> yourself when the C library function already does it.  For example, if I
>> remove the isnan(arg1) check in your dlgamma(), then it still behaves
>> the same in all tests.

> It's useful to do that so that we don't need to assume that every
> platform conforms to the POSIX standard, and because it simplifies the
> later overflow checks. This is consistent with the approach taken in
> other functions, such as dexp(), dsin(), dcos(), etc.

dexp() and those other functions date from the late stone age, before
it was safe to assume that libm's behavior matched the POSIX specs.
Today I think we can assume that, at least till proven differently.
There's not necessarily anything wrong with what you've coded, but
I don't buy this argument for it.

>> Btw., I'm reading that lgamma() in the C library is not necessarily
>> thread-safe.  Is that a possible problem?

> It's not quite clear what to do about that.

Per the Linux man page, the reason lgamma() isn't thread-safe is

   The lgamma(), lgammaf(), and lgammal()  functions  return  the  natural
   logarithm of the absolute value of the Gamma function.  The sign of the
   Gamma function is returned in the external integer signgam declared  in
   .  It is 1 when the Gamma function is positive or zero, -1 when
   it is negative.

AFAICS this patch doesn't inspect signgam, so whether it gets
overwritten by a concurrent thread wouldn't matter.  However,
it'd be a good idea to add a comment noting the hazard.

(Presumably, the reason POSIX says "need not be thread-safe"
is that an implementation could make it thread-safe by
making signgam thread-local, but the standard doesn't wish
to mandate that.)

regards, tom lane




Re: gamma() and lgamma() functions

2024-09-04 Thread Tom Lane
I wrote:
> AFAICS this patch doesn't inspect signgam, so whether it gets
> overwritten by a concurrent thread wouldn't matter.  However,
> it'd be a good idea to add a comment noting the hazard.

Further to that ... I looked at POSIX issue 8 (I had been reading 7)
and found this illuminating discussion:

Earlier versions of this standard did not require lgamma(),
lgammaf(), and lgammal() to be thread-safe because signgam was a
global variable. They are now required to be thread-safe to align
with the ISO C standard (which, since the introduction of threads
in 2011, requires that they avoid data races), with the exception
that they need not avoid data races when storing a value in the
signgam variable. Since signgam is not specified by the ISO C
standard, this exception is not a conflict with that standard.

So the other reason to avoid using signgam is that it might
not exist at all in some libraries.

regards, tom lane




Re: Avoid possible dereference null pointer (src/bin/pg_dump/pg_dump.c)

2024-09-04 Thread Daniel Gustafsson
> On 4 Sep 2024, at 19:30, Nathan Bossart  wrote:
> 
> On Wed, Sep 04, 2024 at 02:10:28PM -0300, Ranier Vilela wrote:
>> I think that commit 6ebeeae left an oversight.
>> 
>> The report is:
>> CID 1559991: (#1 of 1): Dereference null return value (NULL_RETURNS)
>> 
>> The function *findTypeByOid* can return NULL.
>> It is necessary to check the function's return,
>> as is already done in other parts of the source.
>> 
>> patch attached.
> 
> Yeah, that looks like a problem to me.  I've cc'd Daniel here.

Thanks for the report, it does seem genuine to me too.  I'll get that handled
later today.

--
Daniel Gustafsson





Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-04 Thread Nathan Bossart
On Tue, Sep 03, 2024 at 12:35:42PM -0400, Jonathan S. Katz wrote:
> However, I ran into the issue in[1], where pg_index was identified as
> catalog that is missing a toast table, even though `indexprs` is marked for
> extended storage. Providing a very simple reproducer in psql below:

Thanks to commit 96cdeae, only a few catalogs remain that are missing TOAST
tables: pg_attribute, pg_class, pg_index, pg_largeobject, and
pg_largeobject_metadata.  I've attached a short patch to add one for
pg_index, which resolves the issue cited here.  This passes "check-world"
and didn't fail for a few ad hoc tests (e.g., VACUUM FULL on pg_index).  I
haven't spent too much time investigating possible circularity issues, but
I'll note that none of the system indexes presently use the indexprs and
indpred columns.

If we do want to proceed with adding a TOAST table to pg_index, IMHO it
would be better to do it sooner than later so that it has plenty of time to
bake.
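
The set of affected catalogs is easy to inspect from SQL; the following
sketch approximates what the misc_sanity test in the patch below checks
(varlena columns in catalogs that have no TOAST table):

    SELECT c.relname, a.attname, t.typname
    FROM pg_class c
    JOIN pg_attribute a ON a.attrelid = c.oid
    JOIN pg_type t ON t.oid = a.atttypid
    WHERE c.relnamespace = 'pg_catalog'::regnamespace
      AND c.relkind = 'r'
      AND c.reltoastrelid = 0
      AND a.attnum > 0
      AND NOT a.attisdropped
      AND a.attlen = -1
    ORDER BY 1, 2;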

-- 
nathan
From 301aeeb346c3499e4555e4db1f4fc124f9cbab62 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 4 Sep 2024 13:28:27 -0500
Subject: [PATCH v1 1/1] add toast table to pg_index

---
 src/include/catalog/pg_index.h| 2 ++
 src/test/regress/expected/misc_sanity.out | 4 +---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/catalog/pg_index.h b/src/include/catalog/pg_index.h
index 3462572eb5..6c89639a9e 100644
--- a/src/include/catalog/pg_index.h
+++ b/src/include/catalog/pg_index.h
@@ -69,6 +69,8 @@ CATALOG(pg_index,2610,IndexRelationId) BKI_SCHEMA_MACRO
  */
 typedef FormData_pg_index *Form_pg_index;
 
+DECLARE_TOAST_WITH_MACRO(pg_index, 8149, 8150, PgIndexToastTable, PgIndexToastIndex);
+
 DECLARE_INDEX(pg_index_indrelid_index, 2678, IndexIndrelidIndexId, pg_index, btree(indrelid oid_ops));
 DECLARE_UNIQUE_INDEX_PKEY(pg_index_indexrelid_index, 2679, IndexRelidIndexId, pg_index, btree(indexrelid oid_ops));
 
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index ad88cbd5c4..2152c65810 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -56,11 +56,9 @@ ORDER BY 1, 2;
  pg_class| relacl| aclitem[]
  pg_class| reloptions| text[]
  pg_class| relpartbound  | pg_node_tree
- pg_index| indexprs  | pg_node_tree
- pg_index| indpred   | pg_node_tree
  pg_largeobject  | data  | bytea
  pg_largeobject_metadata | lomacl| aclitem[]
-(11 rows)
+(9 rows)
 
 -- system catalogs without primary keys
 --
-- 
2.39.3 (Apple Git-146)



Re: Use read streams in pg_visibility

2024-09-04 Thread Noah Misch
On Tue, Sep 03, 2024 at 10:46:34PM +0300, Nazir Bilal Yavuz wrote:
> On Tue, 3 Sept 2024 at 22:20, Noah Misch  wrote:
> > On Tue, Sep 03, 2024 at 10:50:11AM -0700, Noah Misch wrote:
> > > Pushed with some other cosmetic changes.
> 
> Thanks!
> 
> > I see I pushed something unacceptable under ASAN.  I will look into that:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-09-03%2017%3A47%3A20
> 
> I think it is related to the scope of BlockRangeReadStreamPrivate in
> the collect_visibility_data() function. Attached a small fix for that.

Right.

https://postgr.es/m/caeudqaozv3wty5tv2t29jcwpydbmkbiwqkzd42s2ogzdixp...@mail.gmail.com
then observed that collect_corrupt_items() was now guaranteed to never detect
corruption.  I have pushed ddfc556, reverting the pg_visibility.c changes.  For
the next try, could you add that testing we discussed?




Re: GetRelationPath() vs critical sections

2024-09-04 Thread Noah Misch
On Wed, Sep 04, 2024 at 11:58:33AM -0400, Andres Freund wrote:
> In general emitting a LOG inside a critical section isn't a huge issue - we
> made sure that elog.c has a reserve of memory to be able to log without
> crashing.
> 
> However, the current message for buffer IO issues use relpath*() (ending up in
> a call to GetRelationPath()). Which in turn uses psprintf() to generate the
> path. Which in turn violates the no-memory-allocations-in-critical-sections
> rule, as the containing memory context will typically not have
> ->allowInCritSection == true.
> 
> It's not obvious to me what the best way to deal with this is.
> 
> One idea I had was to add an errrelpath() that switches to
> edata->assoc_context before calling relpath(), but that would end up leaking
> memory, as FreeErrorDataContents() wouldn't know about the allocation.
> 
> Obviously we could add a version of GetRelationPath() that just prints into a
> caller provided buffer - but that's somewhat awkward API wise.

Agreed.

> A third approach would be to have a dedicated memory context for this kind of
> thing that's reset after logging the message - but that comes awkwardly close
> to duplicating ErrorContext.

That's how I'd try to do it.  Today's ErrorContext is the context for
allocations FreeErrorDataContents() knows how to find.  The new context would
be for calling into arbitrary code unknown to FreeErrorDataContents().  Most
of the time, we'd avoid reset work for the new context, since it would have no
allocations.

Ideally, errstart() would switch to the new context before returning true, and
errfinish() would switch back.  That way, you could just call relpath*()
without an errrelpath() to help.  This does need functions called in ereport()
argument lists to not use CurrentMemoryContext for allocations that need to
survive longer.  I'd not be concerned about imposing that in a major release.
What obstacles would arise if we did that?

> I wonder if we're lacking a bit of infrastructure here...

Conceptually, the ereport() argument list should be a closure that runs in a
suitable mcxt.  I think we're not far from the goal.




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-04 Thread Tom Lane
Nathan Bossart  writes:
> Thanks to commit 96cdeae, only a few catalogs remain that are missing TOAST
> tables: pg_attribute, pg_class, pg_index, pg_largeobject, and
> pg_largeobject_metadata.  I've attached a short patch to add one for
> pg_index, which resolves the issue cited here.  This passes "check-world"
> and didn't fail for a few ad hoc tests (e.g., VACUUM FULL on pg_index).  I
> haven't spent too much time investigating possible circularity issues, but
> I'll note that none of the system indexes presently use the indexprs and
> indpred columns.

Yeah, the possibility of circularity seems like the main hazard, but
I agree it's unlikely that the entries for system indexes could ever
need out-of-line storage.  There are many other things that would have
to be improved before a system index could use indexprs or indpred.

regards, tom lane




Re: Large expressions in indexes can't be stored (non-TOASTable)

2024-09-04 Thread Jonathan S. Katz

On 9/4/24 3:08 PM, Tom Lane wrote:

Nathan Bossart  writes:

Thanks to commit 96cdeae, only a few catalogs remain that are missing TOAST
tables: pg_attribute, pg_class, pg_index, pg_largeobject, and
pg_largeobject_metadata.  I've attached a short patch to add one for
pg_index, which resolves the issue cited here.  This passes "check-world"
and didn't fail for a few ad hoc tests (e.g., VACUUM FULL on pg_index).  I
haven't spent too much time investigating possible circularity issues, but
I'll note that none of the system indexes presently use the indexprs and
indpred columns.


Yeah, the possibility of circularity seems like the main hazard, but
I agree it's unlikely that the entries for system indexes could ever
need out-of-line storage.  There are many other things that would have
to be improved before a system index could use indexprs or indpred.


Agreed on the unlikeliness of that, certainly in the short-to-mid term. 
The impetus driving this is dealing with a data type that can be quite 
large, and it's unlikely system catalogs will be dealing with anything 
of that nature, or requiring very long expressions that couldn't be 
encapsulated in a different way.


Just to be fair, in the case I presented there's an argument that what 
I'm trying to do is fairly inefficient for an expression, given I'm 
passing around an additional several KB payload into the query. However, 
we'd likely have to do that anyway for this problem space, and the 
overall performance hit is negligible compared to the search relevancy 
boost.


I'm working on a much more robust test, but using a known 10MM 768-dim 
dataset and two C-based quantization functions (one using the 
expression), I got a 3% relevancy boost with a 2% reduction in latency 
and throughput. On some other known datasets, I was able to improve 
relevancy 40% or more, though given they were initially returning with 
0% relevancy in some cases, it's not fair to compare performance numbers.


There are other ways to solve the problem as well, but allowing for the 
larger expression gives users more choices in how they can approach it.


Jonathan


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: race condition in pg_class

2024-09-04 Thread Noah Misch
On Wed, Sep 04, 2024 at 09:00:32PM +0530, Nitin Motiani wrote:
> How about this alternative then? The tuple length check
> and the elog(ERROR) gets its own function. Something like
> heap_inplace_update_validate or
> heap_inplace_update_validate_tuple_length. So in that case, it would
> look like this :
> 
>   genam.c:systable_inplace_update_finish
> heapam.c:heap_inplace_update_validate/heap_inplace_update_precheck
> PreInplace_Inval
> START_CRIT_SECTION
> heapam.c:heap_inplace_update
> BUFFER_LOCK_UNLOCK
> AtInplace_Inval
> END_CRIT_SECTION
> UnlockTuple
> AcceptInvalidationMessages
> 
> This is starting to get complicated though so I don't have any issues
> with just renaming the heap_inplace_update to
> heap_inplace_update_and_unlock.

Complexity aside, I don't see the _precheck design qualifying as a modularity
improvement.

>  Assert(rel->ri_needsLockTagTuple == IsInplaceUpdateRelation(rel->relationDesc))
> 
> This can safeguard against users of ResultRelInfo missing this field.

v10 does the rename and adds that assertion.  This question remains open:

On Thu, Aug 22, 2024 at 12:32:00AM -0700, Noah Misch wrote:
> On Tue, Aug 20, 2024 at 11:59:45AM +0300, Heikki Linnakangas wrote:
> > How many of those for RELKIND_INDEX vs tables? I'm thinking if we should
> > always require a tuple lock on indexes, if that would make a difference.
> 
> Three sites.  See attached inplace125 patch.  Is it a net improvement?  If so,
> I'll squash it into inplace120.

If nobody has an opinion, I'll discard inplace125.  I feel it's not a net
improvement, but either way is fine with me.
Author: Noah Misch 
Commit: Noah Misch 

Warn if LOCKTAG_TUPLE is held at commit, under debug_assertions.

The current use always releases this locktag.  A planned use will
continue that intent.  It will involve more areas of code, making unlock
omissions easier.  Warn under debug_assertions, like we do for various
resource leaks.  Back-patch to v12 (all supported versions), the plan
for the commit of the new use.

Reviewed by Heikki Linnakangas.

Discussion: https://postgr.es/m/20240512232923.aa.nmi...@google.com

diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 83b99a9..51d52c4 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -2255,6 +2255,16 @@ LockReleaseAll(LOCKMETHODID lockmethodid, bool allLocks)
locallock->numLockOwners = 0;
}
 
+#ifdef USE_ASSERT_CHECKING
+
+   /*
+	 * Tuple locks are currently held only for short durations within a
+	 * transaction. Check that we didn't forget to release one.
+*/
+   if (LOCALLOCK_LOCKTAG(*locallock) == LOCKTAG_TUPLE && !allLocks)
+   elog(WARNING, "tuple lock held at commit");
+#endif
+
/*
		 * If the lock or proclock pointers are NULL, this lock was taken via
		 * the relation fast-path (and is not known to have been transferred).
Author: Noah Misch 
Commit: Noah Misch 

Fix data loss at inplace update after heap_update().

As previously-added tests demonstrated, heap_inplace_update() could
instead update an unrelated tuple of the same catalog.  It could lose
the update.  Losing relhasindex=t was a source of index corruption.
Inplace-updating commands like VACUUM will now wait for heap_update()
commands like GRANT TABLE and GRANT DATABASE.  That isn't ideal, but a
long-running GRANT already hurts VACUUM progress more just by keeping an
XID running.  The VACUUM will behave like a DELETE or UPDATE waiting for
the uncommitted change.

For implementation details, start at the systable_inplace_update_begin()
header comment and README.tuplock.  Back-patch to v12 (all supported
versions).  In back branches, retain a deprecated heap_inplace_update(),
for extensions.

Reviewed by Heikki Linnakangas, Nitin Motiani and Alexander Lakhin.

Discussion: 
https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bg2mgqecerqr5gg...@mail.gmail.com

diff --git a/src/backend/access/heap/README.tuplock b/src/backend/access/heap/README.tuplock
index 6441e8b..ddb2def 100644
--- a/src/backend/access/heap/README.tuplock
+++ b/src/backend/access/heap/README.tuplock
@@ -153,3 +153,14 @@ The following infomask bits are applicable:
 
 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
+
+Reading inplace-updated columns
+---
+
+Inplace updates create an exception to the rule that tuple data won't change
+under a reader holding a pin.  A reader of a heap_fetch() result tuple may
+witness a torn read.  Current inplace-updated fields are aligned and are no
+wider than four bytes, and current readers don't need consistency across
+fields.  Hence

Re: json_query conditional wrapper bug

2024-09-04 Thread Andrew Dunstan



On 2024-09-04 We 6:16 AM, Peter Eisentraut wrote:

On 28.08.24 11:21, Peter Eisentraut wrote:

These are ok:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' without 
wrapper);

 json_query
------------
 42

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with 
unconditional wrapper);

 json_query
------------
 [42]

But this appears to be wrong:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with 
conditional wrapper);

 json_query
------------
 [42]

This should return an unwrapped 42.


If I make the code change illustrated in the attached patch, then I 
get the correct result here.  And various regression test results 
change, which, to me, all look more correct after this patch.  I don't 
know what the code I removed was supposed to accomplish, but it seems 
to be wrong somehow.  In the current implementation, the WITH 
CONDITIONAL WRAPPER clause doesn't appear to work correctly in any 
case I could identify.



Agree the code definitely looks wrong. If anything the test should 
probably be reversed:


    wrap = count > 1 ||
        !(IsAJsonbScalar(singleton) ||
          (singleton->type == jbvBinary &&
           JsonContainerIsScalar(singleton->val.binary.data)));

i.e. in the count = 1 case wrap unless it's a scalar or a binary 
wrapping a scalar. The code could do with a comment about the logic.
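
To spell out the semantics that gives us (a sketch; both results assume
the corrected logic):

    -- a single scalar result is not wrapped under CONDITIONAL
    select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional wrapper);
    -- expected: 42

    -- multiple results must be wrapped to form a single value
    select json_query('{"a": [1, 2]}'::jsonb, 'lax $.a[*]' with conditional wrapper);
    -- expected: [1, 2]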


I know we're very close to release but we should fix this as it's a new 
feature.



cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: psql: fix variable existence tab completion

2024-09-04 Thread Tom Lane
"Anton A. Melnikov"  writes:
> On 19.07.2024 01:10, Tom Lane wrote:
>> With respect to the other hacks Alexander mentions, maybe we
>> could clean some of those out too?  I don't recall what platform
>> we had in mind there, but we've moved our goalposts on what
>> we support pretty far in the last couple years.

> Agreed that no reason to save workarounds for non-supported systems.
> Here is the patch that removes fixes for Buster bug mentioned above.

Pushed.  I shall now watch the buildfarm from a safe distance.

regards, tom lane




Re: GetRelationPath() vs critical sections

2024-09-04 Thread Thomas Munro
On Thu, Sep 5, 2024 at 3:58 AM Andres Freund  wrote:
> Obviously we could add a version of GetRelationPath() that just prints into a
> caller provided buffer - but that's somewhat awkward API wise.

For the record, that's exactly what I did in the patch I proposed to
fix our long standing RelationTruncate() data-eating bug:

https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B5nfWcpnZ%3DZ%3DUpGvY1tTF%3D4QU_0U_07EFaKmH7Nr%2BNLQ%40mail.gmail.com#aa061db119ee7a4b5390af56e24f475d




PostgreSQL 17 release announcement draft

2024-09-04 Thread Jonathan S. Katz

Hi,

Attached is the draft of the PostgreSQL 17 release announcement. This is 
a draft of the text that will go into the press kit, with the key 
portions to review starting from the top of the document, up until the 
"About PostgreSQL" section.


Please provide feedback on content accuracy, notable omissions or items 
that should be excluded, or if an explanation is unclear and needs 
better phrasing. On the last point, I'm looking to ensure the wording is 
clear and is easy to translate into different languages.


Based on feedback, I'll be posting a revision once a day (if there's 
feedback) until the review cut-off. We'll have to freeze the 
announcement by Mon, Sep 9 @ 12:00 UTC so we can begin the translation 
process.


Thank you for your help with the release process!

Jonathan
September 26, 2024 - The PostgreSQL Global Development Group today announced the
release of [PostgreSQL 17](https://www.postgresql.org/docs/17/release-17.html),
the latest version of the world's most advanced open source database.

PostgreSQL 17 builds on decades of open source development, improving its 
performance and scalability while adapting to emergent data access and storage 
patterns. This release of PostgreSQL adds significant overall performance 
gains, including an overhauled memory management implementation for vacuum, 
optimizations to storage access and improvements for high concurrency 
workloads, speedups in bulk loading and exports, and query execution 
improvements for indexes. PostgreSQL 17 has features that benefit brand new 
workloads and critical systems alike, such as additions to the developer 
experience with the SQL/JSON `JSON_TABLE` command, and enhancements to logical 
replication that simplify management of high availability workloads and major 
version upgrades.



PostgreSQL, an innovative data management system known for its reliability,
robustness, and extensibility, benefits from over 25 years of open source
development from a global developer community and has become the preferred open
source relational database for organizations of all sizes.

### System-wide performance gains

A foundational feature of PostgreSQL is 
[vacuum](https://www.postgresql.org/docs/17/routine-vacuuming.html), which is 
used to reclaim storage from data that was marked as removed. Reducing 
resources required for vacuuming directly helps other areas of PostgreSQL, 
particularly on very busy systems. PostgreSQL 17 introduces a new internal 
memory structure for vacuum that has shown up to a 20x reduction in memory 
usage and improvements in overall vacuuming speed. This release also removes 
the `1GB` limit on the memory vacuum can use (controlled by 
[`maintenance_work_mem`](https://www.postgresql.org/docs/17/runtime-config-resource.html#GUC-MAINTENANCE-WORK-MEM)),
 letting users apply more resources to vacuuming, which is beneficial for 
systems with lots of changes.

PostgreSQL 17 continues to improve performance of its I/O layer. As part of 
PostgreSQL 17, highly concurrent workloads may see up to a 2x performance 
improvement in write speed, based on new optimizations to lock management for 
the [write-ahead log](https://www.postgresql.org/docs/17/wal-intro.html) 
([WAL](https://www.postgresql.org/docs/17/wal-intro.html)). Additionally, this 
release introduces a streaming I/O interface, with immediate improvements for 
sequential scans (reading all the data from a table) and for updating database 
statistics with 
[`ANALYZE`](https://www.postgresql.org/docs/17/sql-analyze.html); extensions 
can also integrate with this capability to further accelerate their 
performance.

PostgreSQL 17 extends its performance gains to query execution. Planner 
statistics help PostgreSQL to determine the best way to search for data, and 
PostgreSQL 17 can now use planner statistics and the sort order of [common 
table expressions](https://www.postgresql.org/docs/17/queries-with.html) 
([`WITH` queries](https://www.postgresql.org/docs/17/queries-with.html)) to 
further speed up queries. PostgreSQL 17 also includes several indexing 
optimizations, including speeding up the execution of queries that contain an 
`IN` clause and use a [B-tree 
index](https://www.postgresql.org/docs/17/indexes-types.html#INDEXES-TYPES-BTREE)
 (the default index method in PostgreSQL), and parallel index builds for 
[BRIN](https://www.postgresql.org/docs/17/brin.html) indexes. The query planner 
in PostgreSQL 17 can now remove redundant `IS NOT NULL` statements when a 
column has a `NOT NULL` constraint, and skips over clauses that contain an `IS 
NULL` clause on a column with a `NOT NULL` constraint. PostgreSQL 17 continues 
to build on its support for explicit use of SIMD (Single Instruction/Multiple 
Data) instructions to accelerate computations, adding support for using 
AVX-512 to accelerate the 
[`bit_count`](https://www.postgresql.org/docs/17/functions-bitstring.html) 
function.

### Further expansion of a robust developer experience

Postgr

Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-09-04 Thread David Benjamin
On Wed, Sep 4, 2024 at 9:22 AM Daniel Gustafsson  wrote:

> > On 3 Sep 2024, at 14:18, Daniel Gustafsson  wrote:
>
> > Attached is a v4 rebase over the recent OpenSSL 1.0.2 removal which made
> this
> > patch no longer apply.  I've just started to dig into it so have no
> comments on
> > it right now, but wanted to get a cleaned up version into the CFBot.
>
> CFBot building green for this, I just have a few small questions/comments:
>
> +   my_bio_index |= BIO_TYPE_SOURCE_SINK;
>
> According to the OpenSSL docs we should set BIO_TYPE_DESCRIPTOR as well as
> this
> BIO is socket based, but it's not entirely clear to me why.  Is there a
> specific reason it was removed?
>

Looking around at what uses it, it seems BIO_TYPE_DESCRIPTOR is how OpenSSL
decides whether the BIO is expected to respond to BIO_get_fd
(BIO_C_GET_FD). Since the custom BIO is not directly backed by an fd and
doesn't implement that control, I think we don't want to include that bit.
https://github.com/openssl/openssl/blob/openssl-3.3.2/ssl/ssl_lib.c#L1621-L1643

The other place I saw that cares about this bit is this debug callback.
That one's kinda amusing because it assumes every fd-backed BIO stores its
fd in bio->num, but bio->num is not even accessible to external BIOs. I
assume this is an oversight because no one cares about this function.
Perhaps that should be sampled from BIO_get_fd.
https://github.com/openssl/openssl/blob/openssl-3.3.2/crypto/bio/bio_cb.c#L45-L62

Practically speaking, though, I don't think it makes any difference whether
BIO_TYPE_DESCRIPTOR or even BIO_TYPE_SOURCE_SINK is set or unset. I
couldn't find any code that's sensitive to BIO_TYPE_SOURCE_SINK and
presumably Postgres is not calling SSL_get_rfd on an SSL object that it
already knows is backed by a PGconn. TBH if you just passed 0 in for the
index, it would probably work just as well.


> +   bio_method = port_bio_method();
> if (bio_method == NULL)
> {
> SSLerr(SSL_F_SSL_SET_FD, ERR_R_BUF_LIB);
>
> SSL_F_SSL_SET_FD is no longer the correct function context for this error
> reporting.  In OpenSSL 3.x it means nothing since SSLerr throws away the
> function when calling ERR_raise_data, but we still support 1.1.0+.  I
> think the
> correct error would be BIOerr(BIO_F_BIO_METH_NEW..) but I wonder if we
> should
> just remove it since BIO_meth_new and BIO_new already set errors for us to
> consume?  It doesn't seem to make sense to add more errors on the queue
> here?
> The same goes for the frontend part.
>

Ah yeah, +1 to removing them. I've always found external code adding to the
error queue to be a little goofy. OpenSSL's error queue is weird enough
without external additions! :-)


> The attached v5 is a fresh rebase with my comments from above as 0002 to
> illustrate.
>

LGTM

David


Re: tiny step toward threading: reduce dependence on setlocale()

2024-09-04 Thread Jeff Davis
Committed v2-0001.

On Tue, 2024-09-03 at 22:04 -0700, Jeff Davis wrote:

> * This patch may change the handling of collation oid 0, and I'm not
> sure whether that was intentional or not. lc_collate_is_c(0) returned
> false, whereas pg_newlocale_from_collation(0)->collate_is_c raised
> Assert or threw an cache lookup error. I'm not sure if this is an
> actual problem, but looking at the callers, they should be more
> defensive if they expect a collation oid of 0 to work at all.

For functions that do call pg_newlocale_from_collation() when
lc_collation_is_c() returns false, the behavior is unchanged.

There are only 3 callers which don't follow that pattern:

  * spg_text_inner_consistent: gets collation ID from the index, and
text is a collatable type so it will be valid

  * match_pattern_prefix: same

  * make_greater_string: I changed the order of the tests in
make_greater_string so that if len=0 and collation=0, it won't throw an
error. If len !=0, it goes into varstr_cmp(), which will call
pg_newlocale_from_collation().

> * The comment in lc_collate_is_c() said that it returned false with
> the
> idea that the caller would throw a useful error, and I think that
> comment has been wrong for a while. If it returns false, the caller
> is
> expected to call pg_newlocale_from_collation(), which would Assert on
> a
> debug build. Should we remove that Assert, too, so that it will
> consistently throw a cache lookup failure?

I fixed this by replacing the assert with an elog(ERROR, ...), so that
it will consistently show a "cache lookup failed for collation 0"
regardless of whether it's a debug build or not. It's not expected that
the error will be encountered.

Regards,
Jeff Davis





Re: [PATCH] Avoid mixing custom and OpenSSL BIO functions

2024-09-04 Thread Daniel Gustafsson
> On 4 Sep 2024, at 23:19, David Benjamin  wrote:
> 
On Wed, Sep 4, 2024 at 9:22 AM Daniel Gustafsson wrote:
>> > On 3 Sep 2024, at 14:18, Daniel Gustafsson wrote:
>> 
>> > Attached is a v4 rebase over the recent OpenSSL 1.0.2 removal which made 
>> > this
>> > patch no longer apply.  I've just started to dig into it so have no 
>> > comments on
>> > it right now, but wanted to get a cleaned up version into the CFBot.
>> 
>> CFBot building green for this, I just have a few small questions/comments:
>> 
>> +   my_bio_index |= BIO_TYPE_SOURCE_SINK;
>> 
>> According to the OpenSSL docs we should set BIO_TYPE_DESCRIPTOR as well as 
>> this
>> BIO is socket based, but it's not entirely clear to me why.  Is there a
>> specific reason it was removed?
> 
> Looking around at what uses it, it seems BIO_TYPE_DESCRIPTOR is how OpenSSL 
> decides whether the BIO is expected to respond to BIO_get_fd (BIO_C_GET_FD). 
> Since the custom BIO is not directly backed by an fd and doesn't implement 
> that control, I think we don't want to include that bit.
> https://github.com/openssl/openssl/blob/openssl-3.3.2/ssl/ssl_lib.c#L1621-L1643
> 
> The other place I saw that cares about this bit is this debug callback. That 
> one's kinda amusing because it assumes every fd-backed BIO stores its fd in 
> bio->num, but bio->num is not even accessible to external BIOs. I assume this 
> is an oversight because no one cares about this function. Perhaps that should 
> be sampled from BIO_get_fd.
> https://github.com/openssl/openssl/blob/openssl-3.3.2/crypto/bio/bio_cb.c#L45-L62
> 
> Practically speaking, though, I don't think it makes any difference whether 
> BIO_TYPE_DESCRIPTOR or even BIO_TYPE_SOURCE_SINK is set or unset. I couldn't 
> find any code that's sensitive to BIO_TYPE_SOURCE_SINK and presumably 
> Postgres is not calling SSL_get_rfd on an SSL object that it already knows is 
> backed by a PGconn. TBH if you just passed 0 in for the index, it would 
> probably work just as well.

Following the bouncing ball around the code tonight I agree with that.  I think
we should stick to setting BIO_TYPE_SOURCE_SINK though, if only because passing
in zero might seem incorrect enough that we'd get emails about it from future
readers.

>> +   bio_method = port_bio_method();
>> if (bio_method == NULL)
>> {
>> SSLerr(SSL_F_SSL_SET_FD, ERR_R_BUF_LIB);
>> 
>> SSL_F_SSL_SET_FD is no longer the correct function context for this error
>> reporting.  In OpenSSL 3.x it means nothing since SSLerr throws away the
>> function when calling ERR_raise_data, but we still support 1.1.0+.  I think 
>> the
>> correct error would be BIOerr(BIO_F_BIO_METH_NEW..) but I wonder if we should
>> just remove it since BIO_meth_new and BIO_new already set errors for us to
>> consume?  It doesn't seem to make sense to add more errors on the queue here?
>> The same goes for the frontend part.
> 
> Ah yeah, +1 to removing them. I've always found external code adding to the 
> error queue to be a little goofy. OpenSSL's error queue is weird enough 
> without external additions! :-)

I wholeheartedly agree.  I've previously gone on record saying that every day
with the OpenSSL API is an adventure, and the errorhandling code doubly so.

>> The attached v5 is a fresh rebase with my comments from above as 0002 to
>> illustrate.
> 
> LGTM 

Thanks for reviewing, I plan on going ahead with this patch shortly.

--
Daniel Gustafsson



Re: Invalid "trailing junk" error message when non-English letters are used

2024-09-04 Thread Tom Lane
Karina Litskevich  writes:
> I see the two solutions here: either move the rule for decinteger_junk
> below the rules for hexinteger_junk, octinteger_junk and bininteger_junk,
> or just use a single rule decinteger_junk for all these cases, since the
> error message is the same anyway. I implemented the latter in the second
> version of the patch, also renamed this common rule to integer_junk.

That seems reasonable, but IMO this code was unacceptably
undercommented before and what you've done has made it worse.
We really need a comment block associated with the flex macros,
perhaps along the lines of

/*
 * An identifier immediately following a numeric literal is disallowed
 * because in some cases it's ambiguous what is meant: for example,
 * 0x1234 could be either a hexinteger or a decinteger "0" and an
 * identifier "x1234".  We can detect such problems by seeing if
 * integer_junk matches a longer substring than any of the XXXinteger
 * patterns.  (One "junk" pattern is sufficient because this will match
 * all the same strings we'd match with {hexinteger}{identifier} etc.)
 * Note that the rule for integer_junk must appear after the ones for
 * XXXinteger to make this work correctly.
 */

(Hmm, actually, is that last sentence true?  My flex is a bit rusty.)

param_junk really needs a similar comment, or maybe we could put
all the XXX_junk macros together and use one comment for all.
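
The ambiguity being guarded against is easy to see from SQL (a sketch;
the second statement shows the two-token reading, spelled with a space):

    SELECT 0x1234;    -- one token: a hexinteger literal, value 4660
    SELECT 0 x1234;   -- two tokens: decinteger "0" with column alias "x1234"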

> Additionally, I noticed that this patch is going to change error messages
> in some cases, though I don't think it's a big deal. Example:
> Without patch:
> postgres=# select 0xyz;
> ERROR:  invalid hexadecimal integer at or near "0x"
> With patch:
> postgres=# select 0xyz;
> ERROR:  trailing junk after numeric literal at or near "0xyz"

That's sort of annoying, but I don't really see a better way,
or at least not one that's worth the effort.

>> FWIW output of the whole string in the error message doesnt' look nice to
>> me, but other places of code do this anyway e.g:
>> select ('1'||repeat('p',100))::integer;
>> This may be worth fixing.

I think this is nonsense: we are already in the habit of repeating the
whole failing query string in the STATEMENT field.  In any case it's
not something for this patch to worry about.

regards, tom lane




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-09-04 Thread Jehan-Guillaume de Rorthais
On Mon, 2 Sep 2024 23:01:47 +0200
Jehan-Guillaume de Rorthais  wrote:

[…]

>   My proposal was to clean everything related to the old FK and use some
>   existing code path to create a fresh and cleaner one. This requires some
>   refactoring in existing code, but we would win a common path of code between
>   create/attach/detach, a cleaner catalog and easier code maintenance.
> 
>   I've finally been able to write a PoC that implements this by calling
>   addFkRecurseReferenced() from DetachPartitionFinalize(). I can't attach
>   it here because it is currently an ugly draft and I still have some work
>   to do. But I would really like to have a little more time (one or two days)
>   to explore this avenue further before you commit yours, if you don't mind?
>   Or maybe you already have considered this avenue and rejected it?

Please, find in attachment a patch implementing this idea.

Regards,
From 0a235cc7040e9b33ba0bb03b0cff34a7281db137 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais 
Date: Wed, 5 Jul 2023 19:19:40 +0200
Subject: [PATCH] Rework foreign key mangling during ATTACH/DETACH

---
 src/backend/commands/tablecmds.c  | 363 +-
 src/test/regress/expected/foreign_key.out |  67 
 src/test/regress/sql/foreign_key.sql  |  40 +++
 3 files changed, 386 insertions(+), 84 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index b3cc6f8f69..f345d6f018 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -504,7 +504,13 @@ static ObjectAddress ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *
 			   Relation rel, Constraint *fkconstraint,
 			   bool recurse, bool recursing,
 			   LOCKMODE lockmode);
-static ObjectAddress addFkRecurseReferenced(List **wqueue, Constraint *fkconstraint,
+static ObjectAddress addFkConstraint(Constraint *fkconstraint, Relation rel,
+	 Relation pkrel, Oid indexOid, Oid parentConstr,
+	 int numfks, int16 *pkattnum, int16 *fkattnum,
+	 Oid *pfeqoperators, Oid *ppeqoperators,
+	 Oid *ffeqoperators, int numfkdelsetcols,
+	 int16 *fkdelsetcols);
+static void addFkRecurseReferenced(List **wqueue, Constraint *fkconstraint,
 			Relation rel, Relation pkrel, Oid indexOid, Oid parentConstr,
 			int numfks, int16 *pkattnum, int16 *fkattnum,
 			Oid *pfeqoperators, Oid *ppeqoperators, Oid *ffeqoperators,
@@ -645,9 +651,11 @@ static void DropClonedTriggersFromPartition(Oid partitionId);
 static ObjectAddress ATExecDetachPartition(List **wqueue, AlteredTableInfo *tab,
 		   Relation rel, RangeVar *name,
 		   bool concurrent);
-static void DetachPartitionFinalize(Relation rel, Relation partRel,
-	bool concurrent, Oid defaultPartOid);
-static ObjectAddress ATExecDetachPartitionFinalize(Relation rel, RangeVar *name);
+static void DetachPartitionFinalize(List **wqueue, Relation rel,
+	Relation partRel, bool concurrent,
+	Oid defaultPartOid);
+static ObjectAddress ATExecDetachPartitionFinalize(List **wqueue, Relation rel,
+   RangeVar *name);
 static ObjectAddress ATExecAttachPartitionIdx(List **wqueue, Relation parentIdx,
 			  RangeVar *name);
 static void validatePartitionedIndex(Relation partedIdx, Relation partedTbl);
@@ -5468,7 +5476,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			((PartitionCmd *) cmd->def)->concurrent);
 			break;
 		case AT_DetachPartitionFinalize:
-			address = ATExecDetachPartitionFinalize(rel, ((PartitionCmd *) cmd->def)->name);
+			address = ATExecDetachPartitionFinalize(wqueue, rel, ((PartitionCmd *) cmd->def)->name);
 			break;
 		default:/* oops */
 			elog(ERROR, "unrecognized alter table type: %d",
@@ -9901,7 +9909,7 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 * Create all the constraint and trigger objects, recursing to partitions
 	 * as necessary.  First handle the referenced side.
 	 */
-	address = addFkRecurseReferenced(wqueue, fkconstraint, rel, pkrel,
+	address = addFkConstraint(fkconstraint, rel, pkrel,
 	 indexOid,
 	 InvalidOid,	/* no parent constraint */
 	 numfks,
@@ -9911,6 +9919,18 @@ ATAddForeignKeyConstraint(List **wqueue, AlteredTableInfo *tab, Relation rel,
 	 ppeqoperators,
 	 ffeqoperators,
 	 numfkdelsetcols,
+	 fkdelsetcols);
+
+	addFkRecurseReferenced(wqueue, fkconstraint, rel, pkrel,
+	 indexOid,
+	 address.objectId,
+	 numfks,
+	 pkattnum,
+	 fkattnum,
+	 pfeqoperators,
+	 ppeqoperators,
+	 ffeqoperators,
+	 numfkdelsetcols,
 	 fkdelsetcols,
 	 old_check_ok,
 	 InvalidOid, InvalidOid);
@@ -9974,47 +9994,20 @@ validateFkOnDeleteSetColumns(int numfks, const int16 *fkattnums,
 	}
 }
 
-/*
- * addFkRecurseReferenced
- *		subroutine for ATAddForeignKeyCo

Re: Use streaming read API in ANALYZE

2024-09-04 Thread Thomas Munro
On Thu, Sep 5, 2024 at 3:36 AM Robert Haas  wrote:
> On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro  wrote:
> > Thanks for the explanation.  I think we should revert it.  IMHO it was
> > a nice clean example of a streaming transformation, but unfortunately
> > it transformed an API that nobody liked in the first place, and broke
> > some weird and wonderful workarounds.  Let's try again in 18.
>
> The problem I have with this is that we just released RC1. I suppose
> if we have to make this change it's better to do it sooner than later,
> but are we sure we want to whack this around this close to final
> release?

I hear you.  But I definitely don't want to (and likely can't at this
point) make any of the other proposed changes, and I also don't want
to break Timescale.  That seems to leave only one option: go back to
the v16 API for RC2, and hope that the ongoing table AM discussions
for v18 (CF #4866) will fix all the problems for the people whose TAMs
don't quack like a "heap", and the people whose TAMs do and who would
not like to duplicate the code, and the people who want streaming I/O.




Re: Avoid overflowed array index (src/backend/utils/activity/pgstat.c)

2024-09-04 Thread Michael Paquier
On Wed, Sep 04, 2024 at 03:14:34PM -0300, Ranier Vilela wrote:
> Commit 7949d95 left an oversight.
> 
> The report is:
> CID 1559468: (#1 of 1): Overflowed array index read (INTEGER_OVERFLOW)
> 
> I think that Coverity is right.
> In the function *pgstat_read_statsfile*, it is necessary to check the
> most restrictive case first.
> 
> Otherwise, if PgStat_Kind is greater than 11, a negative index may occur.

You are missing the fact that there is a call to
pgstat_is_kind_valid() a couple of lines above, meaning that we are
sure that the kind ID we are dealing with is within the range [1,11]
for built-in kinds or [128,256] for the custom kinds, so any ID not
within the first range would just be within the second range.

Speaking of which, I am spotting two possible NULL pointer dereferences
when reading the stats file if we are loading custom stats that do not
exist anymore compared to the moment when they were written, for two
cases:
- Fixed-numbered stats entries.
- Named entries, like replication slot stats, but for the custom case.
It would mean that we'd crash at startup when reading stats depending
on how shared_preload_libraries has changed, which is not the original
intention.  The patch includes details to inform what was found
wrong with two new WARNING messages.  Will fix in a bit, attaching it
for now.

Kind of interesting that your tool did not spot that, and missed the
two I have noticed considering that we're dealing with the same code
paths.  The community coverity did not complain on any of them, AFAIK.
--
Michael
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b2ca3f39b7..612158f2b9 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1781,6 +1781,12 @@ pgstat_read_statsfile(XLogRecPtr redo)
 	}
 
 	info = pgstat_get_kind_info(kind);
+	if (!info)
+	{
+		elog(WARNING, "could not find information of kind %u for entry of type %c",
+			 kind, t);
+		goto error;
+	}
 
 	if (!info->fixed_amount)
 	{
@@ -1861,6 +1867,12 @@ pgstat_read_statsfile(XLogRecPtr redo)
 		}
 
 		kind_info = pgstat_get_kind_info(kind);
+		if (!kind_info)
+		{
+			elog(WARNING, "could not find information of kind %u for entry of type %c",
+ kind, t);
+			goto error;
+		}
 
 		if (!kind_info->from_serialized_name)
 		{


signature.asc
Description: PGP signature


Re: Typos in the code and README

2024-09-04 Thread Michael Paquier
On Thu, Sep 05, 2024 at 03:34:31AM +1200, David Rowley wrote:
> Anyway, I doubt hard
> guidelines are warranted here, but maybe some hints about best
> practices in https://wiki.postgresql.org/wiki/Committing_checklist ?

Yep, that may be useful.  I just tend to be cautious because it can be
very easy to mess up things depending on the code path you're
manipulating, speaking with some..  Experience on the matter.  And an
RC1 is kind of something to be cautious with.
--
Michael


signature.asc
Description: PGP signature


Re: Eager aggregation, take 3

2024-09-04 Thread Tender Wang
Richard Guo  于2024年8月21日周三 15:11写道:

> On Fri, Aug 16, 2024 at 4:14 PM Richard Guo 
> wrote:
> > I had a self-review of this patchset and made some refactoring,
> > especially to the function that creates the RelAggInfo structure for a
> > given relation.  While there were no major changes, the code should
> > now be simpler.
>
> I found a bug in v10 patchset: when we generate the GROUP BY clauses
> for the partial aggregation that is pushed down to a non-aggregated
> relation, we may produce a clause with a tleSortGroupRef that
> duplicates one already present in the query's groupClause, which would
> cause problems.
>
> Attached is the updated version of the patchset that fixes this bug
> and includes further code refactoring.


 I reviewed the v11 patch set, and here are a few of my thoughts:

1.  In setup_eager_aggregation(), before calling create_agg_clause_infos(),
it does some checks to see whether eager aggregation is available. Can we
move those checks into a function, for example can_eager_agg(), like
can_partial_agg() does?

2.  I found that outside of joinrel.c we all use IS_DUMMY_REL, but in
joinrel.c, Tom always uses is_dummy_rel(). Other committers use
IS_DUMMY_REL.

3.  The attached patch does not consider FDWs when creating a path for
grouped_rel or grouped_join.
Do we need to think about FDWs?

I haven't finished reviewing the patch set. I will continue to learn this
feature.

-- 
Thanks,
Tender Wang


Re: Test 041_checkpoint_at_promote.pl faild in installcheck due to missing injection_points

2024-09-04 Thread Michael Paquier
On Wed, Sep 04, 2024 at 07:05:32PM +0300, Maxim Orlov wrote:
> Works for me with configure build. Meson build, obviously, still need extra
> "meson compile install-test-files" step
> as expected. From your patch, I see that you used safe_psql call to check
> for availability of the injection_points
> extension.

> Are there some hidden, non-obvious reasons for it? It's much
> simpler to check output of the
> pg_config as Álvaro suggested above, does it? And we don't need active node
> for this. Or I miss something?

Even if the code is compiled with injection points enabled, it's still
going to be necessary to check if the module exists in the
installation or not.  And it may or may not be around.

One thing that we could do, rather than relying on the environment
variable for the compile-time check, would be to scan pg_config.h for
"USE_INJECTION_POINTS 1".  If it is set, we could skip the use of these
environment variables.  That's really kind of the same thing for
with_ssl or similar depending on the dependencies that exist.  So
that's switching one thing to the other.  I am not sure that's worth
switching at this stage.  It does not change the need of a runtime
check to make sure that the module is installed, though.

Another thing that we could do, rather than query
pg_available_extension, is implementing an equivalent on the perl
side, scanning an installation tree for a module .so or equivalent,
but I've never been much a fan of the extra maintenance burden these
duplications introduce, copying what the backend is able to handle
already in a more transparent way for the TAP routines.

> And one other thing I must mention. I don't know why, but my pg_config from
> meson build show empty configure
> despite the fact, I make pass the same options in both cases.
> 
> autotools:
> $ ./pg_config --configure
> '--enable-debug' '--enable-tap-tests' '--enable-depend' 
> 
> meson:
> $ ./pg_config --configure

Yes, ./configure does not apply to meson, so I'm not sure what we
should do here, except perhaps inventing a new option switch that
reports the options that have been used with the meson command, or
something like that.
--
Michael


signature.asc
Description: PGP signature


Re: meson and check-tests

2024-09-04 Thread jian he
On Wed, Jun 5, 2024 at 7:26 PM Ashutosh Bapat wrote:
>
>
>
> On Mon, Jun 3, 2024 at 10:04 PM Tristan Partin  wrote:
>>
>> On Sun Jun 2, 2024 at 12:25 AM CDT, Tom Lane wrote:
>> > "Tristan Partin"  writes:
>> > > On Fri May 31, 2024 at 12:02 PM CDT, Ashutosh Bapat wrote:
>> > >> We talked this off-list at the conference. It seems we have to somehow
>> > >> avoid passing pg_regress --schedule argument and instead pass the list 
>> > >> of
>> > >> tests. Any idea how to do that?
>> >
>> > > I think there are 2 solutions to this.
>> > > 1. Avoid passing --schedule by default, which doesn't sound like a great
>> > >solution.
>> > > 2. Teach pg_regress to ignore the --schedule option if specific tests
>> > >are passed instead.
>> > > 3. Add a --no-schedule option to pg_regress which would override the
>> > >previously added --schedule option.
>> > > I personally prefer 2 or 3.
>
>
>
>
>>
>> >
>> > Just to refresh peoples' memory of what the Makefiles do:
>> > src/test/regress/GNUmakefile has
>> >
>> > check: all
>> >   $(pg_regress_check) $(REGRESS_OPTS) 
>> > --schedule=$(srcdir)/parallel_schedule $(MAXCONNOPT) $(EXTRA_TESTS)
>> >
>> > check-tests: all | temp-install
>> >   $(pg_regress_check) $(REGRESS_OPTS) $(MAXCONNOPT) $(TESTS) 
>> > $(EXTRA_TESTS)
>>
>> >
>> > (and parallel cases for installcheck etc).  AFAICS, meson.build has
>> > no equivalent to the EXTRA_TESTS add-on, nor does it have behavior
>> > equivalent to check-tests' substitution of $(TESTS) for --schedule.
>> > But I suggest that those behaviors have stood for a long time and
>> > so the appropriate thing to do is duplicate them as best we can,
>> > not invent something different.
>>
>> In theory, this makes sense. In practice, this is hard to emulate. We
>> could make the check-tests a Meson run_target() instead of another
>> test(), which would end up running the same tests more than once.
>
>
> meson has changed the way we run individual perl tests and that looks better. 
> So I am fine if meson provides a better way to do what `make check-tests` 
> does. But changing pg_regress seems a wrong choice or even making changes to 
> the current make system. Instead we should make meson pass the right 
> arguments to pg_regress. In this case, it should not pass --schedule when we 
> need `make check-tests` like functionality.
>
> Just adding check-tests as new target won't help we need some way to specify 
> "which tests" to run. Thus by default this target should not run any tests? I 
> don't understand meson well. So I might be completely wrong?
>
> How about the following options?
> 1. TESTS="..." meson test --suite regress - would run the specified tests 
> from regress
>
> 2. Make `meson test --suite regress / regress/partition_join` run 
> partition_join.sql test. I am not how to specify multiple tests in this 
> command. May be `meson test --suite regress / 
> regress/test_setup,partition_join` will do that. make check-tests allows one 
> to run multiple tests like TESTS="test_setup partition_join" make check-tests.
>

Hi. I think it's a good feature for development.
The full regression test still takes a very long time to run.

Sometimes when you implement a feature, you are pretty sure the
changes will only happen in a single sql file,
so this can make testing faster.




Re: Add callback in pgstats for backend initialization

2024-09-04 Thread Michael Paquier
On Wed, Sep 04, 2024 at 06:42:33PM +0900, Kyotaro Horiguchi wrote:
> More accurately, I believe this applies when the sentence follows a
> verb-object structure. In this case, the function’s meaning is “pgstat
> initialization on backend,” which seems somewhat different from the
> policy you mentioned. However, it could also be interpreted as
> “initialize backend status related to pgstat.” Either way, I’m okay
> with the current name if you are, based on the discussion above.

Not sure which one is better than the other, TBH.  Either way, we
still have a full release cycle to finalize that, and I am OK to
switch the name to something else if there are more folks in favor of
one, the other or even something entirely different but descriptive
enough.
--
Michael


signature.asc
Description: PGP signature


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 2:49 PM shveta malik  wrote:
>
> On Wed, Sep 4, 2024 at 9:17 AM shveta malik  wrote:
> >
> > On Tue, Sep 3, 2024 at 3:01 PM shveta malik  wrote:
> > >
> > >
>
>
> 1)
> It is related to one of my previous comments (pt 3 in [1]) where I
> stated that inactive_since should not keep on changing once a slot is
> invalidated.
>

Agreed. Updating the inactive_since for a slot that is already invalid
is misleading.

>
>
> 2)
> One more issue with this message is that once I set
> replication_slot_inactive_timeout to a bigger value, it becomes even more
> misleading. This is because the invalidation was done in the past using the
> previous value, while the message starts showing the new value:
>
> ALTER SYSTEM SET replication_slot_inactive_timeout TO '36h';
>
> --see 129600 secs in DETAIL and the current time.
> postgres=# SELECT * FROM pg_replication_slot_advance('mysubnew1_1',
> pg_current_wal_lsn());
> ERROR:  can no longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 10:06:38.980939+05:30, which is more than 129600 seconds
> ago.
> postgres=# select now();
>                now
> ----------------------------------
>  2024-09-04 10:07:35.201894+05:30
>
> I feel we should change this message itself.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 9:17 AM shveta malik  wrote:
>
> On Tue, Sep 3, 2024 at 3:01 PM shveta malik  wrote:
> >
> >
> > 1)
> > I see that ReplicationSlotAlter() will error out if the slot is
> > invalidated due to timeout. I have not tested it myself, but do you
> > know if  slot-alter errors out for other invalidation causes as well?
> > Just wanted to confirm that the behaviour is consistent for all
> > invalidation causes.
>
> I was able to test this and, as anticipated, the behavior is different. When
> a slot is invalidated due to, say, 'wal_removed', I am still able to
> 'alter' that slot.
> Please see:
>
> Pub:
>   slot_name  | failover | synced |          inactive_since          | invalidation_reason
> -------------+----------+--------+----------------------------------+---------------------
>  mysubnew1_1 | t        | f      | 2024-09-04 08:58:12.802278+05:30 | wal_removed
>
> Sub:
> newdb1=# alter subscription mysubnew1_1 disable;
> ALTER SUBSCRIPTION
>
> newdb1=# alter subscription mysubnew1_1 set (failover=false);
> ALTER SUBSCRIPTION
>
> Pub: (failover altered)
>   slot_name  | failover | synced |          inactive_since          | invalidation_reason
> -------------+----------+--------+----------------------------------+---------------------
>  mysubnew1_1 | f        | f      | 2024-09-04 08:58:47.824471+05:30 | wal_removed
>
>
> while when invalidation_reason is 'inactive_timeout', it fails:
>
> Pub:
>   slot_name  | failover | synced |          inactive_since          | invalidation_reason
> -------------+----------+--------+----------------------------------+---------------------
>  mysubnew1_1 | t        | f      | 2024-09-03 14:30:57.532206+05:30 | inactive_timeout
>
> Sub:
> newdb1=# alter subscription mysubnew1_1 disable;
> ALTER SUBSCRIPTION
>
> newdb1=# alter subscription mysubnew1_1 set (failover=false);
> ERROR:  could not alter replication slot "mysubnew1_1": ERROR:  can no
> longer get changes from replication slot "mysubnew1_1"
> DETAIL:  The slot became invalid because it was inactive since
> 2024-09-04 08:54:20.308996+05:30, which is more than 0 seconds ago.
> HINT:  You might need to increase "replication_slot_inactive_timeout.".
>
> I think the behavior should be same.
>

We should not allow an invalid replication slot to be altered,
irrespective of the invalidation reason, unless there is some benefit.
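
For anyone following along, a quick way to check on the publisher whether a
slot has already been invalidated (for any reason) before attempting such
operations; a minimal sketch assuming the invalidation_reason column shown
upthread:

SELECT slot_name,
       invalidation_reason,
       inactive_since
FROM pg_replication_slots
WHERE invalidation_reason IS NOT NULL;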

-- 
With Regards,
Amit Kapila.




Role Granting Issues in PostgreSQL: Need Help

2024-09-04 Thread Muhammad Imtiaz
Hi,

I need to assign role permissions from one role to another. However, after
granting the role, I see that the permission list for the target role has
not been updated. For this process, I followed the PostgreSQL documentation
on role membership. Please let me know if I've missed anything.

I am using PostgreSQL version 16 and I have followed these steps.
postgres=# select version();
                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit
(1 row)

1. Create a role with specific permissions

CREATE ROLE rep_admin WITH LOGIN CREATEDB CREATEROLE REPLICATION;

2. Create another role named replication_expert:

CREATE ROLE replication_expert;

3. Grant the rep_admin role to the replication_expert role with inheritance:

GRANT rep_admin TO replication_expert WITH INHERIT TRUE;
GRANT ROLE

4. Attempt to log in using the replication_expert role:

postgres=# \c postgres replication_expert
connection to server on socket "/run/postgresql/.s.PGSQL.5432" failed:
FATAL:  role "replication_expert" is not permitted to log in

5. Check the role attributes to see if they have been reflected:

postgres=# \du+
                                         List of roles
     Role name      |                         Attributes                          | Description
--------------------+-------------------------------------------------------------+-------------
 postgres           | Superuser, Create role, Create DB, Replication, Bypass RLS |
 rep_admin          | Create role, Create DB, Replication                         |
 replication_expert | Cannot login                                                |

6. Examine the pg_roles table to confirm that the permissions for
replication_expert have not been updated:

postgres=# SELECT rolname, rolinherit, rolcreaterole, rolcreatedb,
           rolcanlogin, rolreplication
FROM pg_roles WHERE rolname IN ('rep_admin','replication_expert');
      rolname       | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication
--------------------+------------+---------------+-------------+-------------+----------------
 rep_admin          | t          | t             | t           | t           | t
 replication_expert | t          | f             | f           | f           | f
(2 rows)

postgres=#

Regards,
Muhammad Imtiaz


Re: Role Granting Issues in PostgreSQL: Need Help

2024-09-04 Thread David G. Johnston
On Wednesday, September 4, 2024, Muhammad Imtiaz 
wrote:

>
> 1. Create a role with specific permissions
>
> CREATE ROLE rep_admin WITH LOGIN CREATEDB CREATEROLE REPLICATION;
>
>                                          List of roles
>      Role name      |                         Attributes                          | Description
> --------------------+-------------------------------------------------------------+-------------
>  postgres           | Superuser, Create role, Create DB, Replication, Bypass RLS |
>  rep_admin          | Create role, Create DB, Replication                         |
>  replication_expert | Cannot login                                                |
>
>
> 6. Examine the pg_roles table to confirm that the permissions for
> replication_expert have not been updated:
>
> postgres=# SELECT rolname, rolinherit, rolcreaterole, rolcreatedb,
>            rolcanlogin, rolreplication
> FROM pg_roles WHERE rolname IN ('rep_admin','replication_expert');
>       rolname       | rolinherit | rolcreaterole | rolcreatedb | rolcanlogin | rolreplication
> --------------------+------------+---------------+-------------+-------------+----------------
>  rep_admin          | t          | t             | t           | t           | t
>  replication_expert | t          | f             | f           | f           | f
> (2 rows)
>
>
Those are not permissions, they are attributes, and attributes are not
inherited.

David J.


Re: Role Granting Issues in PostgreSQL: Need Help

2024-09-04 Thread Tom Lane
"David G. Johnston"  writes:
> On Wednesday, September 4, 2024, Muhammad Imtiaz 
> wrote:
>> replication_expert | Cannot login

> Those are not permissions, they are attributes, and attributes are not
> inherited.

Specifically: the NOLOGIN attribute on a role is a hard block on
logging in with that role, independently of any and every other
condition.
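
To spell that out against the example upthread (a sketch, not the only
arrangement): connecting requires LOGIN on the role you connect as, and the
rep_admin attributes are exercised by becoming rep_admin via SET ROLE, since
role attributes never flow through membership:

ALTER ROLE replication_expert WITH LOGIN;  -- allow the role to connect at all

-- After connecting as replication_expert, switch to the granted role;
-- its CREATEDB/CREATEROLE/REPLICATION attributes then apply directly.
SET ROLE rep_admin;
CREATE DATABASE demo_db;  -- demo_db is a hypothetical name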

regards, tom lane




Re: relfilenode statistics

2024-09-04 Thread Bertrand Drouvot
Hi,

On Mon, Aug 05, 2024 at 05:28:22AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Thu, Jul 11, 2024 at 06:10:23AM +, Bertrand Drouvot wrote:
> > Hi,
> > 
> > On Thu, Jul 11, 2024 at 01:58:19PM +0900, Michael Paquier wrote:
> > > On Wed, Jul 10, 2024 at 01:38:06PM +, Bertrand Drouvot wrote:
> > > > So, I think it makes sense to link the hashkey to all the RelFileLocator
> > > > fields, means:
> > > > 
> > > > dboid (linked to RelFileLocator's dbOid)
> > > > objoid (linked to RelFileLocator's spcOid)
> > > > relfile (linked to RelFileLocator's relNumber)
> > > 
> > > Hmm.  How about using the table OID as objoid,
> > 
> > The issue is that we don't have the relation OID when writing buffers out
> > (that's one of the reasons explained in [1]).
> > 
> > [1]: 
> > https://www.postgresql.org/message-id/Zl2k8u4HDTUW6QlC%40ip-10-97-1-34.eu-west-3.compute.internal
> > 
> > Regards,
> > 
> 
> Please find attached a mandatory rebase due to the recent changes around
> statistics.
> 
> As mentioned up-thread:
> 
> The attached patch is not in a fully "polished" state yet: there are more
> places where we should add relfilenode counters, and more APIs to retrieve
> the relfilenode stats need to be created.
> 
> It is in a state that can be used to discuss the approach it is implementing 
> (as
> we have done so far) before moving forward.

Please find attached a mandatory rebase.

In passing: based on the previous discussion (and given that we don't have
the relation OID when writing buffers out), do you see another approach than
the one this patch is implementing?
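
To make the end-user payoff concrete, a hypothetical query shape once
relation-level write tracking is exposed (the blocks_written name is taken
from the patch notes below; the final column name and view may differ):

SELECT relname, blocks_written
FROM pg_statio_all_tables
ORDER BY blocks_written DESC
LIMIT 5;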

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 095af2878f8ab85509766807f60e0dadcf0cd018 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 16 Nov 2023 02:30:01 +
Subject: [PATCH v4] Provide relfilenode statistics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We currently don’t have write counters for relations.
The reason is that we don’t have the relation OID when writing buffers out.
Tracking writes per relfilenode would allow us to track/consolidate writes per
relation.

relfilenode stats are also beneficial for the "Split index and table statistics
into different types of stats" work in progress: it would allow us to avoid
additional branches in some situations.

=== Remarks ===

This is a POC patch. There is still work to do: there are more places where we
should add relfilenode counters and more APIs to create for retrieving the
relfilenode stats. The patch takes care of the rewrite generated by TRUNCATE,
but there is more to handle, such as CLUSTER and VACUUM FULL.

The new logic to retrieve stats in pg_statio_all_tables has been implemented
only for the new blocks_written stat (we'd need to do the same for the existing
buffer read / buffer hit if we agree on the approach implemented here).

The goal of this patch is to start the discussion and agree on the design before
moving forward.
---
 src/backend/access/rmgrdesc/xactdesc.c|   5 +-
 src/backend/catalog/storage.c |   8 ++
 src/backend/catalog/system_functions.sql  |   2 +-
 src/backend/catalog/system_views.sql  |   5 +-
 src/backend/postmaster/checkpointer.c |   5 +
 src/backend/storage/buffer/bufmgr.c   |   6 +-
 src/backend/storage/smgr/md.c |   7 ++
 src/backend/utils/activity/pgstat.c   |  39 --
 src/backend/utils/activity/pgstat_database.c  |  12 +-
 src/backend/utils/activity/pgstat_function.c  |  13 +-
 src/backend/utils/activity/pgstat_relation.c  | 112 --
 src/backend/utils/activity/pgstat_replslot.c  |  13 +-
 src/backend/utils/activity/pgstat_shmem.c |  19 ++-
 .../utils/activity/pgstat_subscription.c  |  14 +--
 src/backend/utils/activity/pgstat_xact.c  |  60 +++---
 src/backend/utils/adt/pgstatfuncs.c   |  34 +-
 src/include/access/tableam.h  |  19 +++
 src/include/access/xact.h |   1 +
 src/include/catalog/pg_proc.dat   |  14 ++-
 src/include/pgstat.h  |  37 --
 src/include/utils/pgstat_internal.h   |  34 --
 src/test/recovery/t/029_stats_restart.pl  |  40 +++
 .../recovery/t/030_stats_cleanup_replica.pl   |   6 +-
 src/test/regress/expected/rules.out   |  12 +-
 src/test/regress/expected/stats.out   |  30 ++---
 src/test/regress/sql/stats.sql|  30 ++---
 src/test/subscription/t/026_stats.pl  |   4 +-
 src/tools/pgindent/typedefs.list  |   1 +
 28 files changed, 425 insertions(+), 157 deletions(-)
   4.4% src/backend/catalog/
  46.4% src/backend/utils/activity/
   6.2% src/backend/utils/adt/
   3.6% src/backend/
   3.1% src/include/access/
   3.2% src/include/catalog/
   5.9% src/include/utils/
   6.6% src/include/
  11.7% src/test/recovery/t/
   5.3% src/test/regress/expected/
   3.0% 

Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-04 Thread David Rowley
On Wed, 10 Jul 2024 at 21:36, Tatsuo Ishii  wrote:
> v2-0005-Add-memory-disk-usage-for-Window-Aggregate-nodes-.patch: This
> adds memory/disk usage for Window Aggregate nodes in EXPLAIN (ANALYZE)
> command. Note that if David's proposal
> https://www.postgresql.org/message-id/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com
> is committed, this will need to be adjusted.

Hi,

I pushed the changes to WindowAgg so as not to call tuplestore_end()
on every partition.  Can you rebase this patch over that change?

It would be good to do this in a way that does not add any new state
to WindowAggState, you can see that I had to shuffle fields around in
that struct because the next_partition field would have caused the
struct to become larger. I've not looked closely, but I expect this
can be done by adding more code to tuplestore_updatemax() to also
track the disk space used if the current storage has gone to disk. I
expect the maxSpace field can be used for both, but we'd need another
bool field to track if the max used was by disk or memory.

I think the performance of this would also need to be tested as it
means doing an lseek() on every tuplestore_clear() when we've gone to
disk. Probably that will be dominated by all the other overheads of a
tuplestore going to disk (i.e. dumptuples() etc), but it would be good
to check this. I suggest setting work_mem = 64 and making a test case
that only just spills to disk. Maybe do a few thousand partitions
worth of that and see if you can measure any slowdown.
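
A minimal sketch of such a test case (all names and sizes are illustrative;
'work_mem = 64' means 64kB, the minimum value). The rows-per-partition count
would need tuning until EXPLAIN (ANALYZE) shows the tuplestore only just
spilling to disk:

SET work_mem = '64kB';
-- A few thousand partitions, each only just exceeding work_mem.
SELECT count(*)
FROM (SELECT a, sum(b) OVER (PARTITION BY a ORDER BY b) AS s
      FROM (SELECT i % 3000 AS a, i AS b
            FROM generate_series(1, 3000 * 2000) i) AS t) AS w;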

David




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-09-04 Thread Tender Wang
Alvaro Herrera wrote on Thu, 8 Aug 2024 at 06:50:

> On 2024-Jul-26, Tender Wang wrote:
>
> > Junwang Zhao wrote on Fri, 26 Jul 2024 at 14:57:
> >
> > > There is a bug report[0] Tender comments might be the same issue as
> > > this one, but I tried Alvaro's and mine patch, neither could solve
> > > that problem, I did not tried Tender's earlier patch thought. I post
> > > the test script below in case you are interested.
> >
> > My earlier patch should handle the case Alexander reported, but I did not
> > do more testing. I'm not sure whether or not there is a mismatch between
> > pg_constraint and pg_trigger.
> >
> > I agreed with Alvaro, who said this "requires a much more invasive
> > solution".
>
> Here's the patch which, as far as I can tell, fixes all the reported
> problems (other than the one in bug 18541, for which I proposed an
> unrelated fix in that thread[1]).  If you can double-check, I would very
> much appreciate that.  Also, I think the test cases the patch adds
> reflect the provided examples sufficiently, but if we're still failing
> to cover some, please let me know.
>

While reviewing Jehan-Guillaume's v2 patch, I found the code below that needs
a little tweak, in DetachPartitionFinalize():
/*
 * If the referenced side is partitioned (which we know because our
 * parent's constraint points to a different relation than ours) then
 * we must, in addition to the above, create pg_constraint rows that
 * point to each partition, each with its own action triggers.
 */
if (parentConForm->conrelid != conform->conrelid)

I found that the above IF is always true, regardless of whether the
referenced side is partitioned.
Although find_all_inheritors() will return an empty children list when the
referenced side is not partitioned, checking the relkind instead lets us
avoid a lot of useless work.
How about this way:
if (get_rel_relkind(conform->confrelid) == RELKIND_PARTITIONED_TABLE)
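
For readers following along, "the referenced side is partitioned" refers to
confrelid being a partitioned table; schematically (an illustrative shape,
not the exact reported reproducer):

CREATE TABLE ref_p (id int PRIMARY KEY) PARTITION BY RANGE (id);
CREATE TABLE ref_p1 PARTITION OF ref_p FOR VALUES FROM (0) TO (1000);

-- The referencing side is also partitioned; DETACH must then fix up
-- per-partition pg_constraint rows and action triggers.
CREATE TABLE fk_p (id int REFERENCES ref_p) PARTITION BY RANGE (id);
CREATE TABLE fk_p1 PARTITION OF fk_p FOR VALUES FROM (0) TO (1000);

ALTER TABLE fk_p DETACH PARTITION fk_p1;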


--
Thanks,
Tender Wang


Re: Track the amount of time waiting due to cost_delay

2024-09-04 Thread Bertrand Drouvot
Hi,

On Mon, Sep 02, 2024 at 05:11:36AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Aug 20, 2024 at 12:48:29PM +, Bertrand Drouvot wrote:
> > As it looks like we have a consensus not to wait on [0] (as reducing the 
> > number
> > of interrupts makes sense on its own), then please find attached v4, a 
> > rebase
> > version (that also makes clear in the doc that that new field might show 
> > slightly
> > old values, as mentioned in [1]).
> 
> Please find attached v5, a mandatory rebase.

Please find attached v6, a mandatory rebase due to catversion bump conflict.
I'm removing the catversion bump from the patch as it generates conflicts too
frequently (the commit message just mentions that it needs to be done).
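
As a reminder of how the new counter is meant to be consumed (a sketch; the
column name is per the patch below):

-- time_delayed is in milliseconds and refreshed at most once per second.
SELECT pid, relid::regclass AS relation, phase, time_delayed
FROM pg_stat_progress_vacuum;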

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 45be7dfd86948415962696128a17a68e49c9a773 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +
Subject: [PATCH v6] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: time_delayed to the pg_stat_progress_vacuum system
view to show the total amount of time in milliseconds that vacuum has been
delayed.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In the case of a parallel worker, to avoid the leader being interrupted too
frequently (while it might be sleeping for cost delay), a report is sent only
if the last report was made more than 1 second ago.

Having a purely time-based approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers are used, the smaller the leader's impact on
 the vacuum index phase duration/workload is (because the work is spread
 across more processes).

 1.2) the fewer parallel workers there are, the less the leader will be
 interrupted (fewer parallel workers would report their delayed time).

2. The cost limit should not come into play, as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play, as the leader could be interrupted
at the beginning, the middle, or any other part of the wait, and we are more
interested in the frequency of the interrupts.

4. A 1 second reporting "throttle" looks like a reasonable threshold as:

 4.1 the idea is to have a significant effect when the leader could otherwise
have been interrupted, say, hundreds or thousands of times per second.

 4.2 it does not make much sense for any tool to sample pg_stat_progress_vacuum
multiple times per second (so a one-second reporting granularity seems ok).

Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml | 13 
 src/backend/catalog/system_views.sql |  2 +-
 src/backend/commands/vacuum.c| 49 
 src/include/commands/progress.h  |  1 +
 src/test/regress/expected/rules.out  |  3 +-
 5 files changed, 66 insertions(+), 2 deletions(-)
  24.2% doc/src/sgml/
   4.3% src/backend/catalog/
  65.4% src/backend/commands/
   4.2% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..64b0604e04 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6380,6 +6380,19 @@ FROM pg_stat_get_backend_idset() AS backendid;
cleaning up indexes.
   
  
+
+ 
+  
+   time_delayed bigint
+  
+  
+   Total amount of time spent in milliseconds waiting due to vacuum_cost_delay
+   or autovacuum_vacuum_cost_delay. In case of parallel
+   vacuum the reported time is across all the workers and the leader. This
+   column is updated at a 1 Hz frequency (one time per second) so could show
+   slightly old values.
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7fd5d256a1..a40888ef2a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1218,7 +1218,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
 S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
-S.param10 AS indexes_processed
+S.param10 AS indexes_processed, S.param11 AS time_delayed
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7d8e9d2045..5bf2e37d3f 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -40,6 +40,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/clust

Re: psql: fix variable existence tab completion

2024-09-04 Thread Anton A. Melnikov



On 04.09.2024 23:26, Tom Lane wrote:
> Pushed.  I shall now watch the buildfarm from a safe distance.



Thanks! I'll be ready to fix possible failures.

With the best regards,  


--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Bertrand Drouvot
Hi,

On Wed, Sep 04, 2024 at 04:37:19PM +0200, Guillaume Lelarge wrote:
> Hi,
> 
> On Wed, 4 Sep 2024 at 16:18, Bertrand Drouvot
> wrote:
> > What about adding a comment instead of this extra check?
> >
> >
> Done too in v3.

Thanks!

1 ===

+   /*
+* Don't check counts.parallelnumscans because counts.numscans includes
+* counts.parallelnumscans
+*/

"." is missing at the end of the comment.

2 ===

-   if (t > tabentry->lastscan)
+   if (t > tabentry->lastscan && lstats->counts.numscans)

The extra check on lstats->counts.numscans is not needed as it's already done
a few lines before.

3 ===

+   if (t > tabentry->parallellastscan && 
lstats->counts.parallelnumscans)

This one makes sense.
 
And now I'm wondering if the extra comment added in v3 is really worth it (and
whether it doesn't sound confusing)? I mean, the parallel check is done once we
pass the initial test on counts.numscans. I think the code is clear enough
without this extra comment; thoughts?

4 ===

What about adding a few tests? or do you want to wait a bit more to see if "
there's an agreement on this patch" (as you stated at the start of this thread).
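
In case it helps kick that off, a sketch of the kind of check such a test
could do, forcing a parallel plan first (table name is illustrative; column
names per this patch):

SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;

CREATE TABLE pstats_test AS SELECT g AS id FROM generate_series(1, 1000) g;
SELECT count(*) FROM pstats_test;

-- Stats are reported asynchronously, so a real test needs a wait/retry loop.
SELECT seq_scan, parallel_seq_scan
FROM pg_stat_all_tables
WHERE relname = 'pstats_test';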

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-04 Thread Tatsuo Ishii
Hi,

> On Wed, 10 Jul 2024 at 21:36, Tatsuo Ishii  wrote:
>> v2-0005-Add-memory-disk-usage-for-Window-Aggregate-nodes-.patch: This
>> adds memory/disk usage for Window Aggregate nodes in EXPLAIN (ANALYZE)
>> command. Note that if David's proposal
>> https://www.postgresql.org/message-id/CAHoyFK9n-QCXKTUWT_xxtXninSMEv%2BgbJN66-y6prM3f4WkEHw%40mail.gmail.com
>> is committed, this will need to be adjusted.
> 
> Hi,
> 
> I pushed the changes to WindowAgg so as not to call tuplestore_end()
> on every partition.  Can you rebase this patch over that change?
> 
> It would be good to do this in a way that does not add any new state
> to WindowAggState, you can see that I had to shuffle fields around in
> that struct because the next_partition field would have caused the
> struct to become larger. I've not looked closely, but I expect this
> can be done by adding more code to tuplestore_updatemax() to also
> track the disk space used if the current storage has gone to disk. I
> expect the maxSpace field can be used for both, but we'd need another
> bool field to track if the max used was by disk or memory.
> 
> I think the performance of this would also need to be tested as it
> means doing an lseek() on every tuplestore_clear() when we've gone to
> disk. Probably that will be dominated by all the other overheads of a
> tuplestore going to disk (i.e. dumptuples() etc), but it would be good
> to check this. I suggest setting work_mem = 64 and making a test case
> that only just spills to disk. Maybe do a few thousand partitions
> worth of that and see if you can measure any slowdown.

Thank you for the suggestion. I will look into this.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Commit Timestamp and LSN Inversion issue

2024-09-04 Thread Amit Kapila
On Wed, Sep 4, 2024 at 6:35 PM Aleksander Alekseev
 wrote:
>
> > > I don't think you can rely on a system clock for conflict resolution.
> > > In a corner case a DBA can move the clock forward or backward between
> > > recordings of Ts1 and Ts2. On top of that there is no guarantee that
> > > 2+ servers have synchronised clocks. It seems to me that what you are
> > > proposing will just hide the problem instead of solving it in the
> > > general case.
> > >
> >
> > It is possible that we can't rely on the system clock for conflict
> > resolution but that is not the specific point of this thread. As
> > mentioned in the subject of this thread, the problem is "Commit
> > Timestamp and LSN Inversion issue". The LSN value and timestamp for a
> > commit are not generated atomically, so two different transactions can
> > have them in a different order.
>
> Hm. Then I'm having difficulty understanding why this is a
> problem

This is a potential problem pointed out during the discussion of CDR [1]
(please read the point starting from "How is this going to deal .."
and the response by Shveta). The point of this thread is that though it
appears to be a problem, practically there is no scenario where it
can have an impact, even when we implement a "last_write_wins" strategy
as explained in the initial email. If you or someone else sees a problem
due to the LSN<->timestamp inversion, then we need to explore a solution
for it.

>
> and why it was necessary to mention CDR in this context in the
> first place.
>
> OK, let's forget about CDR completely. Who is affected by the current
> behavior and why would it be beneficial changing it?
>

We can't forget CDR completely, as this could only be a potential
problem in that context. Right now, we don't have any built-in
resolution strategies, so this can't have an impact, but if this is a
problem then we need to have a solution for it before considering a
strategy like "last_write_wins".

Now, instead of discussing the LSN<->timestamp inversion issue, you
started to discuss the "last_write_wins" strategy itself, which we have
discussed to some extent in the thread [2]. BTW, we are planning to
start a separate thread as well, just to discuss the clock skew problem
w.r.t. resolution strategies like "last_write_wins". So we can discuss
clock skew in that thread and keep the focus of this thread on the
LSN<->timestamp inversion problem.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uBWBEveM8LO2b7wNZ47raZ9tVJw3D2_WXd8-b6LSqP6HA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAJpy0uD0-DpYVMtsxK5R%3DzszXauZBayQMAYET9sWr_w0CNWXxQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-04 Thread Tatsuo Ishii
Hi,

> Hi. I can roughly understand it.
> 
> I have one minor issue with the comment.
> 
> typedef struct RecursiveUnionState
> {
>     PlanState   ps;             /* its first field is NodeTag */
>     bool        recursing;
>     bool        intermediate_empty;
>     Tuplestorestate *working_table;
>     Tuplestorestate *intermediate_table;
>     int64       storageSize;    /* max storage size Tuplestore */
>     char       *storageType;    /* the storage type above */
> 
> }
> 
> "/* the storage type above */"
> is kind of ambiguous, since there is more than one Tuplestorestate.
> 
> I think it roughly means: the storage type of working_table
> at the time working_table reached its maximum storage.
> 
> 
> 
> typedef struct WindowAggState
> {
>     ScanState   ss;             /* its first field is NodeTag */
>
>     /* these fields are filled in by ExecInitExpr: */
>     List       *funcs;          /* all WindowFunc nodes in targetlist */
>     int         numfuncs;       /* total number of window functions */
>     int         numaggs;        /* number that are plain aggregates */
>
>     WindowStatePerFunc perfunc; /* per-window-function information */
>     WindowStatePerAgg peragg;   /* per-plain-aggregate information */
>     ExprState  *partEqfunction; /* equality funcs for partition columns */
>     ExprState  *ordEqfunction;  /* equality funcs for ordering columns */
>     Tuplestorestate *buffer;    /* stores rows of current partition */
>     int64       storageSize;    /* max storage size in buffer */
>     char       *storageType;    /* the storage type above */
> }
> 
> " /* the storage type above */"
> I think it roughly means:
> " the storage type of WindowAggState->buffer while the max storage of
> WindowAggState->buffer".

Thank you for looking into my patch. Unfortunately I need to work on
another issue before adjusting the comments, because the fields might go
away if I change the tuplestore infrastructure per David's suggestion:
https://www.postgresql.org/message-id/CAApHDvoY8cibGcicLV0fNh%3D9JVx9PANcWvhkdjBnDCc9Quqytg%40mail.gmail.com

After that I will rebase the patches; they require changes due to this commit:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=908a968612f9ed61911d8ca0a185b262b82f1269

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Add parallel columns for seq scan and index scan on pg_stat_all_tables and _indexes

2024-09-04 Thread Guillaume Lelarge
On Thu, 5 Sep 2024 at 07:36, Bertrand Drouvot
wrote:

> Hi,
>
> On Wed, Sep 04, 2024 at 04:37:19PM +0200, Guillaume Lelarge wrote:
> > Hi,
> >
> > On Wed, 4 Sep 2024 at 16:18, Bertrand Drouvot <
> bertranddrouvot...@gmail.com>
> > wrote:
> > > What about adding a comment instead of this extra check?
> > >
> > >
> > Done too in v3.
>
> Thanks!
>
> 1 ===
>
> +   /*
> +* Don't check counts.parallelnumscans because counts.numscans
> includes
> +* counts.parallelnumscans
> +*/
>
> "." is missing at the end of the comment.
>
>
Fixed in v4.


> 2 ===
>
> -   if (t > tabentry->lastscan)
> +   if (t > tabentry->lastscan && lstats->counts.numscans)
>
> The extra check on lstats->counts.numscans is not needed as it's already
> done
> a few lines before.
>
>
Fixed in v4.


> 3 ===
>
> +   if (t > tabentry->parallellastscan &&
> lstats->counts.parallelnumscans)
>
> This one makes sense.
>
> And now I'm wondering if the extra comment added in v3 is really worth it
> (and
> does not sound confusing)? I mean, the parallel check is done once we passe
> the initial test on counts.numscans. I think the code is clear enough
> without
> this extra comment, thoughts?
>
>
I'm not sure I understand you here. I kinda like the extra comment though.


> 4 ===
>
> What about adding a few tests? or do you want to wait a bit more to see if
> "
> there's an agreement on this patch" (as you stated at the start of this
> thread).
>
>
Guess I can start working on that now. It will take some time as I've never
done it before. Good thing I added the patch on the November commit fest :)

Thanks again.

Regards.


-- 
Guillaume.
From 6c92e70cd2698f24fe14069f675b7934e2f95bfe Mon Sep 17 00:00:00 2001
From: Guillaume Lelarge 
Date: Wed, 28 Aug 2024 21:35:30 +0200
Subject: [PATCH v4] Add parallel columns for pg_stat_all_tables,indexes

pg_stat_all_tables gets 4 new columns: parallel_seq_scan,
last_parallel_seq_scan, parallel_idx_scan, last_parallel_idx_scan.

pg_stat_all_indexes gets 2 new columns: parallel_idx_scan,
last_parallel_idx_scan.
---
 doc/src/sgml/monitoring.sgml | 69 ++--
 src/backend/access/brin/brin.c   |  2 +-
 src/backend/access/gin/ginscan.c |  2 +-
 src/backend/access/gist/gistget.c|  4 +-
 src/backend/access/hash/hashsearch.c |  2 +-
 src/backend/access/heap/heapam.c |  2 +-
 src/backend/access/nbtree/nbtsearch.c|  2 +-
 src/backend/access/spgist/spgscan.c  |  2 +-
 src/backend/catalog/system_views.sql |  6 ++
 src/backend/utils/activity/pgstat_relation.c |  8 +++
 src/backend/utils/adt/pgstatfuncs.c  |  6 ++
 src/include/catalog/pg_proc.dat  |  8 +++
 src/include/pgstat.h | 17 +++--
 13 files changed, 112 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07..6886094095 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3773,7 +3773,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
seq_scan bigint
   
   
-   Number of sequential scans initiated on this table
+   Number of sequential scans (including parallel ones) initiated on this table
   
  
 
@@ -3782,7 +3782,26 @@ description | Waiting for a newly initialized WAL file to reach durable storage
last_seq_scan timestamp with time zone
   
   
-   The time of the last sequential scan on this table, based on the
+   The time of the last sequential scan (including parallel ones) on this table, based on the
+   most recent transaction stop time
+  
+ 
+
+ 
+  
+   parallel_seq_scan bigint
+  
+  
+   Number of parallel sequential scans initiated on this table
+  
+ 
+
+ 
+  
+   last_parallel_seq_scan timestamp with time zone
+  
+  
+   The time of the last parallel sequential scan on this table, based on the
most recent transaction stop time
   
  
@@ -3801,7 +3820,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
idx_scan bigint
   
   
-   Number of index scans initiated on this table
+   Number of index scans (including parallel ones) initiated on this table
   
  
 
@@ -3810,7 +3829,26 @@ description | Waiting for a newly initialized WAL file to reach durable storage
last_idx_scan timestamp with time zone
   
   
-   The time of the last index scan on this table, based on the
+   The time of the last index scan (including parallel ones) on this table, based on the
+   most recent transaction stop time
+  
+ 
+
+ 
+  
+   parallel_idx_scan bigint
+  
+  
+   Number of parallel index scans initiated on this table
+  
+ 
+
+ 
+  
+   last_p

Re: In-placre persistance change of a relation

2024-09-04 Thread Kyotaro Horiguchi
Hello.

Thank you for the response.

At Sun, 1 Sep 2024 22:15:00 +0300, Heikki Linnakangas  wrote 
in 
> On 31/08/2024 19:09, Kyotaro Horiguchi wrote:
> > - UNDO log(0002): This handles file deletion during transaction aborts,
> >which was previously managed, in part, by the commit XLOG record at
> >the end of a transaction.
> > - Prevent orphan files after a crash (0005): This is another use-case
> >of the UNDO log system.
> 
> Nice, I'm very excited if we can fix that long-standing issue! I'll
> try to review this properly later, but at a quick 5 minute glance, one
> thing caught my eye:
> 
> This requires fsync()ing the per-xid undo log file every time a
> relation is created. I fear that can be a pretty big performance hit
> for workloads that repeatedly create and drop small tables. Especially

I initially thought that one additional file manipulation during file
creation wouldn't be an issue. However, the created storage file isn't
being synced, so your concern seems valid.

> if they're otherwise running with synchronous_commit=off. Instead of
> flushing the undo log file after every write, I'd suggest WAL-logging
> the undo log like regular relations and SLRUs. So before writing the
> entry to the undo log, WAL-log it. And with a little more effort, you
> could postpone creating the files altogether until a checkpoint
> happens, similar to how twophase state files are checkpointed
> nowadays.

I thought that an UNDO log file not flushed before the last checkpoint
might not survive a system crash. However, including UNDO files in the
checkpointing process resolves that concern. Thank you for the
suggestion.

> I wonder if the twophase state files and undo log files should be
> merged into one file. They're similar in many ways: there's one file
> per transaction, named using the XID. I haven't thought this fully
> through, just a thought..

Precisely, UNDO log files are created per subtransaction, unlike
twophase files. It might be possible if we allow the twophase files
(as they are currently named) to be overwritten or modified at every
subcommit. If ULOG contents are WAL-logged, these two things will
become even more similar. However, I'm not planning to include that in
the next version for now.

> > +static void
> > +undolog_set_filename(char *buf, TransactionId xid)
> > +{
> > +   snprintf(buf, MAXPGPATH, "%s/%08x", SIMPLE_UNDOLOG_DIR, xid);
> > +}
> 
> I'd suggest using FullTransactionId. Doesn't matter much, but seems
> like a good future-proofing.

Agreed. Will fix it in the next version.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Use streaming read API in ANALYZE

2024-09-04 Thread Mats Kindahl
On Thu, Sep 5, 2024 at 1:34 AM Thomas Munro  wrote:

> On Thu, Sep 5, 2024 at 3:36 AM Robert Haas  wrote:
> > On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro 
> wrote:
> > > Thanks for the explanation.  I think we should revert it.  IMHO it was
> > > a nice clean example of a streaming transformation, but unfortunately
> > > it transformed an API that nobody liked in the first place, and broke
> > > some weird and wonderful workarounds.  Let's try again in 18.
> >
> > The problem I have with this is that we just released RC1. I suppose
> > if we have to make this change it's better to do it sooner than later,
> > but are we sure we want to whack this around this close to final
> > release?
>
> I hear you.  But I definitely don't want to (and likely can't at this
> point) make any of the other proposed changes, and I also don't want
> to break Timescale.  That seems to leave only one option: go back to
> the v16 API for RC2, and hope that the ongoing table AM discussions
> for v18 (CF #4866) will fix all the problems for the people whose TAMs
> don't quack like a "heap", and the people whose TAMs do and who would
> not like to duplicate the code, and the people who want streaming I/O.
>

Forgive me for asking, but I am not entirely sure why the ReadStream struct
is opaque. The usual reasons are:

   - You want to provide an ABI to allow extensions to work with new major
   versions without re-compiling. Right now it is necessary to recompile
   extensions anyway, so this does not seem to apply. (Because there are a lot
   of other changes you need when switching versions due to the lack of a
   stable ABI for other parts of the code. However, it might be that the goal
   is to support it eventually, and then it would make sense to start making
   structs opaque.)
   - You want to ensure that you can make modifications *inside* a major
   version without breaking ABIs and requiring a re-compile. In this case, you
   could still follow the safe practice of adding new fields last, not relying
   on the size of the struct for anything (e.g., no arrays of these structures,
   just pointers to them), etc. However, if you want to be *very* safe and
   support very drastic changes inside a major version, it needs to be opaque,
   so this could be the reason.

Is it either of these reasons, or is there another reason?

Making the ReadStream API non-opaque (that is, moving the definition to the
header file) would at least solve our problem (unless I am mistaken).
However, I am ignorant of long-term plans which might affect this, so there
might be good reasons to revert it that I am not aware of.
-- 
Best wishes,
Mats Kindahl, Timescale

