Re: optimize the value of vacthresh and anlthresh

2024-11-06 Thread Frédéric Yhuel




On 11/4/24 09:30, wenhui qiu wrote:

Hi hackers,

A few days ago, I was looking at the SQL Server documentation and found 
that SQL Server optimized its statistics-update algorithm in the 2016 
version. I think we could learn from SQL Server's implementation to 
address the problem of autovacuum triggering too rarely on large tables. 
Documentation link: 
https://learn.microsoft.com/en-us/sql/relational-databases/statistics/statistics?view=sql-server-ver16

Following the SQL Server approach, I have submitted a patch; please see 
the attachment. I hope to get help from expert hackers.


A similar idea was discussed here:
https://www.postgresql.org/message-id/6a2ac9b7-6535-4bb1-8274-0647f7c31c82%40dalibo.com
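For reference, as far as I can tell from the linked page, SQL Server 2016 (compatibility level 130) replaced its fixed "500 + 20%" statistics-recompilation threshold with a sublinear one. Here is a sketch of my understanding (the exact formula is my reading of the docs, not something taken from the patch):

```python
import math

def sqlserver_recompile_threshold(n):
    """Rows that must change before statistics are recomputed,
    for a table of n rows (my reading of the SQL Server 2016 docs)."""
    # Old rule: 500 rows plus 20 % of the table.
    old = 500 + 0.20 * n
    # Dynamic rule (compat level >= 130): sqrt(1000 * n).
    # The smaller of the two applies.
    return min(old, math.sqrt(1000 * n))

print(round(sqlserver_recompile_threshold(1e3)))  # 700
print(round(sqlserver_recompile_threshold(1e9)))  # 1000000
```

For a billion-row table the threshold drops from 200 million changed rows to about one million, which is the kind of sublinear behavior discussed in this thread for autovacuum.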





doc: explain pgstatindex fragmentation

2024-11-05 Thread Frédéric Yhuel
Hi, I thought it would be nice to give the user a better idea of what 
avg_leaf_density and leaf_fragmentation mean.


Patch attached. What do you think?

From 5c82c207776fb8cde2357081b8579ba6db195c06 Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Tue, 5 Nov 2024 17:59:44 +0100
Subject: [PATCH] doc: explain pgstatindex fragmentation

It was quite hard to guess what leaf_fragmentation meant without looking
at pgstattuple's code. This patch aims to give the user a better
idea of what it means.
---
 doc/src/sgml/pgstattuple.sgml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/doc/src/sgml/pgstattuple.sgml b/doc/src/sgml/pgstattuple.sgml
index 4071da4ed9..8908b56663 100644
--- a/doc/src/sgml/pgstattuple.sgml
+++ b/doc/src/sgml/pgstattuple.sgml
@@ -277,6 +277,14 @@ leaf_fragmentation | 0
  page-by-page, and should not be expected to represent an
  instantaneous snapshot of the whole index.
 
+
+
+ avg_leaf_density can be seen as the inverse of bloat,
+ while leaf_fragmentation represents a measure of disorder.
+ The higher leaf_fragmentation is, the less the physical
+ order of the index leaf pages corresponds to the logical order we would have
+ just after a REINDEX.
+
 

 
-- 
2.39.5
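To illustrate the "measure of disorder" wording, here is a rough Python model of how leaf_fragmentation is computed, based on my reading of pgstatindex.c (a leaf page counts as fragmented when its right sibling, in logical order, sits at a lower physical block number); treat it as a sketch, not the actual implementation:

```python
def leaf_fragmentation(leaf_chain):
    """leaf_chain: block numbers of the leaf pages, in logical (key) order.

    A page counts as fragmented when the next leaf in logical order is
    stored at a lower physical block number, as happens after
    out-of-order page splits.
    """
    if not leaf_chain:
        return 0.0
    fragments = sum(
        1
        for blkno, next_blkno in zip(leaf_chain, leaf_chain[1:])
        if next_blkno < blkno
    )
    return 100.0 * fragments / len(leaf_chain)

# Just after a REINDEX, logical and physical order coincide:
print(leaf_fragmentation([1, 2, 3, 4]))  # 0.0
# Page splits can make the logical chain jump backwards:
print(leaf_fragmentation([1, 4, 2, 3]))  # 25.0
```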



Re: Allowing parallel-safe initplans

2024-10-02 Thread Frédéric Yhuel




On 12/04/2023 at 20:06, Robert Haas wrote:

There's only one existing test case that visibly changes plan with
these changes.  The new plan is clearly saner-looking than before,
and testing with some data loaded into the table confirms that it
is faster.  I'm not sure if it's worth devising more test cases.

It seems like it would be nice to see one or two additional scenarios
where these changes bring a benefit, with different kinds of plan
shapes.


Hi,

While working on illustrating some points in the v17 release notes, I'm 
trying to come up with a more compelling scenario than the test case, 
but it seems that with a non-trivial InitPlan (second EXPLAIN below), we 
still get a non-parallel Append node at the top:


SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 10;

CREATE TABLE foo (a int) PARTITION by LIST(a);

CREATE TABLE foo_0 PARTITION OF foo FOR VALUES IN (0);

CREATE TABLE foo_1 PARTITION OF foo FOR VALUES IN (1);

EXPLAIN (COSTS OFF)
SELECT * FROM foo WHERE a = (SELECT 2)
UNION ALL
SELECT * FROM foo WHERE a = 0;

 QUERY PLAN
---------------------------------------------
 Gather
   Workers Planned: 2
   ->  Parallel Append
 ->  Parallel Append
   InitPlan 1
 ->  Result
   ->  Parallel Seq Scan on foo_0 foo_1
 Filter: (a = (InitPlan 1).col1)
   ->  Parallel Seq Scan on foo_1 foo_2
 Filter: (a = (InitPlan 1).col1)
 ->  Parallel Seq Scan on foo_0 foo_3
   Filter: (a = 0)


EXPLAIN (COSTS OFF)
SELECT * FROM foo WHERE a = (SELECT max(a) FROM foo)
UNION ALL
SELECT * FROM foo WHERE a = 0;

   QUERY PLAN

 Append
   ->  Gather
 Workers Planned: 2
 InitPlan 1
   ->  Finalize Aggregate
 ->  Gather
   Workers Planned: 2
   ->  Partial Aggregate
 ->  Parallel Append
   ->  Parallel Seq Scan on foo_0 foo_5
   ->  Parallel Seq Scan on foo_1 foo_6
 ->  Parallel Append
   ->  Parallel Seq Scan on foo_0 foo_1
 Filter: (a = (InitPlan 1).col1)
   ->  Parallel Seq Scan on foo_1 foo_2
 Filter: (a = (InitPlan 1).col1)
   ->  Gather
 Workers Planned: 1
 ->  Parallel Seq Scan on foo_0 foo_3
   Filter: (a = 0)

Did I miss something?

Best regards,
Frédéric




Re: pgstattuple: fix free space calculation

2024-09-09 Thread Frédéric Yhuel



On 9/7/24 22:45, Tom Lane wrote:

I wrote:

Now alternatively you could argue that a "new" page isn't usable free
space yet and so we should count it as zero, just as we don't count
dead tuples as usable free space.  You need VACUUM to turn either of
those things into real free space.  But that'd be a bigger definitional
change, and I'm not sure we want it.  Thoughts?


On the third hand: the code in question is in statapprox_heap, which
is presumably meant to deliver numbers comparable to pgstat_heap.
And pgstat_heap takes no special care for "new" pages, it just applies
PageGetHeapFreeSpace (or PageGetExactFreeSpace after this patch).
So that leaves me feeling pretty strongly that this whole stanza
is wrong and we should just do PageGetExactFreeSpace here.



+1

v4 patch attached.

Best regards,
Frédéric

From a59c16d33ff37ce5c9d0e663809be190c608c75b Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Wed, 21 Aug 2024 15:05:11 +0200
Subject: [PATCH v4] pgstattuple: use PageGetExactFreeSpace()

instead of PageGetHeapFreeSpace() or PageGetFreeSpace()

We want the exact free space, we don't care if there's enough room for
another line pointer.

Also, statapprox_heap() shouldn't have to take any special care with new
pages.
---
 contrib/pgstattuple/pgstatapprox.c | 9 +--------
 contrib/pgstattuple/pgstatindex.c  | 2 +-
 contrib/pgstattuple/pgstattuple.c  | 6 +++---
 3 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index c84c642355..04457f4b79 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -106,14 +106,7 @@ statapprox_heap(Relation rel, output_type *stat)
 
 		page = BufferGetPage(buf);
 
-		/*
-		 * It's not safe to call PageGetHeapFreeSpace() on new pages, so we
-		 * treat them as being free space for our purposes.
-		 */
-		if (!PageIsNew(page))
-			stat->free_space += PageGetHeapFreeSpace(page);
-		else
-			stat->free_space += BLCKSZ - SizeOfPageHeaderData;
+		stat->free_space += PageGetExactFreeSpace(page);
 
 		/* We may count the page as scanned even if it's new/empty */
 		scanned++;
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 5c06ba6db4..1b6b768cf8 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -311,7 +311,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 
 			max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
 			indexStat.max_avail += max_avail;
-			indexStat.free_space += PageGetFreeSpace(page);
+			indexStat.free_space += PageGetExactFreeSpace(page);
 
 			indexStat.leaf_pages++;
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..7e2a7262a3 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -372,7 +372,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 		RBM_NORMAL, hscan->rs_strategy);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+			stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 			UnlockReleaseBuffer(buffer);
 			block++;
 		}
@@ -385,7 +385,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 	RBM_NORMAL, hscan->rs_strategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
-		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+		stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
 		block++;
 	}
@@ -565,7 +565,7 @@ pgstat_index_page(pgstattuple_type *stat, Page page,
 {
 	OffsetNumber i;
 
-	stat->free_space += PageGetFreeSpace(page);
+	stat->free_space += PageGetExactFreeSpace(page);
 
 	for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
 	{
-- 
2.39.2



Re: pgstattuple: fix free space calculation

2024-09-09 Thread Frédéric Yhuel

Hi Tom, thanks for your review.

On 9/7/24 22:10, Tom Lane wrote:

I looked at this patch.  I agree with making the change.  However,
I don't agree with the CF entry's marking of "target version: stable"
(i.e., requesting back-patch).  I think this falls somewhere in the
gray area between a bug fix and a definitional change.  Also, people
are unlikely to be happy if they suddenly get new, not-comparable
numbers after a minor version update.  So I think we should just fix
it in HEAD.



OK, I've made the change.


As far as the patch itself goes, the one thing that is bothering me
is this comment change

  /*
- * It's not safe to call PageGetHeapFreeSpace() on new pages, so we
+ * It's not safe to call PageGetExactFreeSpace() on new pages, so we
   * treat them as being free space for our purposes.
   */

which looks like it wasn't made with a great deal of thought.
Now it seems to me that the comment was already bogus when written:
there isn't anything uncertain about what will happen if you call
either of these functions on a "new" page.  PageIsNew checks for

 return ((PageHeader) page)->pd_upper == 0;

If pd_upper is 0, PageGet[Exact]FreeSpace is absolutely guaranteed
to return zero, even if pd_lower contains garbage.  And then


Indeed. I failed to notice that LocationIndex is an unsigned int, so I 
thought that pd_upper - pd_lower could be positive with garbage in pd_lower.



PageGetHeapFreeSpace will likewise return zero.  Perhaps there
could be trouble if we got into the line-pointer-checking part
of PageGetHeapFreeSpace, but we can't.  So this comment is wrong,
and is even more obviously wrong after the above change.  I thought
for a moment about removing the PageIsNew test altogether, but
then I decided that it probably *is* what we want and is just
mis-explained.  I think the comment should read more like

 /*
  * PageGetExactFreeSpace() will return zero for a "new" page,
  * but it's actually usable free space, so count it that way.
  */

Now alternatively you could argue that a "new" page isn't usable free
space yet and so we should count it as zero, just as we don't count
dead tuples as usable free space.  You need VACUUM to turn either of
those things into real free space.  But that'd be a bigger definitional
change, and I'm not sure we want it.  Thoughts?

Also, do we need any documentation change for this?  I looked through
https://www.postgresql.org/docs/devel/pgstattuple.html
and didn't see anything that was being very specific about what
"free space" means, so maybe it's fine as-is.


It's not easy. Maybe something like this?

"For any initialized page, free space refers to anything that isn't page 
metadata (header and special), a line pointer, or a tuple pointed to by a 
valid line pointer. In particular, a dead tuple is not free space, 
because there's still a valid line pointer pointing to it, until 
VACUUM or some other maintenance mechanism (e.g. page pruning) cleans up 
the page. A dead line pointer is not free space either, but the tuple it 
points to has become free space. An unused line pointer could be 
considered free space, but pgstattuple doesn't take it into account."





Re: pgstattuple: fix free space calculation

2024-08-29 Thread Frédéric Yhuel



On 8/23/24 12:51, Frédéric Yhuel wrote:



On 8/23/24 12:02, Rafia Sabih wrote:
On the other hand, this got me thinking about the purpose of this 
space information.
If we want to understand that there's still some space for the tuples 
in a page, then using PageGetExactFreeSpace is not doing justice in 
case of heap page, because we will not be able to add any more tuples 
there if there are already MaxHeapTuplesPerPage tuples there.


We won't be able to add, but we will be able to update a tuple in this 
page.


Sorry, that's not true.

So in this marginal case we have free space that's unusable in practice. 
No INSERT or UPDATE (HOT or not) is possible inside the page.


I don't know what pgstattuple should do in this case.

However, we should never encounter this case in practice (maybe on some 
exotic architectures with strange alignment behavior?). As I said, I 
can't fit more than 226 tuples per page on my machine, while 
MaxHeapTuplesPerPage is 291. Am I missing something?


Besides, pgstattuple isn't mission critical, is it?

So I think we should just use PageGetExactFreeSpace().

Here is a v3 patch. It's the same as v2, I only removed the last 
paragraph in the commit message.


Thank you Rafia and Andreas for your review and test.

Best regards,
Frédéric

From f749d4ca2d6881a916c6ca2a3b882bb2740188d4 Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Wed, 21 Aug 2024 15:05:11 +0200
Subject: [PATCH v3] pgstattuple: use PageGetExactFreeSpace()

instead of PageGetHeapFreeSpace() or PageGetFreeSpace()

We want the exact free space, we don't care if there's enough room for
another line pointer.
---
 contrib/pgstattuple/pgstatapprox.c | 4 ++--
 contrib/pgstattuple/pgstatindex.c  | 2 +-
 contrib/pgstattuple/pgstattuple.c  | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index c84c642355..b63a9932d7 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -107,11 +107,11 @@ statapprox_heap(Relation rel, output_type *stat)
 		page = BufferGetPage(buf);
 
 		/*
-		 * It's not safe to call PageGetHeapFreeSpace() on new pages, so we
+		 * It's not safe to call PageGetExactFreeSpace() on new pages, so we
 		 * treat them as being free space for our purposes.
 		 */
 		if (!PageIsNew(page))
-			stat->free_space += PageGetHeapFreeSpace(page);
+			stat->free_space += PageGetExactFreeSpace(page);
 		else
 			stat->free_space += BLCKSZ - SizeOfPageHeaderData;
 
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 5c06ba6db4..1b6b768cf8 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -311,7 +311,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 
 			max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
 			indexStat.max_avail += max_avail;
-			indexStat.free_space += PageGetFreeSpace(page);
+			indexStat.free_space += PageGetExactFreeSpace(page);
 
 			indexStat.leaf_pages++;
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..7e2a7262a3 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -372,7 +372,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 		RBM_NORMAL, hscan->rs_strategy);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+			stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 			UnlockReleaseBuffer(buffer);
 			block++;
 		}
@@ -385,7 +385,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 	RBM_NORMAL, hscan->rs_strategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
-		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+		stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
 		block++;
 	}
@@ -565,7 +565,7 @@ pgstat_index_page(pgstattuple_type *stat, Page page,
 {
 	OffsetNumber i;
 
-	stat->free_space += PageGetFreeSpace(page);
+	stat->free_space += PageGetExactFreeSpace(page);
 
 	for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
 	{
-- 
2.39.2



Re: pgstattuple: fix free space calculation

2024-08-23 Thread Frédéric Yhuel




On 8/23/24 12:02, Rafia Sabih wrote:
On the other hand, this got me thinking about the purpose of this space 
information.
If we want to understand that there's still some space for the tuples in 
a page, then using PageGetExactFreeSpace is not doing justice in case of 
heap page, because we will not be able to add any more tuples there if 
there are already MaxHeapTuplesPerPage tuples there.


We won't be able to add, but we will be able to update a tuple in this 
page. It's hard to test, because I can't fit more than 226 tuples on a 
single page, while MaxHeapTuplesPerPage = 291 on my machine :-)
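For what it's worth, the gap between those two numbers can be reproduced with a bit of arithmetic. This is a sketch assuming the default 8 kB block size and 64-bit MAXALIGN; the 32-byte minimal-tuple size is my hypothesis to match the observed 226 figure:

```python
BLCKSZ = 8192
SIZE_OF_PAGE_HEADER = 24  # SizeOfPageHeaderData
LINE_POINTER = 4          # sizeof(ItemIdData)
TUPLE_HEADER = 24         # MAXALIGN(SizeofHeapTupleHeader)

# MaxHeapTuplesPerPage assumes tuples carrying no data at all:
max_heap_tuples_per_page = (BLCKSZ - SIZE_OF_PAGE_HEADER) // (
    TUPLE_HEADER + LINE_POINTER
)
print(max_heap_tuples_per_page)  # 291

# A tuple with at least one attribute seems to occupy at least 32 bytes
# once MAXALIGN'd, which would explain the 226 limit seen in practice:
observed_limit = (BLCKSZ - SIZE_OF_PAGE_HEADER) // (32 + LINE_POINTER)
print(observed_limit)  # 226
```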


In any case, IMVHO, pgstattuple shouldn't answer the question "can I 
add more tuples?". It exists for educational, introspection, or 
debugging purposes, and we want the exact amount of free space.


Best regards,
Frédéric




Re: pgstattuple: fix free space calculation

2024-08-23 Thread Frédéric Yhuel



On 8/22/24 21:56, Rafia Sabih wrote:

I agree with the approach here.
A minor comment here is to change the comments in code referring to the 
PageGetHeapFreeSpace.


Thank you Rafia. Here is a v2 patch.

I've also added this to the commit message:

Also, PageGetHeapFreeSpace() will return zero if there are already 
MaxHeapTuplesPerPage line pointers in the page and none are free. We 
don't want that either, because here we want to keep track of the free 
space after a page pruning operation even in the (very unlikely) case 
that there are MaxHeapTuplesPerPage line pointers in the page.

From c4d0569cd3dbcec7f84a794e927cfc81e137ff6a Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Wed, 21 Aug 2024 15:05:11 +0200
Subject: [PATCH v2] pgstattuple: use PageGetExactFreeSpace()

instead of PageGetHeapFreeSpace() or PageGetFreeSpace()

We want the exact free space, we don't care if there's enough room for
another line pointer.

Also, PageGetHeapFreeSpace() will return zero if there are already
MaxHeapTuplesPerPage line pointers in the page and none are free. We
don't want that either, because here we want to keep track of the free
space after a page pruning operation even in the (very unlikely) case
that there are MaxHeapTuplesPerPage line pointers in the page.
---
 contrib/pgstattuple/pgstatapprox.c | 4 ++--
 contrib/pgstattuple/pgstatindex.c  | 2 +-
 contrib/pgstattuple/pgstattuple.c  | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index c84c642355..b63a9932d7 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -107,11 +107,11 @@ statapprox_heap(Relation rel, output_type *stat)
 		page = BufferGetPage(buf);
 
 		/*
-		 * It's not safe to call PageGetHeapFreeSpace() on new pages, so we
+		 * It's not safe to call PageGetExactFreeSpace() on new pages, so we
 		 * treat them as being free space for our purposes.
 		 */
 		if (!PageIsNew(page))
-			stat->free_space += PageGetHeapFreeSpace(page);
+			stat->free_space += PageGetExactFreeSpace(page);
 		else
 			stat->free_space += BLCKSZ - SizeOfPageHeaderData;
 
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 5c06ba6db4..1b6b768cf8 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -311,7 +311,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 
 			max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
 			indexStat.max_avail += max_avail;
-			indexStat.free_space += PageGetFreeSpace(page);
+			indexStat.free_space += PageGetExactFreeSpace(page);
 
 			indexStat.leaf_pages++;
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..7e2a7262a3 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -372,7 +372,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 		RBM_NORMAL, hscan->rs_strategy);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+			stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 			UnlockReleaseBuffer(buffer);
 			block++;
 		}
@@ -385,7 +385,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 	RBM_NORMAL, hscan->rs_strategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
-		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+		stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
 		block++;
 	}
@@ -565,7 +565,7 @@ pgstat_index_page(pgstattuple_type *stat, Page page,
 {
 	OffsetNumber i;
 
-	stat->free_space += PageGetFreeSpace(page);
+	stat->free_space += PageGetExactFreeSpace(page);
 
 	for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
 	{
-- 
2.39.2



pgstattuple: fix free space calculation

2024-08-22 Thread Frédéric Yhuel

Hello,

I think that pgstattuple should use PageGetExactFreeSpace() instead of 
PageGetHeapFreeSpace() or PageGetFreeSpace(). The latter two compute the 
free space minus the space of a line pointer. They are used like this in 
the rest of the code (heapam.c):


pagefree = PageGetHeapFreeSpace(page);

if (newtupsize > pagefree) { we need another page for the tuple }

... so it makes sense to take the line pointer into account in this context.

But in the pgstattuple context, I think we want the exact free space.
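A small model of the difference, based on my reading of bufpage.c (pd_upper and pd_lower delimit the hole in the middle of the page); treat this as a sketch, not the actual implementation:

```python
SIZEOF_ITEM_ID_DATA = 4  # one line pointer

def page_get_exact_free_space(pd_upper, pd_lower):
    # The exact hole between the line pointer array and the tuple data.
    space = pd_upper - pd_lower
    return space if space > 0 else 0

def page_get_free_space(pd_upper, pd_lower):
    # The same hole, minus room for one more line pointer;
    # zero if even that doesn't fit.
    space = pd_upper - pd_lower
    if space < SIZEOF_ITEM_ID_DATA:
        return 0
    return space - SIZEOF_ITEM_ID_DATA

# A page with a 100-byte hole:
print(page_get_exact_free_space(8000, 7900))  # 100
print(page_get_free_space(8000, 7900))        # 96
```

For pgstattuple's purposes the 4-byte difference per page is noise, but subtracting it systematically misrepresents what "free space" means.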

I have attached a patch.

Best regards,
Frédéric

From 350abec004fe472922800135d5d94ca3d8212da4 Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Wed, 21 Aug 2024 15:05:11 +0200
Subject: [PATCH] pgstattuple: use PageGetExactFreeSpace()

instead of PageGetHeapFreeSpace() and PageGetFreeSpace()

We want the exact free space, we don't care if there's enough room for
another line pointer.
---
 contrib/pgstattuple/pgstatapprox.c | 2 +-
 contrib/pgstattuple/pgstatindex.c  | 2 +-
 contrib/pgstattuple/pgstattuple.c  | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index c84c642355..a27dea621e 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -111,7 +111,7 @@ statapprox_heap(Relation rel, output_type *stat)
 		 * treat them as being free space for our purposes.
 		 */
 		if (!PageIsNew(page))
-			stat->free_space += PageGetHeapFreeSpace(page);
+			stat->free_space += PageGetExactFreeSpace(page);
 		else
 			stat->free_space += BLCKSZ - SizeOfPageHeaderData;
 
diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c
index 5c06ba6db4..1b6b768cf8 100644
--- a/contrib/pgstattuple/pgstatindex.c
+++ b/contrib/pgstattuple/pgstatindex.c
@@ -311,7 +311,7 @@ pgstatindex_impl(Relation rel, FunctionCallInfo fcinfo)
 
 			max_avail = BLCKSZ - (BLCKSZ - ((PageHeader) page)->pd_special + SizeOfPageHeaderData);
 			indexStat.max_avail += max_avail;
-			indexStat.free_space += PageGetFreeSpace(page);
+			indexStat.free_space += PageGetExactFreeSpace(page);
 
 			indexStat.leaf_pages++;
 
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 3bd8b96197..7e2a7262a3 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -372,7 +372,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 		RBM_NORMAL, hscan->rs_strategy);
 			LockBuffer(buffer, BUFFER_LOCK_SHARE);
-			stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+			stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 			UnlockReleaseBuffer(buffer);
 			block++;
 		}
@@ -385,7 +385,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, block,
 	RBM_NORMAL, hscan->rs_strategy);
 		LockBuffer(buffer, BUFFER_LOCK_SHARE);
-		stat.free_space += PageGetHeapFreeSpace((Page) BufferGetPage(buffer));
+		stat.free_space += PageGetExactFreeSpace((Page) BufferGetPage(buffer));
 		UnlockReleaseBuffer(buffer);
 		block++;
 	}
@@ -565,7 +565,7 @@ pgstat_index_page(pgstattuple_type *stat, Page page,
 {
 	OffsetNumber i;
 
-	stat->free_space += PageGetFreeSpace(page);
+	stat->free_space += PageGetExactFreeSpace(page);
 
 	for (i = minoff; i <= maxoff; i = OffsetNumberNext(i))
 	{
-- 
2.39.2



Re: New GUC autovacuum_max_threshold ?

2024-08-12 Thread Frédéric Yhuel




On 8/7/24 23:39, Nathan Bossart wrote:

I've attached a new patch to show roughly what I think this new GUC should
look like.  I'm hoping this sparks more discussion, if nothing else.



Thank you. FWIW, I would prefer a sub-linear growth, so maybe something 
like this:


vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples, 
vac_base_thresh + vac_scale_factor * pow(reltuples, 0.7) * 100);


This would give:

* 386M (instead of 5.1 billion currently) for a 25.6 billion tuple table;
* 77M for a 2.56 billion tuple table (Robert's example);
* 15M (instead of 51M currently) for a 256M tuple table;
* 3M (instead of 5M currently) for a 25.6M tuple table.
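For the record, the figures above can be reproduced with a quick sketch, assuming the default vacuum base threshold of 50 and scale factor of 0.2:

```python
def vacthresh(reltuples, base_thresh=50, scale_factor=0.2):
    # Current linear rule: base + scale_factor * reltuples.
    linear = base_thresh + scale_factor * reltuples
    # Proposed sublinear alternative, taking over on large tables.
    sublinear = base_thresh + scale_factor * reltuples**0.7 * 100
    return min(linear, sublinear)

for n in (25.6e9, 2.56e9, 256e6, 25.6e6):
    print(f"{n:>14,.0f} tuples -> threshold {vacthresh(n):,.0f}")
```

which gives roughly 386M, 77M, 15M, and 3M dead tuples respectively, while small tables keep the current linear behavior.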

The other advantage is that you don't need another GUC.


On Tue, Jun 18, 2024 at 12:36:42PM +0200, Frédéric Yhuel wrote:

By the way, I wonder if there were any off-list discussions after Robert's
conference at PGConf.dev (and I'm waiting for the video of the conf).


I don't recall any discussions about this idea, but Robert did briefly
mention it in his talk [0].

[0] https://www.youtube.com/watch?v=RfTD-Twpvac



Very interesting, thanks!




Re: New GUC autovacuum_max_threshold ?

2024-06-18 Thread Frédéric Yhuel




On 18/06/2024 at 05:06, Nathan Bossart wrote:

I didn't see a commitfest entry for this, so I created one to make sure we
don't lose track of this:

https://commitfest.postgresql.org/48/5046/



OK thanks!

By the way, I wonder if there were any off-list discussions after 
Robert's conference at PGConf.dev (and I'm waiting for the video of the 
conf).





Re: New GUC autovacuum_max_threshold ?

2024-05-13 Thread Frédéric Yhuel




On 09/05/2024 at 16:58, Robert Haas wrote:

As I see it, a lot of the lack of agreement up until now is people
just not understanding the math. Since I think I've got the right idea
about the math, I attribute this to other people being confused about
what is going to happen and would tend to phrase it as: some people
don't understand how catastrophically bad it will be if you set this
value too low.


FWIW, I do agree with your math. I found your demonstration convincing. 
The value of 50 was just a rough guess.


Using the formula I suggested earlier:

vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples, 
vac_base_thresh + vac_scale_factor * sqrt(reltuples) * 1000);


your table of 2.56 billion tuples will be vacuumed if there are
more than 10 million dead tuples (every 28 minutes).
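A quick sketch of this variant (same default base threshold of 50 and scale factor of 0.2) confirms the 10-million figure for Robert's table:

```python
import math

def vacthresh_sqrt(reltuples, base_thresh=50, scale_factor=0.2):
    # Current linear rule, and a sqrt-based cap for large tables.
    linear = base_thresh + scale_factor * reltuples
    capped = base_thresh + scale_factor * math.sqrt(reltuples) * 1000
    return min(linear, capped)

# Robert's 2.56 billion tuple table:
print(f"{vacthresh_sqrt(2.56e9):,.0f}")  # about 10.1 million dead tuples
```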

If we want to stick with the simple formula, we should probably choose a 
very high default, maybe 100 million, as you suggested earlier.


However, it would be nice to have the visibility map updated more 
frequently than every 100 million dead tuples. I wonder if this could be 
decoupled from the vacuum process?





Re: New GUC autovacuum_max_threshold ?

2024-05-01 Thread Frédéric Yhuel




On 01/05/2024 at 20:50, Robert Haas wrote:

Possibly what we need here is
something other than a cap, where, say, we vacuum a 10GB table twice
as often as now, a 100GB table four times as often, and a 1TB table
eight times as often. Or whatever the right answer is.


IMO, it would make more sense. So maybe something like this:

vacthresh = Min(vac_base_thresh + vac_scale_factor * reltuples, 
vac_base_thresh + vac_scale_factor * sqrt(reltuples) * 1000);


(it could work to compute a score, too, like in David's proposal)





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Frédéric Yhuel




On 25/04/2024 at 22:21, Robert Haas wrote:

The analyze case, I feel, is really murky.
autovacuum_analyze_scale_factor stands for the proposition that as the
table becomes larger, analyze doesn't need to be done as often. If
what you're concerned about is the frequency estimates, that's true:
an injection of a million new rows can shift frequencies dramatically
in a small table, but the effect is blunted in a large one. But a lot
of the cases I've seen have involved the histogram boundaries. If
you're inserting data into a table in increasing order, every new
million rows shifts the boundary of the last histogram bucket by the
same amount. You either need those rows included in the histogram to
get good query plans, or you don't. If you do, the frequency with
which you need to analyze does not change as the table grows. If you
don't, then it probably does. But the answer doesn't really depend on
how big the table is already, but on your workload. So it's unclear to
me that the proposed parameter is the right idea here at all. It's
also unclear to me that the existing system is the right idea. 🙂


This is very interesting. And what about ndistinct? I believe it could 
be problematic, too, in some (admittedly rare or pathological) cases.


For example, suppose that the actual number of distinct values grows 
from 1000 to 20 after a batch of insertions, for a particular 
column. OK, in such a case, the default analyze sampling isn't large 
enough to compute a ndistinct close enough to reality anyway. But 
without any analyze at all, it can lead to very bad planning - think of 
a Nested Loop with a parallel seq scan for the outer table instead of a 
simple efficient index scan, because the index scan of the inner table 
is overestimated (each index scan's cost and number of rows returned).





Re: New GUC autovacuum_max_threshold ?

2024-04-26 Thread Frédéric Yhuel




On 26/04/2024 at 04:24, Laurenz Albe wrote:

On Thu, 2024-04-25 at 14:33 -0400, Robert Haas wrote:

I believe that the underlying problem here can be summarized in this
way: just because I'm OK with 2MB of bloat in my 10MB table doesn't
mean that I'm OK with 2TB of bloat in my 10TB table. One reason for
this is simply that I can afford to waste 2MB much more easily than I
can afford to waste 2TB -- and that applies both on disk and in
memory.


I don't find that convincing.  Why are 2TB of wasted space in a 10TB
table worse than 2TB of wasted space in 100 tables of 100GB each?



Good point, but another way of summarizing the problem would be that the 
autovacuum_*_scale_factor parameters work well as long as we have a more 
or less evenly distributed access pattern in the table.


Suppose my very large table gets updated only for its 1% most recent 
rows. We probably want to decrease autovacuum_analyze_scale_factor and 
autovacuum_vacuum_scale_factor for this one.


Partitioning would be a good solution, but IMHO postgres should be able 
to handle this case anyway, ideally without per-table configuration.





Re: New GUC autovacuum_max_threshold ?

2024-04-25 Thread Frédéric Yhuel




On 25/04/2024 at 18:51, Melanie Plageman wrote:

I'm not too sure I understand. What are the reasons it might by skipped?
I can think of a concurrent index creation on the same table, or
anything holding a SHARE UPDATE EXCLUSIVE lock or above. Is this the
sort of thing you are talking about?

No, I was thinking more literally that, if reltuples (assuming
reltuples is modified/inserted tuples) > autovacuum_max_threshold, I
would expect the table to be vacuumed. However, with your formula,
that wouldn't necessarily be true.

I think there are values of reltuples and autovacuum_max_threshold at
which reltuples > autovacuum_max_threshold but reltuples <=
vac_base_thresh + vac_scale_factor * reltuples / (1 + vac_scale_factor
* reltuples / autovacuum_max_threshold)

I tried to reduce the formula to come up with a precise definition of
the range of values for which this is true, however I wasn't able to
reduce it to something nice.

Here is just an example of a case:

vac_base_thresh = 2000
vac_scale_factor = 0.9
reltuples = 3200
autovacuum_max_threshold = 2500

total_thresh = vac_base_thresh + vac_scale_factor * reltuples / (1 +
vac_scale_factor * reltuples / autovacuum_max_threshold)

total_thresh: 3338. dead tuples: 3200. autovacuum_max_threshold: 2500

so there are more dead tuples than the max threshold, so it should
trigger a vacuum, but it doesn't because the total calculated
threshold is higher than the number of dead tuples.



OK, thank you! I got it.


This of course may not be a realistic scenario in practice. It works
best the closer scale factor is to 1 (wish I had derived the formula
successfully) and when autovacuum_max_threshold > 2 * vac_base_thresh.
So, maybe it is not an issue.


I haven't thought much about this yet. I hope we can avoid such an 
extreme scenario by imposing some kind of constraint on this parameter, 
in relation to the others.


Anyway, with Nathan and Robert upvoting the simpler formula, this will 
probably become irrelevant anyway :-)





Re: New GUC autovacuum_max_threshold ?

2024-04-25 Thread Frédéric Yhuel




On 25/04/2024 at 21:21, Nathan Bossart wrote:

On Thu, Apr 25, 2024 at 02:33:05PM -0400, Robert Haas wrote:

What does surprise me is that Frédéric suggests a default value of
500,000. If half a million tuples (proposed default) is 20% of your
table (default value of autovacuum_vacuum_scale_factor) then your
table has 2.5 million tuples. Unless those tuples are very wide, that
table isn't even 1GB in size. I'm not aware that there's any problem
at all with the current formula on a table of that size, or even ten
times that size. I think you need to have tables that are hundreds of
gigabytes in size at least before this starts to become a serious
problem. Looking at this from another angle, in existing releases, the
maximum usable amount of autovacuum_work_mem is 1GB, which means we
can store one-sixth of a billion dead TIDs, or roughly 166 million.
And that limit has been a source of occasional complaints for years.
So we have those complaints on the one hand, suggesting that 166
million is not enough, and then we have this proposal, saying that
more than half a million is too much. That's really strange; my
initial hunch is that the value should be 100-500x higher than what
Frédéric proposed.


Agreed, the default should probably be on the order of 100-200M minimum.



I'm not sure... 500,000 comes from the table given in a previous message. 
It may not be large enough. But vacuum also updates the visibility map, 
and a few hundred thousand heap fetches can already hurt the performance 
of an index-only scan, even if most of the blocks are read from cache.



The original proposal also seems to introduce one parameter that would
affect all three of autovacuum_vacuum_threshold,
autovacuum_vacuum_insert_threshold, and autovacuum_analyze_threshold.  Is
that okay?  Or do we need to introduce a "limit" GUC for each?  I guess the
question is whether we anticipate any need to have different values for
these limits, which might be unlikely.



I agree with you, it seems unlikely. This is also an answer to Melanie's 
question about the name of the GUC : I deliberately left out the other 
"vacuum" because I thought we only needed one parameter for these three 
thresholds.


Now I have just read Robert's new message, and I understand his point. 
But is there a real problem with triggering analyze after every 500,000 
(or more) modifications in the table anyway?





Re: New GUC autovacuum_max_threshold ?

2024-04-25 Thread Frédéric Yhuel

Hi Nathan, thanks for your review.

On 24/04/2024 at 21:57, Nathan Bossart wrote:

Yeah, I'm having trouble following the proposed mechanics for this new GUC,
and it's difficult to understand how users would choose a value.  If we
just want to cap the number of tuples required before autovacuum takes
action, perhaps we could simplify it to something like

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;
vacthresh = Min(vacthresh, vac_max_thresh);

This would effectively cause autovacuum_vacuum_scale_factor to be
overridden for large tables where the scale factor would otherwise cause
the calculated threshold to be extremely high.



This would indeed work, and the parameter would be easier to define in 
the user documentation. I prefer a continuous function... but that is 
personal taste. It seems to me that autovacuum tuning is quite hard 
anyway, and that it wouldn't be that much difficult with this kind of 
asymptotic limit parameter.


But I think the most important thing is to avoid per-table configuration 
for most of the users, or even autovacuum tuning at all, so either of 
these two formulas would do.





Re: New GUC autovacuum_max_threshold ?

2024-04-24 Thread Frédéric Yhuel




On 24/04/2024 at 21:10, Melanie Plageman wrote:

On Wed, Apr 24, 2024 at 8:08 AM Frédéric Yhuel
 wrote:


Hello,

I would like to suggest a new parameter, autovacuum_max_threshold, which
would set an upper limit on the number of tuples to delete/update/insert
prior to vacuum/analyze.


Hi Frédéric, thanks for the proposal! You are tackling a very tough
problem. I would also find it useful to know more about what led you
to suggest this particular solution. I am very interested in user
stories around difficulties with what tables are autovacuumed and
when.



Hi Melanie! I can certainly start compiling user stories about that.

Recently, one of my colleagues wrote an email to our DBA team saying 
something along these lines:


« Hey, here are our suggested settings for per-table autovacuum 
configuration:


| *autovacuum*           | L < 1 million | L >= 1 million | L >= 5 millions | L >= 10 millions |
|:-----------------------|--------------:|---------------:|----------------:|-----------------:|
| `vacuum_scale_factor`  | 0.2 (default) |            0.1 |            0.05 |              0.0 |
| `vacuum_threshold`     |  50 (default) |   50 (default) |    50 (default) |          500 000 |
| `analyze_scale_factor` | 0.1 (default) |  0.1 (default) |            0.05 |              0.0 |
| `analyze_threshold`    |  50 (default) |   50 (default) |    50 (default) |          500 000 |


Let's update this table with values for the vacuum_insert_* parameters. »

I wasn't aware that we had this table, and although the settings made 
sense to me, I thought it was rather ugly and cumbersome for the user, 
and I started thinking about how postgres could make their life easier.



Am I correct in thinking that one of the major goals here is for a
very large table to be more likely to be vacuumed?



Absolutely.


The idea would be to replace the following calculation :

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;

with this one :

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples / (1
+ vac_scale_factor * reltuples / autovacuum_max_threshold)

(and the same for the others, vacinsthresh and anlthresh).


My first thought when reviewing the GUC and how it is used is
wondering if its description is a bit misleading.

autovacuum_vacuum_threshold is the "minimum number of updated or
deleted tuples needed to trigger a vacuum". That is, if this many
tuples are modified, it *may* trigger a vacuum, but we also may skip
vacuuming the table for other reasons or due to other factors.
autovacuum_max_threshold's proposed definition is the upper
limit/maximum number of tuples to insert/update/delete prior to
vacuum/analyze. This implies that if that many tuples have been
modified or inserted, the table will definitely be vacuumed -- which
isn't true. Maybe that is okay, but I thought I would bring it up.



I'm not too sure I understand. What are the reasons it might be skipped? 
I can think of a concurrent index creation on the same table, or 
anything holding a SHARE UPDATE EXCLUSIVE lock or above. Is this the 
sort of thing you are talking about?


Perhaps a better name for the GUC would be 
autovacuum_asymptotic_limit... or something like that?



The attached (draft) patch further illustrates the idea.


Thanks for including a patch!


My guess is that a similar proposal has already been submitted... and
rejected 🙂 If so, I'm very sorry for the useless noise.


I rooted around in the hackers archive and couldn't find any threads
on this specific proposal. I copied some other hackers I knew of who
have worked on this problem and thought about it in the past, in case
they know of some existing threads or prior work on this specific
topic.



Thanks!




New GUC autovacuum_max_threshold ?

2024-04-24 Thread Frédéric Yhuel

Hello,

I would like to suggest a new parameter, autovacuum_max_threshold, which 
would set an upper limit on the number of tuples to delete/update/insert 
prior to vacuum/analyze.


A good default might be 500,000.

The idea would be to replace the following calculation :

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples;

with this one :

vacthresh = (float4) vac_base_thresh + vac_scale_factor * reltuples / (1 
+ vac_scale_factor * reltuples / autovacuum_max_threshold)


(and the same for the others, vacinsthresh and anlthresh).

The attached graph plots vacthresh against pgclass.reltuples, with 
default settings :


autovacuum_vacuum_threshold = 50
autovacuum_vacuum_scale_factor = 0.2

and

autovacuum_max_threshold = 500000 (the suggested default)

Thus, for small tables, vacthresh is only slightly smaller than 0.2 * 
pgclass.reltuples, but it grows towards 500,000 when reltuples → ∞


The idea is to reduce the need for autovacuum tuning.

The attached (draft) patch further illustrates the idea.

My guess is that a similar proposal has already been submitted... and 
rejected 🙂 If so, I'm very sorry for the useless noise.


Best regards,
Frédéric
From 9027d857e3426f327a2a5f61aec11a7604bb48a9 Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Fri, 19 Apr 2024 14:05:37 +0200
Subject: [PATCH] Add new GUC autovacuum_max_threshold

---
 src/backend/access/common/reloptions.c | 11 +++
 src/backend/postmaster/autovacuum.c| 18 +++---
 src/backend/utils/misc/guc_tables.c|  9 +
 src/include/postmaster/autovacuum.h|  1 +
 src/include/utils/rel.h|  1 +
 5 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/common/reloptions.c 
b/src/backend/access/common/reloptions.c
index 469de9bb49..11a6423aff 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -233,6 +233,15 @@ static relopt_int intRelOpts[] =
},
-1, 0, INT_MAX
},
+   {
+   {
+   "autovacuum_max_threshold",
+   "Maximum number of tuple XXX prior to vacuum/analyze",
+   RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+   ShareUpdateExclusiveLock
+   },
+   -1, 0, INT_MAX
+   },
{
{
"autovacuum_vacuum_insert_threshold",
@@ -1845,6 +1854,8 @@ default_reloptions(Datum reloptions, bool validate, 
relopt_kind kind)
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
enabled)},
{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_threshold)},
+   {"autovacuum_max_threshold", RELOPT_TYPE_INT,
+   offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_max_threshold)},
{"autovacuum_vacuum_insert_threshold", RELOPT_TYPE_INT,
offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, 
vacuum_ins_threshold)},
{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
diff --git a/src/backend/postmaster/autovacuum.c 
b/src/backend/postmaster/autovacuum.c
index 7dd9345c61..e453c7c824 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -119,6 +119,7 @@ int autovacuum_max_workers;
 intautovacuum_work_mem = -1;
 intautovacuum_naptime;
 intautovacuum_vac_thresh;
+intautovacuum_max_thresh;
 double autovacuum_vac_scale;
 intautovacuum_vac_ins_thresh;
 double autovacuum_vac_ins_scale;
@@ -3017,6 +3018,12 @@ recheck_relation_needs_vacanalyze(Oid relid,
*doanalyze = false;
 }
 
+#define COMPUTE_TRESHOLD(res, scale_factor, base_thresh) \
+do { \
+   res = (float4) base_thresh + scale_factor * reltuples / \
+   (1 + scale_factor * reltuples / vac_max_thresh); \
+} while(0)
+
 /*
  * relation_needs_vacanalyze
  *
@@ -3071,6 +3078,7 @@ relation_needs_vacanalyze(Oid relid,
 
/* constants from reloptions or GUC variables */
int vac_base_thresh,
+   vac_max_thresh,
vac_ins_base_thresh,
anl_base_thresh;
float4  vac_scale_factor,
@@ -3111,6 +3119,10 @@ relation_needs_vacanalyze(Oid relid,
? relopts->vacuum_threshold
: autovacuum_vac_thresh;
 
+   vac_max_thresh = (relopts && relopts->vacuum_max_threshold >= 0)
+   ? relopts->vacuum_max_threshold
+   : autovacuum_max_thresh;
+
vac_ins_scale_factor = (relopts && relopts->vacuum_ins_scale_factor >= 
0)
? relopts->vacuum_ins_scale_factor
: autovac

Re: Set log_lock_waits=on by default

2023-12-21 Thread Frédéric Yhuel




On 21/12/2023 at 14:29, Laurenz Albe wrote:

Here is a patch to implement this.
Being stuck behind a lock for more than a second is almost
always a problem, so it is reasonable to turn this on by default.


I think it's a really good idea. At Dalibo, we advise our customers to 
switch it on. AFAICT, it's never been a problem.


Best regards,
Frédéric





Re: Out of memory error handling in frontend code

2023-10-06 Thread Frédéric Yhuel

Hi Daniel,

Thank you for your answer.

On 9/28/23 14:02, Daniel Gustafsson wrote:

On 28 Sep 2023, at 10:14, Frédéric Yhuel  wrote:



After some time, we understood that the 20 million large objects were 
responsible for the huge memory usage (more than 10 GB) by pg_dump.


This sounds like a known issue [0] which has been reported several times, and
one we should get around to fixing sometime.



Indeed, I saw some of these reports afterwards :)


I think a more useful error message would help for such cases.


Knowing that this is case that pops up, I agree that we could do better around
the messaging here.


I haven't tried to get the patch ready for review; I know that the format of the
messages isn't right. I'd like to know what you think of the idea, first.


I don't think adding more details is a bad idea, but it shouldn't require any
knowledge about internals so I think messages like the one below needs to be
reworded to be more helpful.

+   if (loinfo == NULL)
+   {
+   pg_fatal("getLOs: out of memory");
+   }



OK, here is a second version of the patch.

I didn't try to fix the path getLOs -> AssignDumpId -> catalogid_insert 
-> [...] -> catalogid_allocate, but that's annoying because it amounts 
to 11% of the memory allocations from valgrind's output.


Best regards,
Frédéric
From 575903c1fd4ccbe49c762d1d6424e2264ba6cfad Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Mon, 18 Sep 2023 08:18:19 +0200
Subject: [PATCH] pg_dump: fix OOM handling

Exit with better error messages when there's not enough memory
to process large objects.
---
 src/bin/pg_dump/pg_backup_archiver.c | 45 +---
 src/bin/pg_dump/pg_backup_archiver.h |  4 +++
 src/bin/pg_dump/pg_dump.c| 33 
 src/common/fe_memutils.c | 14 +++--
 src/include/common/fe_memutils.h |  1 +
 5 files changed, 80 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index ab351e457e..fedbe67a07 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1101,6 +1101,22 @@ WriteData(Archive *AHX, const void *data, size_t dLen)
 	AH->WriteDataPtr(AH, data, dLen);
 }
 
+#define fill_in_new_toc(a,b) \
+do { \
+	if (opts->b) \
+	{ \
+		newToc->a = pg_strdup_extended(opts->b, MCXT_ALLOC_NO_OOM); \
+		if (newToc->a == NULL) \
+		{ \
+			goto error; \
+		} \
+	} \
+	else \
+	{ \
+		newToc->a = NULL; \
+	} \
+} while(0)
+
 /*
  * Create a new TOC entry. The TOC was designed as a TOC, but is now the
  * repository for all metadata. But the name has stuck.
@@ -1118,7 +1134,11 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 	TocEntry   *newToc;
 
-	newToc = (TocEntry *) pg_malloc0(sizeof(TocEntry));
+	newToc = (TocEntry *) pg_malloc_extended(sizeof(TocEntry), MCXT_ALLOC_NO_OOM|MCXT_ALLOC_ZERO);
+	if (newToc == NULL)
+	{
+		goto error;
+	}
 
 	AH->tocCount++;
 	if (dumpId > AH->maxDumpId)
@@ -1133,15 +1153,15 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 	newToc->dumpId = dumpId;
 	newToc->section = opts->section;
 
-	newToc->tag = pg_strdup(opts->tag);
-	newToc->namespace = opts->namespace ? pg_strdup(opts->namespace) : NULL;
-	newToc->tablespace = opts->tablespace ? pg_strdup(opts->tablespace) : NULL;
-	newToc->tableam = opts->tableam ? pg_strdup(opts->tableam) : NULL;
-	newToc->owner = opts->owner ? pg_strdup(opts->owner) : NULL;
-	newToc->desc = pg_strdup(opts->description);
-	newToc->defn = opts->createStmt ? pg_strdup(opts->createStmt) : NULL;
-	newToc->dropStmt = opts->dropStmt ? pg_strdup(opts->dropStmt) : NULL;
-	newToc->copyStmt = opts->copyStmt ? pg_strdup(opts->copyStmt) : NULL;
+	fill_in_new_toc(tag, tag);
+	fill_in_new_toc(namespace, namespace);
+	fill_in_new_toc(tablespace, tablespace);
+	fill_in_new_toc(tableam, tableam);
+	fill_in_new_toc(owner, owner);
+	fill_in_new_toc(desc, description);
+	fill_in_new_toc(defn, createStmt);
+	fill_in_new_toc(dropStmt, dropStmt);
+	fill_in_new_toc(copyStmt, copyStmt);
 
 	if (opts->nDeps > 0)
 	{
@@ -1166,6 +1186,11 @@ ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
 		AH->ArchiveEntryPtr(AH, newToc);
 
 	return newToc;
+
+error:
+	pg_log_error("Could not add a new archive entry: %s", strerror(errno));
+	pg_log_error_hint(LO_OOM_HINT);
+	exit(1);
 }
 
 /* Public */
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 18b38c17ab..626fe2cb11 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -97,6 +97,10 @@ typedef struct _archiveHandle ArchiveHandle;
 typedef struct _tocEntry TocEntry;
 struct Paral

Out of memory error handling in frontend code

2023-09-28 Thread Frédéric Yhuel

Hello,

One of our customers recently complained that his pg_dump stopped 
abruptly with the message "out of memory".


After some time, we understood that the 20 million large objects were 
responsible for the huge memory usage (more than 10 GB) by pg_dump.


I think a more useful error message would help for such cases. Indeed, 
it's not always possible to ask the client to run pg_dump with 
"valgrind --tool=massif" on their server.


Now, I understand that we don't want to add too much to the frontend 
code, it would be a lot of pain for not much gain.


But I wonder if we could add some checks in a few strategic places, as 
in the attached patch.


The idea would be to print a more useful error message in most of the 
cases, and keep the basic "out of memory" for the remaining ones.


I haven't tried to get the patch ready for review; I know that the format 
of the messages isn't right. I'd like to know what you think of the 
idea, first.


Best regards,
Frédéric
From 81aa4ae59778f1193d6e1a8c81931502c941e997 Mon Sep 17 00:00:00 2001
From: Frédéric Yhuel
Date: Mon, 18 Sep 2023 08:18:19 +0200
Subject: [PATCH] pg_dump: fix OOM handling

Exit with better error messages when there's not enough memory
to process large objects.
---
 src/bin/pg_dump/pg_backup_archiver.c | 10 +++---
 src/bin/pg_dump/pg_backup_archiver.h |  6 --
 src/bin/pg_dump/pg_dump.c| 13 +++--
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 39ebcfec32..8370e26075 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1112,13 +1112,17 @@ WriteData(Archive *AHX, const void *data, size_t dLen)
 
 /* Public */
 TocEntry *
-ArchiveEntry(Archive *AHX, CatalogId catalogId, DumpId dumpId,
-			 ArchiveOpts *opts)
+ArchiveEntry2(Archive *AHX, CatalogId catalogId, DumpId dumpId,
+			 ArchiveOpts *opts, const char* caller)
 {
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 	TocEntry   *newToc;
 
-	newToc = (TocEntry *) pg_malloc0(sizeof(TocEntry));
+	newToc = (TocEntry *) pg_malloc_extended(sizeof(TocEntry), MCXT_ALLOC_NO_OOM|MCXT_ALLOC_ZERO);
+	if (newToc == NULL)
+	{
+		pg_fatal("%s: could not add a new archive entry: %s", caller, strerror(errno));
+	}
 
 	AH->tocCount++;
 	if (dumpId > AH->maxDumpId)
diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 18b38c17ab..60872744a6 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -400,8 +400,10 @@ typedef struct _archiveOpts
 } ArchiveOpts;
 #define ARCHIVE_OPTS(...) &(ArchiveOpts){__VA_ARGS__}
 /* Called to add a TOC entry */
-extern TocEntry *ArchiveEntry(Archive *AHX, CatalogId catalogId,
-			  DumpId dumpId, ArchiveOpts *opts);
+extern TocEntry *ArchiveEntry2(Archive *AHX, CatalogId catalogId,
+	  DumpId dumpId, ArchiveOpts *opts, const char* caller);
+
+#define ArchiveEntry(a, b, c, d) ArchiveEntry2(a, b, c, d, __func__)
 
 extern void WriteHead(ArchiveHandle *AH);
 extern void ReadHead(ArchiveHandle *AH);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 7977d6a9c0..8413d4f115 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3566,7 +3566,11 @@ getLOs(Archive *fout)
 	/*
 	 * Each large object has its own "BLOB" archive entry.
 	 */
-	loinfo = (LoInfo *) pg_malloc(ntups * sizeof(LoInfo));
+	loinfo = (LoInfo *) pg_malloc_extended(ntups * sizeof(LoInfo), MCXT_ALLOC_NO_OOM);
+	if (loinfo == NULL)
+	{
+		pg_fatal("getLOs: out of memory");
+	}
 
 	for (i = 0; i < ntups; i++)
 	{
@@ -3606,7 +3610,12 @@ getLOs(Archive *fout)
 	 */
 	if (ntups > 0)
 	{
-		lodata = (DumpableObject *) pg_malloc(sizeof(DumpableObject));
+		lodata = (DumpableObject *) pg_malloc_extended(sizeof(DumpableObject), MCXT_ALLOC_NO_OOM);
+		if (lodata == NULL)
+		{
+			pg_fatal("getLOs: out of memory");
+		}
+
 		lodata->objType = DO_LARGE_OBJECT_DATA;
 		lodata->catId = nilCatalogId;
 		AssignDumpId(lodata);
-- 
2.39.2



Re: Allow parallel plan for referential integrity checks?

2023-08-17 Thread Frédéric Yhuel




On 8/17/23 14:00, Frédéric Yhuel wrote:



On 8/17/23 09:32, Frédéric Yhuel wrote:



On 8/10/23 17:06, Juan José Santamaría Flecha wrote:
Recently I restored a database from a directory format backup and 
having this feature would have been quite useful


Hi,

Thanks for resuming work on this patch. I forgot to mention this in my 
original email, but the motivation was also to speed up the restore 
process. Parallelizing the FK checks could make a huge difference in 
certain cases. We should probably provide such a test case (with perf 
numbers), and maybe this is what Robert asked for.


I have attached two scripts which demonstrate the following problems:



Let me add the plans for more clarity:

1a. if the tables aren't analyzed nor vacuumed before the post-data 
step, then they are index-only scanned, with a lot of heap fetches 
(depending on their size, the planner sometimes chooses a seq scan 
instead).




https://explain.dalibo.com/plan/7491ga22c5293683#raw

1b. if the tables have been analyzed but not vacuumed before the 
post-data-step, then they are scanned sequentially. Usually better, but 
still not so good without a parallel plan.




https://explain.dalibo.com/plan/e92c5161g880bdhf#raw

2. if the visibility maps have been created, then the tables are 
index-only scanned without heap fetches, but this can still be slower 
than a parallel seq scan.




https://explain.dalibo.com/plan/de22bdb4ggc3dffg#raw

And at last, with the patch applied : 
https://explain.dalibo.com/plan/54612a09ffh2565a#raw





Re: Allow parallel plan for referential integrity checks?

2023-08-17 Thread Frédéric Yhuel



On 8/17/23 09:32, Frédéric Yhuel wrote:



On 8/10/23 17:06, Juan José Santamaría Flecha wrote:
Recently I restored a database from a directory format backup and 
having this feature would have been quite useful


Hi,

Thanks for resuming work on this patch. I forgot to mention this in my 
original email, but the motivation was also to speed up the restore 
process. Parallelizing the FK checks could make a huge difference in 
certain cases. We should probably provide such a test case (with perf 
numbers), and maybe this is what Robert asked for.


I have attached two scripts which demonstrate the following problems:

1a. if the tables aren't analyzed nor vacuumed before the post-data 
step, then they are index-only scanned, with a lot of heap fetches 
(depending on their size, the planner sometimes chooses a seq scan instead).


1b. if the tables have been analyzed but not vacuumed before the 
post-data-step, then they are scanned sequentially. Usually better, but 
still not so good without a parallel plan.


2. if the visibility maps have been created, then the tables are 
index-only scanned without heap fetches, but this can still be slower 
than a parallel seq scan.


So it would be nice if pg_restore could vacuum analyze the tables before 
the post-data step. I believe it would be faster in most cases.


And it would be nice to allow a parallel plan for RI checks.

Best regards,
Frédéric

create_and_fill_tables.sql
Description: application/sql


post_data.sql
Description: application/sql


Re: Allow parallel plan for referential integrity checks?

2023-08-17 Thread Frédéric Yhuel




On 8/10/23 17:06, Juan José Santamaría Flecha wrote:
Recently I restored a database from a directory format backup and having 
this feature would have been quite useful


Hi,

Thanks for resuming work on this patch. I forgot to mention this in my 
original email, but the motivation was also to speed up the restore 
process. Parallelizing the FK checks could make a huge difference in 
certain cases. We should probably provide such a test case (with perf 
numbers), and maybe this is what Robert asked for.





Re: Allow parallel plan for referential integrity checks?

2023-03-20 Thread Frédéric Yhuel




On 3/20/23 15:58, Gregory Stark (as CFM) wrote:

On Mon, 12 Dec 2022 at 11:37, Frédéric Yhuel  wrote:


I've planned to work on it full time on week 10 (6-10 March), if you
agree to bear with me. The idea would be to bootstrap my brain on it,
and then continue to work on it from time to time.


Any updates on this patch? 


I had the opportunity to discuss it with Melanie Plageman when she came 
to Paris last month and teach us (Dalibo's folk) about the patch review 
process.


She advised me to provide a good test case first, and to explain better 
how it would be useful, which I'm going to do.



Should we move it to next release at this
point? Even if you get time to work on it this commitfest do you think
it's likely to be committable in the next few weeks?



It is very unlikely. Maybe it's better to remove it from CF and put it 
back later if the test case I will provide does a better job at 
convincing the Postgres folks that RI checks should be parallelized.





Re: Allow parallel plan for referential integrity checks?

2022-12-12 Thread Frédéric Yhuel




On 12/11/22 06:29, Ian Lawrence Barwick wrote:

On Tue, 26 Jul 2022 at 20:58, Frédéric Yhuel wrote:




On 4/14/22 14:25, Frédéric Yhuel wrote:



On 3/19/22 01:57, Imseih (AWS), Sami wrote:

I looked at your patch and it's a good idea to make foreign key
validation
use parallel query on large relations.

It would be valuable to add logging to ensure that the ActiveSnapshot
and TransactionSnapshot
is the same for the leader and the workers. This logging could be
tested in the TAP test.

Also, inside RI_Initial_Check you may want to set max_parallel_workers to
max_parallel_maintenance_workers.

Currently the work_mem is set to maintenance_work_mem. This will also
require
a doc change to call out.

/*
 * Temporarily increase work_mem so that the check query can be executed
 * more efficiently.  It seems okay to do this because the query is simple
 * enough to not use a multiple of work_mem, and one typically would not
 * have many large foreign-key validations happening concurrently.  So
 * this seems to meet the criteria for being considered a "maintenance"
 * operation, and accordingly we use maintenance_work_mem.  However, we



Hello Sami,

Thank you for your review!

I will try to do as you say, but it will take time, since my daily job
as database consultant takes most of my time and energy.



Hi,

As suggested by Jacob, here is a quick message to say that I didn't find
enough time to work further on this patch, but I didn't completely
forget it either. I moved it to the next commitfest. Hopefully I will
find enough time and motivation in the coming months :-)


Hi Frédéric

This patch has been carried forward for a couple more commitfests since
your message; do you think you'll be able to work on it further during this
release cycle?



Hi Ian,

I've planned to work on it full time on week 10 (6-10 March), if you 
agree to bear with me. The idea would be to bootstrap my brain on it, 
and then continue to work on it from time to time.


Best regards,
Frédéric




Re: [PATCH] minor optimization for ineq_histogram_selectivity()

2022-11-24 Thread Frédéric Yhuel




On 11/23/22 16:59, Tom Lane wrote:

Frédéric Yhuel  writes:

On 10/24/22 17:26, Frédéric Yhuel wrote:

When studying the weird planner issue reported here [1], I came up with
the attached patch. It reduces the probability of calling
get_actual_variable_range().



This isn't very useful anymore thanks to this patch:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9c6ad5eaa957bdc2132b900a96e0d2ec9264d39c


I hadn't looked at this patch before, but now that I have, I'm inclined
to reject it anyway.  It just moves the problem around: now, instead of
possibly doing an unnecessary index probe at the right end, you're
possibly doing an unnecessary index probe at the left end.


Indeed... it seemed to me that both versions would do an unnecessary 
index probe at the left end, but I wasn't careful enough :-/



It also
looks quite weird compared to the normal coding of binary search.



That's right.


I wonder if there'd be something to be said for leaving the initial
probe calculation alone and doing this:

 else if (probe == sslot.nvalues - 1 && sslot.nvalues > 2)
+{
+    /* Don't probe the endpoint until we have to. */
+    if (probe > lobound)
+        probe--;
+    else
         have_end = get_actual_variable_range(root,
                                              vardata,
                                              sslot.staop,
                                              collation,
                                              NULL,
                                              &sslot.values[probe]);
+}

On the whole though, it seems like a wart.




Yeah... it's probably wiser not to risk introducing a bug, only to save 
an index probe in rare cases (and only 100 reads, thanks to 9c6ad5ea).


Thank you for having had a look at it.

Best regards,
Frédéric





Re: [PATCH] minor optimization for ineq_histogram_selectivity()

2022-11-23 Thread Frédéric Yhuel




On 10/24/22 17:26, Frédéric Yhuel wrote:

Hello,

When studying the weird planner issue reported here [1], I came up with 
the attached patch. It reduces the probability of calling 
get_actual_variable_range().


This isn't very useful anymore thanks to this patch: 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9c6ad5eaa957bdc2132b900a96e0d2ec9264d39c


Unless we want to save a hundred page reads in rare cases.




Re: [PATCH] minor optimization for ineq_histogram_selectivity()

2022-10-31 Thread Frédéric Yhuel




On 10/24/22 17:26, Frédéric Yhuel wrote:

Hello,

When studying the weird planner issue reported here [1], I came up with 
the attached patch. It reduces the probability of calling 
get_actual_variable_range().


The patch applies to the master branch.

How to test :

CREATE TABLE foo (a bigint, b TEXT) WITH (autovacuum_enabled = off);
INSERT INTO foo SELECT i%213, md5(i::text) from 
generate_series(1,100) i;

VACUUM ANALYZE foo;
SELECT * FROM pg_stats WHERE tablename = 'foo' AND attname='a'\gx
CREATE INDEX ON foo(a);
DELETE FROM foo WHERE a = 212;
EXPLAIN (BUFFERS) SELECT count(a) FROM foo WHERE a > 208;



With the above example, the variables "lobound", "hibound", and "probe" 
would vary like this :


without patch :

lobound hibound  probe
---
0   101  50
51  101  76
77  101  89
90  101  95
96  101  98
99  101  100
99  100  99
99  99


with patch :

lobound hibound  probe
---
0   101  50
51  101  75
76  101  88
89  101  94
95  101  97
98  101  99
98  99   98
99  99

So we find the correct right end of the histogram bin (99) in both 
cases, but "probe" doesn't reach 100 in the latter one, and

get_actual_variable_range() is never called.

Now, if we'd run the query SELECT count(a) FROM foo WHERE a > 211 :

without patch :

lobound hibound  probe
---
0   101  50
51  101  76
77  101  89
90  101  95
96  101  98
99  101  100
99  100  99
100 100

with patch :

lobound hibound  probe
---
0   101  50
51  101  75
76  101  88
89  101  94
95  101  97
98  101  99
100 101  100
100 100


Here, the correct right end of the histogram bin (100) is also found in 
both cases.


I'm well aware that an example doesn't prove the correctness of an 
algorithm, though.
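For what it's worth, the traces above can be replayed with a small 
simulation (a sketch in Python, not PostgreSQL code; the helper name is 
mine):

```python
# Replay the binary search of ineq_histogram_selectivity() over a
# histogram with nvalues entries.  "first_false" is the smallest index
# for which "values[i] < x" is false; the loop must converge with
# lobound == hibound == first_false.

def probe_sequence(nvalues, first_false, patched):
    """Return (probed indices, final lobound).

    patched=False uses the original midpoint (lobound + hibound) / 2;
    patched=True uses the proposed (lobound + hibound - 1) / 2.
    """
    lobound, hibound = 0, nvalues
    probes = []
    while lobound < hibound:
        probe = (lobound + hibound - (1 if patched else 0)) // 2
        probes.append(probe)
        if probe < first_false:      # values[probe] < x: search right
            lobound = probe + 1
        else:                        # values[probe] >= x: search left
            hibound = probe
    return probes, lobound

# "a > 208": the right end of the bin is at histogram index 99.
assert probe_sequence(101, 99, patched=False)[0] == [50, 76, 89, 95, 98, 100, 99]
assert probe_sequence(101, 99, patched=True)[0] == [50, 75, 88, 94, 97, 99, 98]

# "a > 211": the right end is at index 100, and both variants find it.
assert probe_sequence(101, 100, patched=True)[0] == [50, 75, 88, 94, 97, 99, 100]

# Exhaustive check: both variants always converge to the same index,
# and the patched one only probes index nvalues - 1 when lobound has
# actually reached it.
for ff in range(102):
    orig_probes, lo1 = probe_sequence(101, ff, patched=False)
    new_probes, lo2 = probe_sequence(101, ff, patched=True)
    assert lo1 == lo2 == ff
    if 100 in new_probes:
        assert ff >= 100
```

So with the biased midpoint, the sslot.values[sslot.nvalues - 1] entry 
(the one whose lookup may trigger get_actual_variable_range()) is only 
probed when the searched value really falls in the last bin, which 
matches the two tables above.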


Best regards,
Frédéric




Re: Transparent column encryption

2022-10-28 Thread Frédéric Yhuel

Hi,

Here are a few more things I noticed :

If a CEK is encrypted with cmk1 and cmk2, but cmk1 isn't found on the 
client, the following error is printed twice for the very first SELECT 
statement:


  could not open file "/path/to/cmk1.pem": No such file or directory

...and nothing is returned. The next queries in the same session would 
work correctly (cmk2 is used for the decryption of the CEK). An INSERT 
statement is handled properly, though: one (and only one) error 
message, and the row is actually inserted in all cases.


For example :

  postgres=# SELECT * FROM customers ;
  could not open file "/path/to/cmk1.pem": No such file or directory

  could not open file "/path/to/cmk1.pem": No such file or directory

  postgres=# SELECT * FROM customers ;
   id | name  | creditcard_num
  ----+-------+-----------------
    1 | toto  | 546843351354245
    2 | babar | 546843351354245



  postgres=# INSERT INTO customers (id, name, creditcard_num) VALUES 
 ($1, $2, $3) \gencr '3' 'toto' '546888351354245';

  could not open file "/path/to/cmk1.pem": No such file or directory

  INSERT 0 1
  postgres=# SELECT * FROM customers ;
   id | name  | creditcard_num
  ----+-------+-----------------
    1 | toto  | 546843351354245
    2 | babar | 546843351354245
    3 | toto  | 546888351354245


From the documentation of CREATE COLUMN MASTER KEY, it looks like the 
REALM is optional, but both

  CREATE COLUMN MASTER KEY cmk1;
and
  CREATE COLUMN MASTER KEY cmk1 WITH ();
return a syntax error.


About AEAD, the documentation says :
> The “associated data” in these algorithms consists of 4 bytes: The 
ASCII letters P and G (byte values 80 and 71), followed by the algorithm 
ID as a 16-bit unsigned integer in network byte order.


My guess is that it serves no real purpose. Did I misunderstand?
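For reference, that 4-byte associated data is straightforward to 
construct; here is a sketch (the helper name and the example algorithm 
ID are mine):

```python
# Build the AEAD "associated data" as described in the documentation
# quoted above: ASCII 'P' (80) and 'G' (71), followed by the algorithm
# ID as a 16-bit unsigned integer in network (big-endian) byte order.
import struct

def make_associated_data(algorithm_id: int) -> bytes:
    return b"PG" + struct.pack("!H", algorithm_id)

assert make_associated_data(1) == bytes([80, 71, 0, 1])
```

One common rationale for this kind of AAD is domain separation: the 
AEAD tag then fails to verify if a ciphertext is presented under a 
different algorithm ID. Whether that actually buys anything here is 
exactly the question above.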




[PATCH] minor optimization for ineq_histogram_selectivity()

2022-10-24 Thread Frédéric Yhuel

Hello,

When studying the weird planner issue reported here [1], I came up with 
the attached patch. It reduces the probability of calling 
get_actual_variable_range().


The patch applies to the master branch.

How to test :

CREATE TABLE foo (a bigint, b TEXT) WITH (autovacuum_enabled = off);
INSERT INTO foo SELECT i%213, md5(i::text) from 
generate_series(1,100) i;

VACUUM ANALYZE foo;
SELECT * FROM pg_stats WHERE tablename = 'foo' AND attname='a'\gx
CREATE INDEX ON foo(a);
DELETE FROM foo WHERE a = 212;
EXPLAIN (BUFFERS) SELECT count(a) FROM foo WHERE a > 208;

Without this patch, you will observe at least 4694 shared hits (which 
are mostly heap fetches). If you apply the patch, you will observe very 
few of them.


You should run the EXPLAIN on a standby, if you want to observe the heap 
fetches more than one time (because of killed index tuples being ignored).


Best regards,
Frédéric

[1] 
https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com

From d7602f0731a3c5cac59cf2fc81bc336f800cb11a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Mon, 24 Oct 2022 16:33:26 +0200
Subject: [PATCH] minor optimization for ineq_histogram_selectivity()

Avoid calling get_actual_variable_range() when possible.

With this change, 'probe' can still reach sslot.nvalues - 1 if needed,
so the algorithm still works correctly.

But if the right end of the histogram bin is equal
to sslot.values[sslot.nvalues - 2], 'probe' will stay strictly less
than sslot.nvalues - 1 in the while loop, and we save a call to
get_actual_variable_range().

Use of get_actual_variable_range() can sometimes lead to surprisingly
slow planning time (see [1] and [2])

[1] https://www.postgresql.org/message-id/flat/CAECtzeVPM4Oi6dTdqVQmjoLkDBVChNj7ed3hNs1RGrBbwCJ7Cw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/36adf760-680b-7a4a-e019-64f4eaaf6ff7%40gmail.com
---
 src/backend/utils/adt/selfuncs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 69e0fb98f5..c1bc67705c 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1108,7 +1108,7 @@ ineq_histogram_selectivity(PlannerInfo *root,
 
 			while (lobound < hibound)
 			{
-int			probe = (lobound + hibound) / 2;
+int			probe = (lobound + hibound - 1) / 2;
 bool		ltcmp;
 
 /*
-- 
2.30.2



Re: [PATCH] minor bug fix for pg_dump --clean

2022-10-24 Thread Frédéric Yhuel

On 10/24/22 03:01, Tom Lane wrote:

=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Yhuel?=  writes:

When using pg_dump (or pg_restore) with option "--clean", there is some
SQL code to drop all objects at the beginning.


Yup ...


The DROP statement for a view involving circular dependencies is :
CREATE OR REPLACE VIEW [...]
(see commit message of d8c05aff for a much better explanation)
If the view is not in the "public" schema, and the target database is
empty, this statement fails, because the schema hasn't been created yet.
The attached patches are a TAP test which can be used to reproduce the
bug, and a proposed fix. They apply to the master branch.


I am disinclined to accept this as a valid bug, because there's never
been any guarantee that a --clean script would execute error-free in
a database that doesn't match what the source database contains.

(The pg_dump documentation used to say that in so many words.
I see that whoever added the --if-exists option was under the
fond illusion that that fixes all cases, which it surely does not.
We need to back off the promises a bit there.)

An example of a case that won't execute error-free is if the view
having a circular dependency includes a column of a non-built-in
data type.  If you try to run that in an empty database, you'll
get an error from the CREATE OR REPLACE VIEW's reference to that
data type.  For instance, if I adjust your test case to make
the "payload" column be of type hstore, I get something like

psql:dumpresult.sql:22: ERROR:  type "public.hstore" does not exist
LINE 4: NULL::public.hstore AS payload;
   ^

The same type of failure occurs for user-defined functions and
operators that use a non-built-in type, and I'm sure there are
more cases in the same vein.  But it gets *really* messy if
the target database isn't completely empty, but contains objects
with different properties than the dump script expects; for example,
if the view being discussed here exists with a different column set
than the script thinks, or if the dependency chains aren't all the
same.

If this fix were cleaner I might be inclined to accept it anyway,
but it's not very clean at all --- for example, it's far from
obvious to me what are the side-effects of changing the filter
in RestoreArchive like that.  Nor am I sure that the schema
you want to create is guaranteed to get dropped again later in
every use-case.



Hi Tom, Viktoria,

Thank you for your review Viktoria!

Thank you for this detailed explanation, Tom! I didn't have great hope 
for this patch. I thought that the TAP test could be accepted, but now I 
can see that it is clearly useless.




So I think mainly what we ought to do here is to adjust the
documentation to make it clearer that --clean is not guaranteed
to work without errors unless the target database has the same
set of objects as the source.  --if-exists can reduce the set
of error cases, but not eliminate it.  Possibly we should be
more enthusiastic about recommending --create --clean (ie,
drop and recreate the whole database) instead.



I believe a documentation patch would be useful, indeed.

Best regards,
Frédéric




[PATCH] minor bug fix for pg_dump --clean

2022-09-01 Thread Frédéric Yhuel

Hello,

When using pg_dump (or pg_restore) with option "--clean", there is some 
SQL code to drop all objects at the beginning.


The DROP statement for a view involving circular dependencies is :

CREATE OR REPLACE VIEW [...]

(see commit message of d8c05aff for a much better explanation)

If the view is not in the "public" schema, and the target database is 
empty, this statement fails, because the schema hasn't been created yet.


The attached patches are a TAP test which can be used to reproduce the 
bug, and a proposed fix. They apply to the master branch.


Best regards,
Frédéric

From 96a334e2411794bdc4648475be286f159daa2484 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Wed, 31 Aug 2022 15:57:39 +0200
Subject: [PATCH 2/2] pg_dump: fix up d8c05aff

---
 src/bin/pg_dump/pg_backup_archiver.c | 3 ++-
 src/bin/pg_dump/pg_dump.c| 4 
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 233198afc0..bb5031105e 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -585,7 +585,8 @@ RestoreArchive(Archive *AHX)
 			 */
 			if (strcmp(te->desc, "DEFAULT") == 0 ||
 strcmp(te->desc, "DATABASE PROPERTIES") == 0 ||
-strncmp(dropStmt, "CREATE OR REPLACE VIEW", 22) == 0)
+( strncmp(dropStmt, "CREATE SCHEMA IF NOT EXISTS", 27) == 0 &&
+  strstr(dropStmt+29, "CREATE OR REPLACE VIEW") ))
 appendPQExpBufferStr(ftStmt, dropStmt);
 			else
 			{
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d25709ad5f..8820984d9e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -17274,6 +17274,7 @@ dumpRule(Archive *fout, const RuleInfo *rinfo)
 	char	   *qtabname;
 	PGresult   *res;
 	char	   *tag;
+	char	   *qnspname;
 
 	/* Do nothing in data-only dump */
 	if (dopt->dataOnly)
@@ -17374,6 +17375,9 @@ dumpRule(Archive *fout, const RuleInfo *rinfo)
 		 */
 		PQExpBuffer result;
 
+		qnspname = pg_strdup(fmtId(tbinfo->dobj.namespace->dobj.name));
+		appendPQExpBuffer(delcmd, "CREATE SCHEMA IF NOT EXISTS %s;\n", qnspname);
+		free(qnspname);
 		appendPQExpBuffer(delcmd, "CREATE OR REPLACE VIEW %s",
 		  fmtQualifiedDumpable(tbinfo));
 		result = createDummyViewAsClause(fout, tbinfo);
-- 
2.30.2

From 463774feb3b968b577ac082a71fdc2de51c8bac4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Wed, 31 Aug 2022 16:03:34 +0200
Subject: [PATCH 1/2] Add TAP test for pg_restore

---
 src/bin/pg_dump/t/002_pg_dump.pl | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 2873b662fb..ff826e4688 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -4042,6 +4042,28 @@ $node->command_ok(
 	'pg_dumpall: option --exclude-database handles database names with embedded dots'
 );
 
+
+# Test pg_restore with --clean --if-exists
+
+$node->psql('postgres', 'create database postgres2;');
+$node->psql('postgres', 'create database postgres3;');
+
+$node->psql('postgres3',
+		  'CREATE SCHEMA dump_test;'
+  . 'CREATE TABLE dump_test.foo (id INT primary key, payload TEXT);'
+  . 'CREATE VIEW dump_test.babar AS SELECT * FROM dump_test.foo GROUP BY id;',
+	  );
+
+$node->run_log(
+	[ 'pg_dump', '-p', "$port", '-Fc', '--no-sync', "--file=$tempdir/clean_if_exists.dump", 'postgres3' ]
+);
+
+$node->command_like(
+	[ 'pg_restore', '-p', "$port", '--clean', '--if-exists', '-d', 'postgres2', "$tempdir/clean_if_exists.dump" ],
+	'/^\s*$/',
+	'pg_restore should output no warnings on stderr'
+);
+
 #
 # Test invalid multipart schema names
 
-- 
2.30.2



Re: Allow parallel plan for referential integrity checks?

2022-07-26 Thread Frédéric Yhuel




On 4/14/22 14:25, Frédéric Yhuel wrote:



On 3/19/22 01:57, Imseih (AWS), Sami wrote:
I looked at your patch and it's a good idea to make foreign key 
validation use parallel query on large relations.

It would be valuable to add logging to ensure that the ActiveSnapshot 
and TransactionSnapshot is the same for the leader and the workers. 
This logging could be tested in the TAP test.

Also, inside RI_Initial_Check you may want to set max_parallel_workers to
max_parallel_maintenance_workers.

Currently the work_mem is set to maintenance_work_mem. This will also 
require a doc change to call out.

/*
 * Temporarily increase work_mem so that the check query can be executed
 * more efficiently.  It seems okay to do this because the query is simple
 * enough to not use a multiple of work_mem, and one typically would not
 * have many large foreign-key validations happening concurrently.  So
 * this seems to meet the criteria for being considered a "maintenance"
 * operation, and accordingly we use maintenance_work_mem.  However, we




Hello Sami,

Thank you for your review!

I will try to do as you say, but it will take time, since my daily job 
as a database consultant takes most of my time and energy.




Hi,

As suggested by Jacob, here is a quick message to say that I didn't find 
enough time to work further on this patch, but I didn't completely 
forget it either. I moved it to the next commitfest. Hopefully I will 
find enough time and motivation in the coming months :-)


Best regards,
Frédéric




Re: Allow parallel plan for referential integrity checks?

2022-04-14 Thread Frédéric Yhuel




On 3/19/22 01:57, Imseih (AWS), Sami wrote:

I looked at your patch and it's a good idea to make foreign key validation
use parallel query on large relations.

It would be valuable to add logging to ensure that the ActiveSnapshot and 
TransactionSnapshot
is the same for the leader and the workers. This logging could be tested in the 
TAP test.

Also, inside RI_Initial_Check you may want to set max_parallel_workers to
max_parallel_maintenance_workers.

Currently the work_mem is set to maintenance_work_mem. This will also require
a doc change to call out.

/*
  * Temporarily increase work_mem so that the check query can be executed
  * more efficiently.  It seems okay to do this because the query is simple
  * enough to not use a multiple of work_mem, and one typically would not
  * have many large foreign-key validations happening concurrently.  So
  * this seems to meet the criteria for being considered a "maintenance"
  * operation, and accordingly we use maintenance_work_mem.  However, we



Hello Sami,

Thank you for your review!

I will try to do as you say, but it will take time, since my daily job 
as a database consultant takes most of my time and energy.


Best regards,
Frédéric




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-11 Thread Frédéric Yhuel




On 4/11/22 02:57, Michael Paquier wrote:

On Fri, Apr 08, 2022 at 04:23:48PM +0200, Frédéric Yhuel wrote:

Thank you Michael.


And done as of 8ac700a.
--


Thank you Michael!

For reference purposes, we can see in the code of get_relation_info(), 
in plancat.c, that indeed every index of the table is opened, and 
therefore locked, regardless of the query.


Best regards,
Frédéric




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-08 Thread Frédéric Yhuel




On 4/8/22 02:22, Michael Paquier wrote:

On Thu, Apr 07, 2022 at 05:29:36PM +0200, Guillaume Lelarge wrote:

On Thu, Apr 7, 2022 at 15:44, Frédéric Yhuel  wrote:

On 4/7/22 14:40, Justin Pryzby wrote:
Thank you Justin! I applied your fixes in the v2 patch (attached).


v2 patch sounds good.


The location of the new sentence and its wording seem fine to me.  So
no objections from me to add what's suggested, as suggested.  I'll
wait for a couple of days first.



Thank you Michael.


Indeed ;) That being said, REINDEX CONCURRENTLY could give you an
invalid index, so sometimes you may be tempted to go for a simpler
REINDEX, especially if you believe that the SELECTs won't be blocked.


Agreed.


There are many factors to take into account, one is more expensive
than the other in terms of resources and has downsides, downsides
compensated by the reduction in the lock requirements.  There are
cases where REINDEX is a must-have, as CONCURRENTLY does not support
catalog indexes, and these tend to be easily noticed when corruption
spreads around.


Indeed!

Best regards,
Frédéric




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-07 Thread Frédéric Yhuel



On 4/7/22 14:40, Justin Pryzby wrote:

On Thu, Apr 07, 2022 at 01:37:57PM +0200, Frédéric Yhuel wrote:

Maybe something along this line? (patch attached)

Some language fixes.


Thank you Justin! I applied your fixes in the v2 patch (attached).


I didn't verify the behavior, but +1 to document the practical consequences.
I guess this is why someone invented REINDEX CONCURRENTLY.



Indeed ;) That being said, REINDEX CONCURRENTLY could give you an 
invalid index, so sometimes you may be tempted to go for a simpler 
REINDEX, especially if you believe that the SELECTs won't be blocked.

From 0b6c7d6e466fabc0233b6960b8a33141d512652f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Thu, 7 Apr 2022 13:30:59 +0200
Subject: [PATCH] Doc: Elaborate locking considerations for REINDEX

---
 doc/src/sgml/ref/reindex.sgml | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index e6b25ee670..d3c63c4deb 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -275,7 +275,12 @@ REINDEX [ ( option [, ...] ) ] { IN
considerations are rather different.  REINDEX locks out writes
but not reads of the index's parent table.  It also takes an
ACCESS EXCLUSIVE lock on the specific index being processed,
-   which will block reads that attempt to use that index.  In contrast,
+   which will block reads that attempt to use that index. In particular,
+   the PostgreSQL query planner tries to take an ACCESS SHARE
+   lock on every index of the table, regardless of the query, and so
+   REINDEX blocks virtually any queries except for some prepared
+   queries whose plan has been cached and which don't use this very index.
+   In contrast,
DROP INDEX momentarily takes an
ACCESS EXCLUSIVE lock on the parent table, blocking both
writes and reads.  The subsequent CREATE INDEX locks out
-- 
2.30.2



Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-07 Thread Frédéric Yhuel



On 4/6/22 17:03, Peter Geoghegan wrote:

On Wed, Apr 6, 2022 at 7:49 AM Frédéric Yhuel  wrote:

  From the documentation
(https://www.postgresql.org/docs/current/sql-reindex.html#id-1.9.3.162.7),
it sounds like REINDEX won't block read queries that don't need the
index. But it seems like the planner wants to take an ACCESS SHARE lock
on every index, regardless of the query, and so REINDEX actually
blocks any query except some prepared queries whose plans have been cached.

I wonder if it is a bug, or if the documentation should be updated. What
do you think?


I've always thought that the docs for REINDEX, while technically
accurate, are very misleading in practice.



Maybe something along this line? (patch attached)

From 4930bb8de182b78228d215bce1ab65263baabde7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Thu, 7 Apr 2022 13:30:59 +0200
Subject: [PATCH] Doc: Elaborate locking considerations for REINDEX

---
 doc/src/sgml/ref/reindex.sgml | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index e6b25ee670..06c223d4a3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -275,7 +275,11 @@ REINDEX [ ( option [, ...] ) ] { IN
considerations are rather different.  REINDEX locks out writes
but not reads of the index's parent table.  It also takes an
ACCESS EXCLUSIVE lock on the specific index being processed,
-   which will block reads that attempt to use that index.  In contrast,
+   which will block reads that attempt to use that index. In particular,
+   the PostgreSQL query planner wants to take an ACCESS SHARE
+   lock on every indexes of the table, regardless of the query, and so
+   REINDEX blocks virtually any queries but some prepared queries
+   whose plan have been cached and which don't use this very index. In contrast,
DROP INDEX momentarily takes an
ACCESS EXCLUSIVE lock on the parent table, blocking both
writes and reads.  The subsequent CREATE INDEX locks out
-- 
2.30.2



REINDEX blocks virtually any queries but some prepared queries.

2022-04-06 Thread Frédéric Yhuel

Hello,

From the documentation 
(https://www.postgresql.org/docs/current/sql-reindex.html#id-1.9.3.162.7), 
it sounds like REINDEX won't block read queries that don't need the 
index. But it seems like the planner wants to take an ACCESS SHARE lock 
on every index, regardless of the query, and so REINDEX actually 
blocks any query except some prepared queries whose plans have been cached.


I wonder if it is a bug, or if the documentation should be updated. What 
do you think?


Here is a simple demo (tested with postgres 10 and master):

Session #1
===

srcpg@postgres=# CREATE TABLE flights (id INT generated always as 
identity, takeoff DATE);

CREATE TABLE

srcpg@postgres=# INSERT INTO flights (takeoff) SELECT date '2022-03-01' 
+ interval '1 day' * i FROM generate_series(1,1000) i;

INSERT 0 1000

srcpg@postgres=# CREATE INDEX ON flights(takeoff);
CREATE INDEX

srcpg@postgres=# BEGIN;
BEGIN
srcpg@postgres=# REINDEX INDEX flights_takeoff_idx ;
REINDEX



Session #2
===

srcpg@postgres=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
        4114695

srcpg@postgres=# EXPLAIN SELECT id FROM flights;
--> it blocks


Session #3
===

srcpg@postgres=# SELECT locktype, relname, mode, granted FROM pg_locks 
LEFT JOIN pg_class ON (oid = relation) WHERE pid = 4114695;

  locktype  |       relname       |      mode       | granted
------------+---------------------+-----------------+---------
 virtualxid | ∅                   | ExclusiveLock   | t
 relation   | flights_takeoff_idx | AccessShareLock | f
 relation   | flights             | AccessShareLock | t




Re: Allow parallel plan for referential integrity checks?

2022-03-03 Thread Frédéric Yhuel

Hello, sorry for the late reply.

On 2/14/22 15:33, Robert Haas wrote:

On Mon, Feb 7, 2022 at 5:26 AM Frédéric Yhuel  wrote:

I noticed that referential integrity checks aren't currently
parallelized. Is it on purpose?


It's not 100% clear to me that it is safe. But on the other hand, it's
also not 100% clear to me that it is unsafe.

Generally, I only added CURSOR_OPT_PARALLEL_OK in places where I was
confident that nothing bad would happen, and this wasn't one of those
places. It's something of a nested-query environment -- your criterion
#6. How do we know that the surrounding query isn't already parallel?
Perhaps because it must be DML,


Did you mean DDL?


but I think it must be more important
to support parallel DML than to support this.




I'm not sure what the right thing to do here is, but I think it would
be good if your patch included a test case.



Would that be a regression test? (in src/test/regress ?)

If yes, what should I check? Is it good enough to load auto_explain and 
check that the query triggered by a foreign key addition is parallelized?


Best regards,
Frédéric




Re: Allow parallel plan for referential integrity checks?

2022-02-14 Thread Frédéric Yhuel




On 2/11/22 00:16, Andreas Karlsson wrote:

On 2/7/22 11:26, Frédéric Yhuel wrote:
Attached is a (naive) patch that aims to fix the case of a FK 
addition, but the handling of the flag CURSOR_OPT_PARALLEL_OK, 
generally speaking, looks rather hackish.


Thanks, for the patch. You can add it to the current open commitfest 
(https://commitfest.postgresql.org/37/).




OK, I just did. Please let me know if I did something wrong.

Best regards,
Frédéric




Should pg_restore vacuum the tables before the post-data stage?

2022-02-08 Thread Frédéric Yhuel

Hello,

I was wondering if pg_restore should call VACUUM ANALYZE for all tables, 
after the "COPY" stage, and before the "post-data" stage.


Indeed, without such a VACUUM, the visibility map isn't available. 
Depending on the size of the tables and on the configuration, a foreign 
key constraint check would have to perform either:


 * a seq scan of the referencing table, even if an index exists for it;
 * an index-only scan, with 100% heap fetches.

And it's the same for the referenced table, of course, except that the 
index always exists.


In both cases, it could be very slow.

What's more, the seq scan isn't parallelized [1].

It seems to me that in most cases, autovacuum doesn't have enough time 
to process the tables (or only manages some of them) before the 
post-data stage.


What do you think?

If the matter has already been discussed previously, can you point me to 
the thread discussing it?


Best regards,
Frédéric

[1] 
https://www.postgresql.org/message-id/flat/0d21e3b4-dcde-290c-875e-6ed5013e8e52%40dalibo.com





Allow parallel plan for referential integrity checks?

2022-02-07 Thread Frédéric Yhuel

Hello,

I noticed that referential integrity checks aren't currently 
parallelized. Is it on purpose?


From the documentation [1], the planner will not generate a parallel 
plan for a given query if any of the following are true:


1) The system is running in single-user mode.
2) max_parallel_workers_per_gather = 0.
3) The query writes any data or locks any database rows.
4) The query might be suspended during execution (cursor or  PL/pgSQL loop).
5) The query uses any function marked PARALLEL UNSAFE.
6) The query is running inside of another query that is already parallel.

From my understanding of the code, it seems to me that the fourth 
condition is not always accurately detected. In particular, the query 
triggered by a foreign key addition (a Left Excluding JOIN between the 
two tables) isn't parallelized, because the flag CURSOR_OPT_PARALLEL_OK 
hasn't been set at some point. I couldn't find any reason why not, but 
maybe I missed something.


Attached is a (naive) patch that aims to fix the case of a FK addition, 
but the handling of the flag CURSOR_OPT_PARALLEL_OK, generally speaking, 
looks rather hackish.


Best regards,
Frédéric

[1] 
https://www.postgresql.org/docs/current/when-can-parallel-query-be-used.html

From 1353a4b7e7d08b13cbaa85a5f9ae26819b5cf2c8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Yhuel?= 
Date: Sun, 6 Feb 2022 13:31:55 +0100
Subject: [PATCH] Allow parallel plan for referential integrity checks

---
 src/backend/utils/adt/ri_triggers.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 96269fc2ad..84c16b6962 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -1317,6 +1317,7 @@ RI_Initial_Check(Trigger *trigger, Relation fk_rel, Relation pk_rel)
 	char		workmembuf[32];
 	int			spi_result;
 	SPIPlanPtr	qplan;
+	SPIPrepareOptions options;
 
 	riinfo = ri_FetchConstraintInfo(trigger, fk_rel, false);
 
@@ -1483,10 +1484,12 @@ RI_Initial_Check(Trigger *trigger, Relation fk_rel, Relation pk_rel)
 	 * Generate the plan.  We don't need to cache it, and there are no
 	 * arguments to the plan.
 	 */
-	qplan = SPI_prepare(querybuf.data, 0, NULL);
+	memset(&options, 0, sizeof(options));
+	options.cursorOptions = CURSOR_OPT_PARALLEL_OK;
+	qplan = SPI_prepare_extended(querybuf.data, &options);
 
 	if (qplan == NULL)
-		elog(ERROR, "SPI_prepare returned %s for %s",
+		elog(ERROR, "SPI_prepare_extended returned %s for %s",
 			 SPI_result_code_string(SPI_result), querybuf.data);
 
 	/*
-- 
2.30.2