Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-12 Thread Jeff Davis
On Fri, 2024-07-12 at 16:11 -0700, Noah Misch wrote:
> Since refresh->relation is a RangeVar, this departs from the standard against
> repeated name lookups, from CVE-2014-0062 (commit 5f17304).

Interesting, thank you.

I did a rough refactor and attached v3. Aside from cleanup issues, is
this what you had in mind?

Regards,
Jeff Davis

From cf1722bf0716897f42ce89282f361af4be56b60b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 9 Jul 2024 11:12:46 -0700
Subject: [PATCH v3 1/2] Add missing RestrictSearchPath() calls.

Reported-by: Noah Misch
Backpatch-through: 17
Discussion: https://postgr.es/m/20240630222344.db.nmi...@google.com
---
 src/backend/commands/indexcmds.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2caab88aa58..c5a56c75f69 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1230,6 +1230,7 @@ DefineIndex(Oid tableId,
 	 */
 	AtEOXact_GUC(false, root_save_nestlevel);
 	root_save_nestlevel = NewGUCNestLevel();
+	RestrictSearchPath();
 
 	/* Add any requested comment */
 	if (stmt->idxcomment != NULL)
@@ -2027,6 +2028,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 			{
 				SetUserIdAndSecContext(save_userid, save_sec_context);
 				*ddl_save_nestlevel = NewGUCNestLevel();
+				RestrictSearchPath();
 			}
 		}
 
@@ -2074,6 +2076,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 		{
 			SetUserIdAndSecContext(save_userid, save_sec_context);
 			*ddl_save_nestlevel = NewGUCNestLevel();
+			RestrictSearchPath();
 		}
 
 		/*
@@ -2104,6 +2107,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 			{
 				SetUserIdAndSecContext(save_userid, save_sec_context);
 				*ddl_save_nestlevel = NewGUCNestLevel();
+				RestrictSearchPath();
 			}
 
 			/*
-- 
2.34.1

From e1f89d13d46cdf040572fc1ce760d5ebc1cacc76 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 12 Jul 2024 14:23:17 -0700
Subject: [PATCH v3 2/2] For materialized views, use REFRESH to load data
 during creation.

Previously, CREATE MATERIALIZED VIEW ... WITH DATA populated the MV
during creation. Instead, use REFRESH, which locks down
security-restricted operations and restricts the
search_path. Otherwise, it's possible to create MVs that cannot be
refreshed.

Reported-by: Noah Misch
Backpatch-through: 17
Discussion: https://postgr.es/m/20240630222344.db.nmi...@google.com
---
 src/backend/commands/createas.c | 33 +--
 src/backend/commands/matview.c  | 43 +++--
 src/include/commands/matview.h  |  3 ++
 src/test/regress/expected/namespace.out |  4 +--
 4 files changed, 47 insertions(+), 36 deletions(-)

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..54a554f4b67 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -225,10 +225,8 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	Query	   *query = castNode(Query, stmt->query);
 	IntoClause *into = stmt->into;
 	bool		is_matview = (into->viewQuery != NULL);
+	bool		do_refresh = false;
 	DestReceiver *dest;
-	Oid			save_userid = InvalidOid;
-	int			save_sec_context = 0;
-	int			save_nestlevel = 0;
 	ObjectAddress address;
 	List	   *rewritten;
 	PlannedStmt *plan;
@@ -263,18 +261,13 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	Assert(query->commandType == CMD_SELECT);
 
 	/*
-	 * For materialized views, lock down security-restricted operations and
-	 * arrange to make GUC variable changes local to this command.  This is
-	 * not necessary for security, but this keeps the behavior similar to
-	 * REFRESH MATERIALIZED VIEW.  Otherwise, one could create a materialized
-	 * view not possible to refresh.
+	 * For materialized views, always skip data during table creation, and use
+	 * REFRESH instead.
 	 */
 	if (is_matview)
 	{
-		GetUserIdAndSecContext(&save_userid, &save_sec_context);
-		SetUserIdAndSecContext(save_userid,
-							   save_sec_context | SECURITY_RESTRICTED_OPERATION);
-		save_nestlevel = NewGUCNestLevel();
+		do_refresh = !into->skipData;
+		into->skipData = true;
 	}
 
 	if (into->skipData)
@@ -346,13 +339,19 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 		PopActiveSnapshot();
 	}
 
-	if (is_matview)
+	/*
+	 * For materialized views, use REFRESH, which locks down
+	 * security-restricted operations and restricts the search_path.
+	 * Otherwise, one could create a materialized view not possible to
+	 * refresh.
+	 */
+	if (do_refresh)
 	{
-		/* Roll back any GUC changes */
-		AtEOXact_GUC(false, save_nestlevel);
+		RefreshMatViewByOid(address.objectId, false, false,
+			pstate->p_sourcetext, NULL, qc);
 
-		/* Restore userid and security context */
-		SetUserIdAndSecContext(save_userid, save_sec_context);
+		if (qc)
+			qc->commandTag = CMDTAG_SELECT;
 	}
 
 	return address;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 

Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-12 Thread Noah Misch
On Fri, Jul 12, 2024 at 02:50:52PM -0700, Jeff Davis wrote:
> On Thu, 2024-07-11 at 05:52 -0700, Noah Misch wrote:
> > > I could try to refactor it into two statements and execute them
> > > separately, or I could try to rewrite the statement to use a fully-
> > > qualified destination table before execution. Thoughts?
> > 
> > Those sound fine.  Also fine: just adding a comment on why creation
> > namespace considerations led to not doing it there.
> 
> Attached. 0002 separates the CREATE MATERIALIZED VIEW ... WITH DATA
> into (effectively):
> 
>CREATE MATERIALIZED VIEW ... WITH NO DATA;
>REFRESH MATERIALIZED VIEW ...;
> 
> Using refresh also achieves the stated goal more directly: to (mostly)
> ensure that a subsequent REFRESH will succeed.
> 
> Note: the creation itself no longer executes in a security-restricted
> context, but I don't think that's a problem. The only reason it's using
> the security restricted context is so the following REFRESH will
> succeed, right?

Right, that's the only reason.

> @@ -346,13 +339,21 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
>   PopActiveSnapshot();
>   }
>  
> - if (is_matview)
> + /*
> +  * For materialized views, use REFRESH, which locks down
> +  * security-restricted operations and restricts the search_path.
> +  * Otherwise, one could create a materialized view not possible to
> +  * refresh.
> +  */
> + if (do_refresh)
>   {
> - /* Roll back any GUC changes */
> - AtEOXact_GUC(false, save_nestlevel);
> + RefreshMatViewStmt *refresh = makeNode(RefreshMatViewStmt);
>  
> - /* Restore userid and security context */
> - SetUserIdAndSecContext(save_userid, save_sec_context);
> + refresh->relation = into->rel;
> + ExecRefreshMatView(refresh, pstate->p_sourcetext, NULL, qc);
> +
> + if (qc)
> + qc->commandTag = CMDTAG_SELECT;
>   }

Since refresh->relation is a RangeVar, this departs from the standard against
repeated name lookups, from CVE-2014-0062 (commit 5f17304).
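The hazard behind that standard is a time-of-check/time-of-use race: resolving the same name twice can yield two different objects if the catalog changes in between. A generic sketch of the safe pattern — resolve the name to a stable identifier once, then operate only on the identifier, as RefreshMatViewByOid does in the v3 patch. The catalog structure and function names below are hypothetical illustrations, not PostgreSQL's actual RangeVar machinery:

```c
#include <string.h>

/* Hypothetical catalog: maps names to object IDs.  A concurrent DDL
 * command may re-point a name between two lookups. */
typedef struct
{
	const char *name;
	int			oid;
} CatalogEntry;

static int
lookup_oid(const CatalogEntry *catalog, int n, const char *name)
{
	for (int i = 0; i < n; i++)
		if (strcmp(catalog[i].name, name) == 0)
			return catalog[i].oid;
	return -1;
}

/* Safe pattern: the caller resolved the name once; from here on we only
 * ever touch the object identified by that OID. */
static int
refresh_by_oid(int oid)
{
	return oid;					/* stand-in for operating on the locked object */
}
```

Resolving by name a second time (the RangeVar approach) would silently follow the renamed entry, while the saved OID keeps targeting the original object.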




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-12 Thread Melih Mutlu
Hi Robert,

Robert Haas wrote on Thu, 11 Jul 2024 at 23:09:

> > Ideally, no CTE would be needed here, but unfortunately, there's no
> > way to know the CacheMemoryContext's ID beforehand.  We could make the
> > ID more stable if we did a breadth-first traversal of the context.
> > i.e., assign IDs in level order.  This would stop TopMemoryContext's
> > 2nd child getting a different ID if its first child became a parent
> > itself.
>
> Do we ever have contexts with the same name at the same level? Could
> we just make the path an array of strings, so that you could then say
> something like this...
>
> SELECT * FROM pg_backend_memory_contexts where path[2] =
> 'CacheMemoryContext'
>
> ...and get all the things with that in the path?
>

I just ran the query below to see if we have any contexts with the same level
and name.

postgres=# select level, name, count(*) from pg_backend_memory_contexts
group by level, name having count(*)>1;
 level |    name     | count
-------+-------------+-------
     3 | index info  |    90
     5 | ExprContext |     5

Seems like it's a possible case. But those contexts might not be the most
interesting ones. I guess the contexts that most users would be interested
in will likely be unique on their level and by their name, so we might not
be concerned with contexts like those two from the above result and could
use names instead of transient IDs. But I don't think we can guarantee that
a name-based path column would be completely reliable in all cases.


> > select * from pg_backend_memory_contexts;
> > -- Observe that CacheMemoryContext has ID=22 and level=2. Get the
> > total of that and all of its descendants.
> > select sum(total_bytes) from pg_backend_memory_contexts where path[2] =
> 22;
> > -- or just it and direct children
> > select sum(total_bytes) from pg_backend_memory_contexts where path[2]
> > = 22 and level <= 3;
>
> I'm doubtful about this because nothing prevents the set of memory
> contexts from changing between one query and the next. We should try
> to make it so that it's easy to get what you want in a single query.
>

Correct. Nothing prevents contexts from changing between executions. With
David's change to use breadth-first traversal, contexts at upper levels are
less likely to change. Knowing this may be useful in some cases. IMHO there
is no harm in making those IDs slightly more "stable", even though there is
no guarantee. My concern is whether we should document this situation. If
we should, how do we explain that the IDs are transient and can change, but
also may not change if they're closer to TopMemoryContext? If it's better
not to mention this in the documentation, does it really matter, since most
users would not be aware?


I've been also thinking if we should still have the parent column, as
finding out the parent is also possible via looking into the path. What do
you think?

Thanks,
-- 
Melih Mutlu
Microsoft


Re: CREATE INDEX CONCURRENTLY on partitioned index

2024-07-12 Thread Ilya Gladyshev


On 12.07.2024 01:01, Michael Paquier wrote:

Please let's move this point to its own thread and deal with it with
an independent patch.  Hiding that in a thread that's already quite
long is not a good idea.  This needs proper review, and a separate
thread with a good subject to describe the problem will attract a
better audience to deal with the problem you are seeing.

I was not paying much attention until you mentioned that this was
an issue with HEAD.
--
Michael


Sure, created a separate thread [1]. Please disregard the second patch 
in this thread. Duplicating the last version of the relevant patch here 
to avoid any confusion.


[1] 
https://www.postgresql.org/message-id/b72f2d89-820a-4fa2-9058-b155cf646f4f%40gmail.com


From acf5cf5d4a984c0f8635a25e03c23409601c0c93 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Thu, 23 May 2024 18:13:41 +0100
Subject: [PATCH v5 1/2] Allow CREATE INDEX CONCURRENTLY on partitioned table

---
 doc/src/sgml/ddl.sgml |  10 +-
 doc/src/sgml/ref/create_index.sgml|  14 +-
 src/backend/commands/indexcmds.c  | 228 +-
 .../isolation/expected/partitioned-cic.out| 135 +++
 src/test/isolation/isolation_schedule |   1 +
 src/test/isolation/specs/partitioned-cic.spec |  57 +
 src/test/regress/expected/indexing.out| 127 +-
 src/test/regress/sql/indexing.sql |  26 +-
 8 files changed, 520 insertions(+), 78 deletions(-)
 create mode 100644 src/test/isolation/expected/partitioned-cic.out
 create mode 100644 src/test/isolation/specs/partitioned-cic.spec

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 6aab79e901..904978c6e5 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -4314,14 +4314,12 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02
  As mentioned earlier, it is possible to create indexes on partitioned
  tables so that they are applied automatically to the entire hierarchy.
  This can be very convenient as not only will all existing partitions be
- indexed, but any future partitions will be as well.  However, one
- limitation when creating new indexes on partitioned tables is that it
- is not possible to use the CONCURRENTLY
- qualifier, which could lead to long lock times.  To avoid this, you can
- use CREATE INDEX ON ONLY the partitioned table, which
+ indexed, but any future partitions will be as well. For more control over
+ locking of the partitions you can use CREATE INDEX ON ONLY
+ on the partitioned table, which
  creates the new index marked as invalid, preventing automatic application
  to existing partitions.  Instead, indexes can then be created individually
- on each partition using CONCURRENTLY and
+ on each partition and
  attached to the partitioned index on the parent
  using ALTER INDEX ... ATTACH PARTITION.  Once indexes for
  all the partitions are attached to the parent index, the parent index will
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 621bc0e253..2366cfd9b5 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -645,7 +645,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 If a problem arises while scanning the table, such as a deadlock or a
 uniqueness violation in a unique index, the CREATE INDEX
-command will fail but leave behind an invalid index. This index
+command will fail but leave behind an invalid index.
+If this happens while building an index concurrently on a partitioned
+table, the command can also leave behind valid or
+invalid indexes on table partitions.  The invalid index
 will be ignored for querying purposes because it might be incomplete;
 however it will still consume update overhead. The psql
 \d command will report such an index as INVALID:
@@ -692,15 +695,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 309389e20d..dcb4ea89e9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -95,6 +95,11 @@ static char *ChooseIndexName(const char *tabname, Oid namespaceId,
 			 bool primary, bool isconstraint);
 static char *ChooseIndexNameAddition(const List *colnames);
 static List *ChooseIndexColumnNames(const List *indexElems);
+static void DefineIndexConcurrentInternal(Oid relationId,
+		  Oid indexRelationId,

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-12 Thread Melih Mutlu
Hi David,

Thanks for v8 patch. Please see attached v9.

David Rowley wrote on Thu, 11 Jul 2024 at 04:16:

> I did a bit more work in the attached.  I changed "level" to be
> 1-based and because it's the column before "path" I find it much more
> intuitive (assuming no prior knowledge) that the "path" column relates
> to "level" somehow as it's easy to see that "level" is the same number
> as the number of elements in "path". With 0-based levels, that's not
> the case.
>
> Please see the attached patch.  I didn't update any documentation.


I updated documentation for path and level columns and also fixed the tests
as level starts from 1.

+ while (queue != NIL)
> + {
> + List *nextQueue = NIL;
> + ListCell *lc;
> +
> + foreach(lc, queue)
> + {


I don't think we need this outer while loop. Appending to the end of a
queue naturally results in top-to-bottom order anyway, so keeping two
lists, "queue" and "nextQueue", is not necessary. I believe it's safe to
append to a list while iterating over it in a foreach loop. v9 removes
nextQueue and appends directly to queue.
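The level-order ID assignment David describes, and the single-queue append-while-iterating approach proposed above, can be sketched outside of PostgreSQL's pg_list.h API. The Node structure and the fixed queue capacity below are illustrative only; real code would grow the queue dynamically:

```c
#include <stdlib.h>

/* Hypothetical stand-in for MemoryContextData: not the real struct. */
typedef struct Node
{
	struct Node *children[4];
	int			nchildren;
	int			id;				/* assigned in level order */
} Node;

/*
 * Assign IDs in breadth-first (level) order using a single growing queue,
 * appending children to the same queue being iterated.  A context's
 * siblings keep their relative IDs even if an earlier sibling gains
 * children of its own.
 */
static void
assign_ids_bfs(Node *root)
{
	Node	  **queue = malloc(sizeof(Node *) * 64);	/* demo capacity only */
	int			head = 0,
				tail = 0,
				next_id = 1;

	queue[tail++] = root;
	while (head < tail)
	{
		Node	   *n = queue[head++];	/* pop front */

		n->id = next_id++;
		for (int i = 0; i < n->nchildren; i++)
			queue[tail++] = n->children[i]; /* append while "iterating" */
	}
	free(queue);
}
```

With this ordering, TopMemoryContext's 2nd child keeps its ID even when the 1st child becomes a parent itself.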

Thanks,
-- 
Melih Mutlu
Microsoft


v9-0001-Add-path-column-into-pg_backend_memory_contexts.patch
Description: Binary data


REINDEX not updating partition progress

2024-07-12 Thread Ilya Gladyshev

Hi,

While working on CIC for partitioned tables [1], I noticed that REINDEX
for partitioned tables is not keeping track of the progress of partitioned
tables, so I'm creating a separate thread for this fix as suggested.


The way REINDEX for partitioned tables works now, ReindexMultipleInternal
treats every partition as an independent table without keeping track of
partitions_done/total counts, because this is just generic code for
processing multiple tables. The patch addresses that by passing down a
flag to distinguish partitions from independent tables in
ReindexMultipleInternal and its callees. I also noticed that the
partitioned CREATE INDEX progress tracking could benefit from a
progress_index_partition_done function that zeroizes params in addition
to incrementing the counter, so I applied it there as well.
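The progress_index_partition_done idea — reset the per-partition counters before bumping the partitions-done count, so stale values from the previous partition don't linger — can be sketched generically. The struct and field names below are hypothetical illustrations, not the actual pgstat progress machinery:

```c
/* Hypothetical progress row with a handful of counters, loosely modeled
 * on pg_stat_progress_create_index.  Not the real pgstat structures. */
typedef struct
{
	long		blocks_done;
	long		tuples_done;
	long		partitions_done;
} ProgressRow;

/*
 * Zeroize the per-partition counters, then increment partitions_done, so
 * a reader never sees the previous partition's totals attributed to the
 * next partition being processed.
 */
static void
progress_partition_done(ProgressRow *row)
{
	row->blocks_done = 0;
	row->tuples_done = 0;
	row->partitions_done++;
}
```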


[1] 
https://www.postgresql.org/message-id/55cfae76-2ffa-43ed-a7e7-901bffbebee4%40gmail.com
From 18baa028e1cc5c39347b9126ec1a96eb99e8e3e1 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Tue, 11 Jun 2024 17:48:08 +0100
Subject: [PATCH] make REINDEX track partition progress

---
 src/backend/catalog/index.c  | 11 --
 src/backend/commands/indexcmds.c | 61 +---
 src/include/catalog/index.h  |  1 +
 3 files changed, 65 insertions(+), 8 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a819b4197c..90488f9140 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3560,6 +3560,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
 	bool		progress = ((params->options & REINDEXOPT_REPORT_PROGRESS) != 0);
+	bool		partition = ((params->options & REINDEXOPT_PARTITION) != 0);
 	bool		set_tablespace = false;
 
 	pg_rusage_init();
@@ -3605,8 +3606,9 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 			indexId
 		};
 
-		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
-	  heapId);
+		if (!partition)
+			pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX,
+		  heapId);
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
@@ -3846,8 +3848,11 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 	index_close(iRel, NoLock);
 	table_close(heapRelation, NoLock);
 
-	if (progress)
+	if (progress && !partition)
+	{
+		/* progress for partitions is tracked in the caller */
 		pgstat_progress_end_command();
+	}
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2caab88aa5..2196279108 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -113,6 +113,7 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 		const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
 static inline void set_indexsafe_procflags(void);
+static inline void progress_index_partition_done(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -1569,7 +1570,7 @@ DefineIndex(Oid tableId,
 		else
 		{
 			/* Update progress for an intermediate partitioned index itself */
-			pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1);
+			progress_index_partition_done();
 		}
 
 		return address;
@@ -1590,7 +1591,7 @@ DefineIndex(Oid tableId,
 		if (!OidIsValid(parentIndexId))
 			pgstat_progress_end_command();
 		else
-			pgstat_progress_incr_param(PROGRESS_CREATEIDX_PARTITIONS_DONE, 1);
+			progress_index_partition_done();
 
 		return address;
 	}
@@ -3213,6 +3214,14 @@ ReindexPartitions(const ReindexStmt *stmt, Oid relid, const ReindexParams *param
 	ListCell   *lc;
 	ErrorContextCallback errcallback;
 	ReindexErrorInfo errinfo;
+	ReindexParams newparams;
+	int			progress_params[3] = {
+		PROGRESS_CREATEIDX_COMMAND,
+		PROGRESS_CREATEIDX_PHASE,
+		PROGRESS_CREATEIDX_PARTITIONS_TOTAL
+	};
+	int64		progress_values[3];
+	Oid			heapId = relid;
 
 	Assert(RELKIND_HAS_PARTITIONS(relkind));
 
@@ -3274,11 +3283,28 @@ ReindexPartitions(const ReindexStmt *stmt, Oid relid, const ReindexParams *param
 		MemoryContextSwitchTo(old_context);
 	}
 
+	if (relkind == RELKIND_PARTITIONED_INDEX)
+	{
+		heapId = IndexGetRelation(relid, true);
+	}
+
+	progress_values[0] = (params->options & REINDEXOPT_CONCURRENTLY) != 0 ?
+		PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY :
+		PROGRESS_CREATEIDX_COMMAND_REINDEX;
+	progress_values[1] = 0;
+	progress_values[2] = list_length(partitions);
+	pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, heapId);
+	pgstat_progress_update_multi_param(3, progress_params, progress_values);
+
 	/*
 	 * Process each partition listed in a separate transaction.  Note that
 	 * this commits and then starts a new transaction immediately.
 	 */
-	ReindexMultipleInternal(stmt, partitions, params);
+	newparams = *params;
+	newparams.options |= REINDEXOPT_PARTITION;
+	ReindexMultipleInternal(stmt, partitions, 

Re: MAINTAIN privilege -- what do we need to un-revert it?

2024-07-12 Thread Jeff Davis
On Thu, 2024-07-11 at 05:52 -0700, Noah Misch wrote:
> > I could try to refactor it into two statements and execute them
> > separately, or I could try to rewrite the statement to use a fully-
> > qualified destination table before execution. Thoughts?
> 
> Those sound fine.  Also fine: just adding a comment on why creation
> namespace considerations led to not doing it there.

Attached. 0002 separates the CREATE MATERIALIZED VIEW ... WITH DATA
into (effectively):

   CREATE MATERIALIZED VIEW ... WITH NO DATA;
   REFRESH MATERIALIZED VIEW ...;

Using refresh also achieves the stated goal more directly: to (mostly)
ensure that a subsequent REFRESH will succeed.

Note: the creation itself no longer executes in a security-restricted
context, but I don't think that's a problem. The only reason it's using
the security restricted context is so the following REFRESH will
succeed, right?

Regards,
Jeff Davis

From cf1722bf0716897f42ce89282f361af4be56b60b Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 9 Jul 2024 11:12:46 -0700
Subject: [PATCH v2 1/2] Add missing RestrictSearchPath() calls.

Reported-by: Noah Misch
Backpatch-through: 17
Discussion: https://postgr.es/m/20240630222344.db.nmi...@google.com
---
 src/backend/commands/indexcmds.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2caab88aa58..c5a56c75f69 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1230,6 +1230,7 @@ DefineIndex(Oid tableId,
 	 */
 	AtEOXact_GUC(false, root_save_nestlevel);
 	root_save_nestlevel = NewGUCNestLevel();
+	RestrictSearchPath();
 
 	/* Add any requested comment */
 	if (stmt->idxcomment != NULL)
@@ -2027,6 +2028,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 			{
 				SetUserIdAndSecContext(save_userid, save_sec_context);
 				*ddl_save_nestlevel = NewGUCNestLevel();
+				RestrictSearchPath();
 			}
 		}
 
@@ -2074,6 +2076,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 		{
 			SetUserIdAndSecContext(save_userid, save_sec_context);
 			*ddl_save_nestlevel = NewGUCNestLevel();
+			RestrictSearchPath();
 		}
 
 		/*
@@ -2104,6 +2107,7 @@ ComputeIndexAttrs(IndexInfo *indexInfo,
 			{
 				SetUserIdAndSecContext(save_userid, save_sec_context);
 				*ddl_save_nestlevel = NewGUCNestLevel();
+				RestrictSearchPath();
 			}
 
 			/*
-- 
2.34.1

From 00323c08d07f1fb53e29e89e44b9393d442c15b1 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Fri, 12 Jul 2024 14:23:17 -0700
Subject: [PATCH v2 2/2] For materialized views, use REFRESH to load data
 during creation.

Previously, CREATE MATERIALIZED VIEW ... WITH DATA populated the MV
during creation. Instead, use REFRESH, which locks down
security-restricted operations and restricts the
search_path. Otherwise, it's possible to create MVs that cannot be
refreshed.

Reported-by: Noah Misch
Backpatch-through: 17
Discussion: https://postgr.es/m/20240630222344.db.nmi...@google.com
---
 src/backend/commands/createas.c | 35 +
 src/test/regress/expected/namespace.out |  4 +--
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 62050f4dc59..166ade5faa7 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -225,10 +225,8 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	Query	   *query = castNode(Query, stmt->query);
 	IntoClause *into = stmt->into;
 	bool		is_matview = (into->viewQuery != NULL);
+	bool		do_refresh = false;
 	DestReceiver *dest;
-	Oid			save_userid = InvalidOid;
-	int			save_sec_context = 0;
-	int			save_nestlevel = 0;
 	ObjectAddress address;
 	List	   *rewritten;
 	PlannedStmt *plan;
@@ -263,18 +261,13 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	Assert(query->commandType == CMD_SELECT);
 
 	/*
-	 * For materialized views, lock down security-restricted operations and
-	 * arrange to make GUC variable changes local to this command.  This is
-	 * not necessary for security, but this keeps the behavior similar to
-	 * REFRESH MATERIALIZED VIEW.  Otherwise, one could create a materialized
-	 * view not possible to refresh.
+	 * For materialized views, always skip data during table creation, and use
+	 * REFRESH instead.
 	 */
 	if (is_matview)
 	{
-		GetUserIdAndSecContext(&save_userid, &save_sec_context);
-		SetUserIdAndSecContext(save_userid,
-							   save_sec_context | SECURITY_RESTRICTED_OPERATION);
-		save_nestlevel = NewGUCNestLevel();
+		do_refresh = !into->skipData;
+		into->skipData = true;
 	}
 
 	if (into->skipData)
@@ -346,13 +339,21 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 		PopActiveSnapshot();
 	}
 
-	if (is_matview)
+	/*
+	 * For materialized views, use REFRESH, which locks down
+	 * security-restricted operations and restricts the search_path.
+	 * Otherwise, one could create a materialized view not possible to
+	 * refresh.
+	 */

Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2024-07-12 Thread Sutou Kouhei
Hi,

I'm reviewing patches in Commitfest 2024-07 from top to bottom:
https://commitfest.postgresql.org/48/

This is the 2nd patch:
https://commitfest.postgresql.org/48/4573/

FYI: https://commitfest.postgresql.org/48/4681/ is my patch.

In 
  "Re: pg_ctl start may return 0 even if the postmaster has been already 
started on Windows" on Tue, 4 Jun 2024 08:30:19 +0900,
  Michael Paquier  wrote:

> During a live review of this patch last week, as part of the Advanced
> Patch Workshop of pgconf.dev, it has been mentioned by Tom that we may
> be able to simplify the check on pmstart if the detection of the
> postmaster PID started by pg_ctl is more stable using the WIN32
> internals that this patch relies on.  I am not sure that this
> suggestion is right, though, because we should still care about the
> clock skew case as written in the surrounding comments?  Even if
> that's OK, I would assume that this should be an independent patch,
> written on top of the proposed v6-0001.

I reviewed the latest patch set and got a different
impression.

start_postmaster() on Windows uses cmd.exe for redirection
based on the comment in the function:

> /*
>  * As with the Unix case, it's easiest to use the shell (CMD.EXE) to
>  * handle redirection etc.  Unfortunately CMD.EXE lacks any equivalent of
>  * "exec", so we don't get to find out the postmaster's PID immediately.
>  */

It seems that we can use redirection by CreateProcess()
family functions without cmd.exe based on the following
documentation:

https://learn.microsoft.com/en-us/windows/win32/procthread/creating-a-child-process-with-redirected-input-and-output

How about changing start_postmaster() for Windows to start
postgres.exe directly so that it returns postgres.exe's PID,
not cmd.exe's PID? If we can do that, we don't need
pgwin32_find_postmaster_pid() in the patch set.
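A rough sketch of that direction, following the Microsoft documentation linked above. This is Windows-only and illustrative: error handling is omitted, the function name is hypothetical, and the log handle must have been created inheritable (SECURITY_ATTRIBUTES with bInheritHandle = TRUE) for the redirection to work:

```c
#include <windows.h>

/* Sketch: start postgres.exe directly with redirected stdout/stderr,
 * avoiding CMD.EXE, so pi->dwProcessId is postgres.exe's own PID. */
static BOOL
start_postmaster_direct(char *cmdline, HANDLE log_handle,
						PROCESS_INFORMATION *pi)
{
	STARTUPINFO si;

	ZeroMemory(&si, sizeof(si));
	si.cb = sizeof(si);
	si.dwFlags = STARTF_USESTDHANDLES;	/* redirect without CMD.EXE */
	si.hStdOutput = log_handle;			/* inheritable handle to the log file */
	si.hStdError = log_handle;
	si.hStdInput = GetStdHandle(STD_INPUT_HANDLE);

	/* bInheritHandles = TRUE so the child receives the redirected handles */
	return CreateProcess(NULL, cmdline, NULL, NULL, TRUE,
						 CREATE_NEW_PROCESS_GROUP, NULL, NULL, &si, pi);
}
```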


Thanks,
-- 
kou




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-07-12 Thread Daniel Gustafsson
> On 11 Jul 2024, at 23:22, Peter Eisentraut  wrote:

> The 0001 patch removes the functions pgtls_init_library() and pgtls_init() 
> but keeps the declarations in libpq-int.h.  This should be fixed.

Ah, nice catch. Done in the attached rebase.

--
Daniel Gustafsson



v14-0002-Remove-pg_strong_random-initialization.patch
Description: Binary data


v14-0001-Remove-support-for-OpenSSL-1.0.2.patch
Description: Binary data


Re: Internal error codes triggered by tests

2024-07-12 Thread Daniel Gustafsson
> On 10 Jul 2024, at 06:42, Michael Paquier  wrote:

>> SELECT format('BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
>> SET TRANSACTION SNAPSHOT ''%s''', repeat('-', 1000))
>> \gexec
>> ERROR:  XX000: could not open file "pg_snapshots/-...---" for reading: 
>> File name too long
>> LOCATION:  ImportSnapshot, snapmgr.c:1428
> 
> This one is fun.  errcode_for_file_access() does not know about
> ENAMETOOLONG, as an effect of the errno returned by AllocateFile().
> Perhaps it should map to ERRCODE_NAME_TOO_LONG?

Mapping this case to ERRCODE_NAME_TOO_LONG seems like a legit improvement, even
though the error is likely to be quite rare in production.
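The suggested change amounts to adding an ENAMETOOLONG branch to the errno switch in errcode_for_file_access(). A standalone sketch of the mapping, using SQLSTATE strings for illustration rather than the internal errcode macros the real function returns:

```c
#include <errno.h>

/* Illustrative errno-to-SQLSTATE mapping; PostgreSQL's
 * errcode_for_file_access() returns ERRCODE_* macros, not strings. */
static const char *
sqlstate_for_file_access(int saved_errno)
{
	switch (saved_errno)
	{
		case ENOENT:
			return "58P01";		/* undefined_file */
		case ENAMETOOLONG:
			return "42622";		/* name_too_long -- the proposed mapping */
		default:
			return "XX000";		/* internal_error -- today's fallback */
	}
}
```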

The rest mentioned upthread seems either not worth the effort or are likely to
be bugs warranting proper investigation.

--
Daniel Gustafsson





Re: Restart pg_usleep when interrupted

2024-07-12 Thread Sami Imseih

> What does your testing show when you don't have
> the extra check, i.e.,
> 
>   struct timespec delay;
>   struct timespec remain;
> 
>   delay.tv_sec = microsec / 1000000L;
>   delay.tv_nsec = (microsec % 1000000L) * 1000;
> 
>   while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
>   delay = remain;
> 


This is similar to the first attempt [1], 

+pg_usleep_handle_interrupt(long microsec, bool force)
 {
if (microsec > 0)
{
 #ifndef WIN32
struct timespec delay;
+   struct timespec remaining;
 
delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;
-   (void) nanosleep(&delay, NULL);
+
+   if (force)
+   while (nanosleep(&delay, &remaining) == -1 && errno == EINTR)
+   delay = remaining;


but Bertrand found long drifts [2] which I could not reproduce.
To safeguard against long drifts, continue to use the remain time, with an
additional safeguard to make sure the actual sleep does not exceed the
requested sleep time.

[1] 
https://www.postgresql.org/message-id/7D50DC5B-80C6-47B5-8DA8-A6C68A115EE5%40gmail.com
[2] 
https://www.postgresql.org/message-id/ZoPC5IeP4k7sZpki%40ip-10-97-1-34.eu-west-3.compute.internal


Regards,

Sami 



Re: Add support to TLS 1.3 cipher suites and curves lists

2024-07-12 Thread Daniel Gustafsson
> On 11 Jul 2024, at 23:16, Peter Eisentraut  wrote:

> It would be worth checking the discussion at 
> 
>  about strtok()/strtok_r() issues.  First, for list parsing, it sometimes 
> gives the wrong semantics, which I think might apply here.  Maybe it's worth 
> comparing this with the semantics that OpenSSL provides natively. And second, 
> strtok_r() is not available on Windows without the workaround provided in 
> that thread.
> 
> I'm doubtful that it's worth replicating all this list parsing logic instead 
> of just letting OpenSSL do it.  This is a very marginal feature after all.

The original author added the string parsing in order to provide a good error
message in case of an error in the list, and since that seemed like a nice idea
I kept it in my review revision.  With what you said above, I agree it's not
worth the extra complexity it brings, so the attached revision removes it.

--
Daniel Gustafsson



v4-0001-Support-multiple-ECDH-curves.patch
Description: Binary data


v4-0002-Support-TLSv1.3-cipher-suites.patch
Description: Binary data


Re: Restart pg_usleep when interrupted

2024-07-12 Thread Nathan Bossart
On Fri, Jul 12, 2024 at 12:14:56PM -0500, Sami Imseih wrote:
> 1/ TimestampDifference has a dependency on gettimeofday, 
> while my proposal utilizes clock_gettime. There are old discussions
> that did not reach a conclusion comparing both mechanisms. 
> My main conclusion from these hacker discussions [1], [2] and other 
> online discussions on the topic is clock_gettime should replace
> gettimeofday when possible. Precision is the main reason.
> 
> 2/ It no longer uses the remain time. I think the remain time
> is still required here. I did an unrealistic stress test which shows 
> the original proposal can handle frequent interruptions much better.

My comment was mostly about coding style and not about gettimeofday()
versus clock_gettime().  What does your testing show when you don't have
the extra check, i.e.,

struct timespec delay;
struct timespec remain;

delay.tv_sec = microsec / 1000000L;
delay.tv_nsec = (microsec % 1000000L) * 1000;

while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
delay = remain;

-- 
nathan




Re: Amcheck verification of GiST and GIN

2024-07-12 Thread Tomas Vondra
OK, one more issue report. I originally thought it's a bug in my patch
adding parallel builds for GIN indexes, but it turns out it happens even
with serial builds on master ...

If I build any GIN index, and then do gin_index_parent_check() on it, I
get this error:

create index jsonb_hash on messages using gin (msg_headers jsonb_path_ops);

select gin_index_parent_check('jsonb_hash');
ERROR:  index "jsonb_hash" has wrong tuple order, block 43932, offset 328

I did try investigating using pageinspect - the page seems to be the
right-most in the tree, judging by rightlink = InvalidBlockNumber:

test=# select gin_page_opaque_info(get_raw_page('jsonb_hash', 43932));
 gin_page_opaque_info
--
 (4294967295,0,{})
(1 row)

But gin_leafpage_items() apparently only works with compressed leaf
pages, so I'm not sure what's in the page. In any case, the index seems
to be working fine, so it seems like a bug in this patch.
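For reference, the 4294967295 in the opaque info above is 0xFFFFFFFF, i.e.
InvalidBlockNumber, which is what marks a right-most page; a trivial check
(type and constant reproduced here for illustration, not taken from amcheck):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* a page with no right sibling is the right-most page at its level */
static int
is_rightmost(BlockNumber rightlink)
{
	return rightlink == InvalidBlockNumber;
}
```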


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: gcc 13 warnings

2024-07-12 Thread Andres Freund
Hi,

On 2024-07-05 14:19:12 +0300, Aleksander Alekseev wrote:
> There is still a warning previously reported by Melanie:
>
> ```
> [1391/1944] Compiling C object src/pl/plpgsql/src/plpgsql.so.p/pl_exec.c.o
> In file included from ../src/include/access/htup_details.h:22,
>  from ../src/pl/plpgsql/src/pl_exec.c:21:
> In function ‘assign_simple_var’,
> inlined from ‘exec_set_found’ at ../src/pl/plpgsql/src/pl_exec.c:8382:2:
> ../src/include/varatt.h:230:36: warning: array subscript 0 is outside
> array bounds of ‘char[0]’ [-Warray-bounds]
>   230 | (((varattrib_1b_e *) (PTR))->va_tag)
>   |^
> ../src/include/varatt.h:94:12: note: in definition of macro 
> ‘VARTAG_IS_EXPANDED’
>94 | (((tag) & ~1) == VARTAG_EXPANDED_RO)
>   |^~~
> ../src/include/varatt.h:284:57: note: in expansion of macro ‘VARTAG_1B_E’
>   284 | #define VARTAG_EXTERNAL(PTR)
> VARTAG_1B_E(PTR)
>   | ^~~
> ../src/include/varatt.h:301:57: note: in expansion of macro ‘VARTAG_EXTERNAL’
>   301 | (VARATT_IS_EXTERNAL(PTR) &&
> !VARTAG_IS_EXPANDED(VARTAG_EXTERNAL(PTR)))
>   | 
> ^~~
> ../src/pl/plpgsql/src/pl_exec.c:8570:17: note: in expansion of macro
> ‘VARATT_IS_EXTERNAL_NON_EXPANDED’
>  8570 |
> VARATT_IS_EXTERNAL_NON_EXPANDED(DatumGetPointer(newvalue)))
>   | ^~~
> [1687/1944] Compiling C object
> src/test/modules/test_dsa/test_dsa.so.p/test_dsa.c.o^C
> ninja: build stopped: interrupted by user.
> ```

> The overall environment is Raspberry Pi 5 with pretty much default
> configuration - Raspbian etc.
>
> How to fix it? Absolutely no idea :)

I think it's actually a somewhat reasonable warning - the compiler can't know
that in exec_set_found() we'll always deal with typlen == 1 and thus can't
ever reach the inside of the branch it warns about.

Once the compiler knows about that "restriction", the warning vanishes. Try
adding the following to exec_set_found():

/*
 * Prevent spurious warning due to compiler not realizing
 * VARATT_IS_EXTERNAL_NON_EXPANDED() branch in assign_simple_var() isn't
 * reachable due to "found" being byvalue.
 */
if (var->datatype->typlen != 1)
pg_unreachable();

I'm somewhat inclined to think it'd be worth adding something along those
lines to avoid this warning ([1]).
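As a rough illustration of how such an unreachable hint works (this is not the
actual plpgsql code; pg_unreachable() is shown here mapped to GCC/Clang's
__builtin_unreachable(), and assign_byvalue_var() is a made-up stand-in):

```c
#include <assert.h>

#if defined(__GNUC__) || defined(__clang__)
#define pg_unreachable() __builtin_unreachable()
#else
#define pg_unreachable() ((void) 0)
#endif

/* Stand-in for assign_simple_var(): the branch handling external (varlena)
 * values can never be reached when the variable is pass-by-value with
 * typlen == 1; the hint lets the compiler drop that branch and silence the
 * array-bounds warning it would otherwise emit. */
static int
assign_byvalue_var(int typlen, int value)
{
	if (typlen != 1)
		pg_unreachable();

	return value;
}
```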

Greetings,

Andres Freund


[1]

In general we're actually hiding a lot of useful information from the compiler
in release builds, due to asserts not being enabled. I've been wondering about
a version of Assert() that isn't completely removed in release builds but
instead transform into the pg_unreachable() form for compilers with an
"efficient" pg_unreachable() (i.e. not using abort()).

We can't just do that for all asserts though, it only makes sense for ones
that are "trivial" in some form (i.e. the compiler can realize it doesn't
have side effects and doesn't need be generated).

That's about generating more optimized code though.




Re: Restart pg_usleep when interrupted

2024-07-12 Thread Sami Imseih
> 
> I'm imagining something like this:
> 
>struct timespec delay;
>TimestampTz end_time;
> 
>end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
> 
>do
>{
>long   secs;
>int microsecs;
> 
>TimestampDifference(GetCurrentTimestamp(), end_time,
>&secs, &microsecs);
> 
>delay.tv_sec = secs;
>delay.tv_nsec = microsecs * 1000;
> 
>} while (nanosleep(&delay, NULL) == -1 && errno == EINTR);
> 

I do agree that this is cleaner code, but I am not sure I like this.


1/ TimestampDifference has a dependency on gettimeofday, 
while my proposal utilizes clock_gettime. There are old discussions
that did not reach a conclusion comparing both mechanisms. 
My main conclusion from these hacker discussions [1], [2] and other 
online discussions on the topic is clock_gettime should replace
gettimeofday when possible. Precision is the main reason.

2/ It no longer uses the remain time. I think the remain time
is still required here. I did an unrealistic stress test which shows 
the original proposal can handle frequent interruptions much better.

#1 in one session kicked off a vacuum

set vacuum_cost_delay = 10;
set vacuum_cost_limit = 1;
set client_min_messages = log;
update large_tbl set version = 1;
vacuum (verbose, parallel 4) large_tbl;

#2 in another session, ran a loop to continually
interrupt the vacuum leader. This was during the
“heap scan” phase of the vacuum.

PID=< pid of vacuum leader >
while :
do
kill -USR1 $PID
done


Using the proposed loop with the remainder, I noticed that
the actual time reported remains close to the requested
delay time.

LOG:  10.00,10.013420
LOG:  10.00,10.011188
LOG:  10.00,10.010860
LOG:  10.00,10.014839
LOG:  10.00,10.004542
LOG:  10.00,10.006035
LOG:  10.00,10.012230
LOG:  10.00,10.014535
LOG:  10.00,10.009645
LOG:  10.00,10.000817
LOG:  10.00,10.002162
LOG:  10.00,10.011721
LOG:  10.00,10.011655

Using the approach mentioned by Nathan, there
are large differences between requested and actual time.

LOG:  10.00,17.801778
LOG:  10.00,12.795450
LOG:  10.00,11.793723
LOG:  10.00,11.796317
LOG:  10.00,13.785993
LOG:  10.00,11.803775
LOG:  10.00,15.782767
LOG:  10.00,31.783901
LOG:  10.00,19.792440
LOG:  10.00,21.795795
LOG:  10.00,18.800412
LOG:  10.00,16.782886
LOG:  10.00,10.795197
LOG:  10.00,14.79
LOG:  10.00,29.806556
LOG:  10.00,18.810784
LOG:  10.00,11.804956
LOG:  10.00,24.809812
LOG:  10.00,25.815600
LOG:  10.00,22.809493
LOG:  10.00,22.790908
LOG:  10.00,19.699097
LOG:  10.00,23.795613
LOG:  10.00,24.797078

Let me know what you think.

[1] https://www.postgresql.org/message-id/flat/31856.1400021891%40sss.pgh.pa.us
[2] 
https://www.postgresql.org/message-id/flat/E1cO7fR-0003y0-9E%40gemulon.postgresql.org



Regards,

Sami 



Re: Remove dependence on integer wrapping

2024-07-12 Thread Nathan Bossart
On Sat, Jul 06, 2024 at 07:04:38PM -0400, Joseph Koshakow wrote:
> I've added another patch, 0004, to resolve the jsonb wrap-arounds.
> 
> The other patches, 0001, 0002, and 0003 are unchanged but have their
> version number incremented.

IIUC some of these changes are bug fixes.  Can we split out the bug fixes
to their own patches so that they can be back-patched?

-- 
nathan




Re: Allow non-superuser to cancel superuser tasks.

2024-07-12 Thread Nathan Bossart
On Fri, Jul 12, 2024 at 02:21:09PM +0900, Michael Paquier wrote:
> On Thu, Jul 11, 2024 at 08:50:57PM -0500, Nathan Bossart wrote:
>> I'm not following how this is guaranteed to trigger an autovacuum quickly.
>> Shouldn't we set autovacuum_vacuum_insert_threshold to 1 so that it is
>> eligible for autovacuum?
> 
> You are right, this is not going to influence a faster startup of a
> worker, so we could remove that entirely.  On closer look, the main
> bottleneck is that the test is spending a lot of time waiting on the
> naptime even if we are using the minimum value of 1s, as the scan of
> pg_stat_activity checking for workers keeps looping.

I suppose it would be silly to allow even lower values for
autovacuum_naptime (e.g., by moving it to ConfigureNamesReal and setting
the minimum to 0.1).

> I have one trick in my sleeve for this one to make the launcher more
> responsive in checking the timestamps of the database list.  That's
> not perfect, but it reduces the wait behind the worker startups by
> 400ms (1s previously as of the naptime, 600ms now) with a reload to
> set the launcher's latch after the injection point has been attached.
> The difference in runtime is noticeable.

That's a neat trick.  I was confused why this test generates an autovacuum
worker at all, but I now see that you are pausing it before we even gather
the list of tables that need to be vacuumed.

>> Is it worth testing cancellation, too?
> 
> The point is to check after pg_signal_backend, so I am not sure it is
> worth the extra cycles for the cancellation.

Agreed.

> What do you think?

Looks reasonable to me.

-- 
nathan




Re: Amcheck verification of GiST and GIN

2024-07-12 Thread Tomas Vondra
OK, one more comment - it seems the 0001 patch has incorrect indentation
in bt_index_check_callback, triggering this:

--
verify_nbtree.c: In function ‘bt_index_check_callback’:
verify_nbtree.c:331:25: warning: this ‘if’ clause does not guard...
[-Wmisleading-indentation]
  331 | if (indrel->rd_opfamily[i] ==
INTERVAL_BTREE_FAM_OID)
  | ^~
In file included from ../../src/include/postgres.h:46,
 from verify_nbtree.c:24:
../../src/include/utils/elog.h:142:9: note: ...this statement, but the
latter is misleadingly indented as if it were guarded by the ‘if’
  142 | do { \
  | ^~
../../src/include/utils/elog.h:164:9: note: in expansion of macro
‘ereport_domain’
  164 | ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__)
  | ^~
verify_nbtree.c:333:33: note: in expansion of macro ‘ereport’
  333 | ereport(ERROR,
  | ^~~
--

This seems to be because the ereport() happens to be indented as if it
was in the "if", but should have been at the "for" level.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Can't find bugs to work on

2024-07-12 Thread David G. Johnston
On Fri, Jul 12, 2024 at 8:44 AM Tom Lane  wrote:

>
> As this example shows, it's now standard for PG commit log messages
> to include a link to the relevant email thread, so all you need
> is the message-ID of the first message in the thread to search
> the commit log with.
>
>
Cross-posting my comment from Hackers Discord:

"""
The specific request here seems fairly doable as something we could offer
as a custom search mode on the mailing list website.  For a given time
period show me all messages that began a new message thread where there are
zero replies to the message.  Apply to the -bugs list and you have at least
most of what is being asked for.  [...]
"""

David J.


Re: Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?

2024-07-12 Thread Nathan Bossart
On Fri, Jul 12, 2024 at 12:01:11PM +0200, Antonin Houska wrote:
> Nathan Bossart  wrote:
>> My concern with switching them to bsearch() would be the performance impact
>> of using a function pointer for the comparisons.  Perhaps we could add a
>> couple of inlined binary search implementations for common types to replace
>> many of the open-coded ones.
> 
> bsearch() appears to be used widely, but o.k., I don't insist on using it to
> replace the existing open-coded searches.

Sorry, I didn't mean to say that I was totally against switching to
bsearch(), but I do think we need to understand whether there is any
performance impact before doing so, especially in hot paths.  It would be
nice if we could reduce the number of open-coded binary searches in some
fashion, and if we can maintain or improve performance by creating a
handful of static inline functions, that would be even better.  If there's
no apparent performance difference, bsearch() is probably fine.
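To make the trade-off concrete, here is a sketch of the two forms under
discussion -- an open-coded, inlinable binary search with direct comparisons
versus bsearch() with a comparator function pointer (the helper names are
illustrative, not existing PostgreSQL code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* open-coded search: comparisons are direct, so the compiler can inline
 * and optimize the whole loop; this is the kind of helper a handful of
 * static inline functions could provide */
static inline int
int32_bsearch(int32_t key, const int32_t *arr, int nelem)
{
	int			lo = 0,
				hi = nelem - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (arr[mid] == key)
			return mid;
		if (arr[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;
}

/* comparator required by bsearch(); the indirect call per comparison is
 * the suspected performance cost in hot paths */
static int
cmp_int32(const void *a, const void *b)
{
	int32_t		x = *(const int32_t *) a;
	int32_t		y = *(const int32_t *) b;

	return (x > y) - (x < y);
}
```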

-- 
nathan




Re: Can't find bugs to work on

2024-07-12 Thread Tom Lane
Jacob Champion  writes:
> On Fri, Jul 12, 2024 at 7:59 AM Mohab Yaser
>  wrote:
>> So if anyone faced the same problem please let me know.

> I agree that it's rough at the moment. I don't pretend to have any
> solutions, but you might check the Bug Fixes section of the current
> Commitfest, to help review:
> https://commitfest.postgresql.org/48/

Certainly, the CF app is a good place to find out about bugs that
someone else is already working on.  As far as checking the status
of reports in the bugs mailing list goes, we don't have a tracker
for that (yes, it's been discussed).  If the mailing list thread
about a bug doesn't make the status obvious, you could search the
commit log to see if the thread is referenced anywhere.  For
example, this set of commits closed out bug #18467:

-
Author: Etsuro Fujita 
Branch: master Release: REL_17_BR [8cfbac149] 2024-06-07 17:45:00 +0900
Branch: REL_16_STABLE [8405d5a37] 2024-06-07 17:45:02 +0900
Branch: REL_15_STABLE [b33c141cc] 2024-06-07 17:45:04 +0900
Branch: REL_14_STABLE [269e2c391] 2024-06-07 17:45:06 +0900
Branch: REL_13_STABLE [2b461efc5] 2024-06-07 17:45:08 +0900

postgres_fdw: Refuse to send FETCH FIRST WITH TIES to remote servers.

Previously, when considering LIMIT pushdown, postgres_fdw failed to
check whether the query has this clause, which led to pushing false
LIMIT clauses, causing incorrect results.

This clause has been supported since v13, so we need to do a
remote-version check before deciding that it will be safe to push such a
clause, but we do not currently have a way to do the check (without
accessing the remote server); disable pushing such a clause for now.

Oversight in commit 357889eb1.  Back-patch to v13, where that commit
added the support.

Per bug #18467 from Onder Kalaci.

Patch by Japin Li, per a suggestion from Tom Lane, with some changes to
the comments by me.  Review by Onder Kalaci, Alvaro Herrera, and me.

Discussion: https://postgr.es/m/18467-7bb89084ff03a08d%40postgresql.org
-

(This output is from our tool src/tools/git_changelog, which
I like because it knows how to collapse similar commits across
various branches into one entry.)

As this example shows, it's now standard for PG commit log messages
to include a link to the relevant email thread, so all you need
is the message-ID of the first message in the thread to search
the commit log with.

regards, tom lane




Re: Can't find bugs to work on

2024-07-12 Thread Jacob Champion
On Fri, Jul 12, 2024 at 7:59 AM Mohab Yaser
 wrote:
>
> So if anyone faced the same problem please let me know.

I agree that it's rough at the moment. I don't pretend to have any
solutions, but you might check the Bug Fixes section of the current
Commitfest, to help review:

https://commitfest.postgresql.org/48/

or our TODO list:

https://wiki.postgresql.org/wiki/Todo

or the Live Issues section of our v17 release items:

https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Live_issues

(but that last one is unlikely to be beginner-friendly). None of these
are ideal for your use case, to be honest, but maybe someone else here
has a better idea.

HTH,
--Jacob




Re: Addressing SECURITY DEFINER Function Vulnerabilities in PostgreSQL Extensions

2024-07-12 Thread Robert Haas
On Mon, Jun 24, 2024 at 3:10 PM Jeff Davis  wrote:
> A special search_path variable "$extension_schema" would be a better
> solution to this problem. We need something like that anyway, in case
> the extension is relocated, so that we don't have to dig through the
> catalog and update all the settings.

+1. I think we need that in order for this proposal to make any progress.

And it seems like the patch has something like that, but I don't
really understand exactly what's going on here, because it's also got
hunks like this in a bunch of places:

+ if (strcmp($2, "$extension_schema") == 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("search_path cannot be set to $extension_schema"),
+ parser_errposition(@2)));

So it seems like the patch is trying to support $extension_schema in
some cases but not others, which seems like an odd choice that, as far
as I can see, is not explained anywhere: not in the commit message,
not in the comments (which are pretty minimally updated by the patch),
and not in the documentation (which the patch doesn't update at all).

Ashutosh, can we please get some of that stuff updated, or at least a
clearer explanation of what's going on with $extension_schema here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Can't find bugs to work on

2024-07-12 Thread Mohab Yaser
Hey there,

I am new to contributing in PostgreSQL and I've just installed the source
code and built the system successfully from the source code and trying to
find a bug to start working on, but I face some problems with the pgsql-bugs
mailing list: I can't filter reports by which are solved and which are not,
which are assigned to somebody and which are not, or which are good first
issues and which are advanced. So I am totally lost in this mailing list,
with nearly no useful information to help me pick an issue and start
working on it to submit my first patch. If anyone has faced the same problem,
please let me know.

Thanks.


Re: Send duration output to separate log files

2024-07-12 Thread Greg Sabino Mullane
On Thu, Jul 11, 2024 at 6:47 AM Alastair Turner  wrote:

>  The other category of logging which would benefit from a separate file is
> audit. It also can create massive volumes of log content. Splitting audit
> information off into a separate file for use by a separate team or function
> is also a request I have heard from some financial institutions adopting
> Postgres. With audit being provided by an extension, this would become
> quite an intrusive change.
>

Thanks for the feedback. I agree pgaudit is another thing that can create
massive log files, and should be solved at some point.  However, I wanted
to keep this patch self-contained to in-core stuff. And pgaudit is already
an odd duck, in that it puts CSV into your stderr stream (and into your
json!). Ideally it would put a single CSV stream into a separate csv file.
Perhaps even something that did not necessarily live in log_directory.

Would an extension be able to safely modify the message_type field you have
>> added using emit_log_hook? If so, the field becomes more of a log
>> destination label than a type marker. If an extension could hook into the
>> log file creation/rotation process, that would be a nice basis for enabling
>> extensions (I'm particularly thinking of pgAudit) to manage separate
>> logging destinations.
>>
>
Yes, I had more than duration in mind when I created errmessagetype. A hook
to set it would be the obvious next step, and then some sort of way of
mapping that to arbitrary log files. But I see that as mostly orthogonal to
this patch (and certainly a much larger endeavor).

(wades in anyways). I'm not sure about hooking into the log rotation
process so much as registering something on startup, then letting Postgres
handle all the log files in the queue. Although as I alluded to above,
sometimes having large log files NOT live in the data directory (or more
specifically, not hang out with the log_directory crowd), could be a plus
for space, efficiency, and security reasons. That makes log rotation
harder, however. And do we / should we put extension-driven logfiles into
current_logfiles? Do we still fall back to stderr even for extension logs?
Lots to ponder. :)

Cheers,
Greg


Re: CREATE OR REPLACE MATERIALIZED VIEW

2024-07-12 Thread Said Assemlal




That is expected because AccessExclusiveLock is acquired on the existing
matview.  This is also the case for CREATE OR REPLACE VIEW.


Right, had this case many times.



My initial idea, while writing the patch, was that one could replace the
matview without populating it and then run the concurrent refresh, like
this:

 CREATE OR REPLACE MATERIALIZED VIEW foo AS ... WITH NO DATA;
 REFRESH MATERIALIZED VIEW CONCURRENTLY foo;

But that won't work because concurrent refresh requires an already
populated matview.

Right now the patch either populates the replaced matview or leaves it
in an unscannable state.  Technically, it's also possible to skip the
refresh and leave the old data in place, perhaps by specifying
WITH *OLD* DATA.  New columns would just be null.  Of course you can't
tell if you got stale data without knowing how the matview was replaced.
Thoughts?



I believe the expectation is that a materialized view gets refreshed whenever 
it is replaced, so this is likely to confuse users?


Re: Removing unneeded self joins

2024-07-12 Thread Alexander Korotkov
On Fri, Jul 12, 2024 at 1:30 PM Alexander Korotkov  wrote:
> On Fri, Jul 12, 2024 at 6:05 AM Andrei Lepikhov  wrote:
> > On 7/11/24 14:43, jian he wrote:
> > > On Tue, Jul 9, 2024 at 2:06 PM Andrei Lepikhov  wrote:
> > >>
> > >> On 7/2/24 07:25, jian he wrote:
> > >>> to make sure it's correct, I have added a lot of tests,
> > >>> Some of this may be contrived, maybe some of the tests are redundant.
> > >> Thanks for your job!
> > >> I passed through the patches and have some notes:
> > >> 1. Patch 0001 has not been applied anymore since the previous week's
> > >> changes in the core. Also, there is one place with trailing whitespace.
> > >
> > > thanks.
> > > because the previous thread mentioned the EPQ problem.
> > > in remove_useless_self_joins, i make it can only process CMD_SELECT query.
> > I would like to oppose here: IMO, it is just a mishap which we made
> > because of a long history of patch transformations. There we lost the
> > case where RowMark exists for only one of candidate relations.
> > Also, after review I think we don't need so many new tests. Specifically
> > for DML we already have one:
> >
> > EXPLAIN (COSTS OFF)
> > UPDATE sj sq SET b = 1 FROM sj as sz WHERE sq.a = sz.a;
> >
> > And we should just add something to elaborate it a bit.
> > See the patch in attachment containing my proposal to improve v4-0001
> > main SJE patch. I think it resolved the issue with EPQ assertion as well
> > as problems with returning value.
>
> I tried this.  I applied 0001 from [1] and 0002 from [2].  Then I
> tried the concurrent test case [3].  It still fails with assert for
> me.  But assert and related stuff is the least problem.  The big
> problem, as described in [3], is semantical change in query.  When EPQ
> is applied, we fetch the latest tuple of the target relation
> regardless snapshot.  But for the self-joined relation we should still
> use the snapshot-satisfying tuple.  I don't see even attempt to
> address this in your patch.  And as I pointed before, this appears
> quite complex.

Oh, sorry, I used wrong binaries during the check.  My test case works
correctly, because SJE doesn't apply to the target relation.

# explain update test set val = t.val + 1 from test t where test.id = t.id;
 QUERY PLAN
-
 Update on test  (cost=60.85..105.04 rows=0 width=0)
   ->  Hash Join  (cost=60.85..105.04 rows=2260 width=16)
 Hash Cond: (test.id = t.id)
 ->  Seq Scan on test  (cost=0.00..32.60 rows=2260 width=10)
 ->  Hash  (cost=32.60..32.60 rows=2260 width=14)
   ->  Seq Scan on test t  (cost=0.00..32.60 rows=2260 width=14)
(6 rows)

Previously, the patch rejected applying SJE to the result relation, which as
I see now is wrong.  Andrei's patch rejects SJE for the target relation on
the basis of row marks, which seems correct to me at first glance.
So, this doesn't change my conclusions about
applying SJE to the target relation.  But Andrei's patch does look
good indeed.

--
Regards,
Alexander Korotkov
Supabase




Re: Pluggable cumulative statistics

2024-07-12 Thread Dmitry Dolgov
> On Thu, Jul 11, 2024 at 04:42:22PM GMT, Michael Paquier wrote:
>
> So we are down to the remaining parts of the patch, and this is going
> to need a consensus about a few things because this impacts the
> developer experience when implementing one's own custom stats:
> - Are folks OK with the point of fixing the kind IDs in time like
> RMGRs with a control in the wiki?  Or should a more artistic approach
> be used like what I am mentioning at the bottom of [1].  The patch
> allows a range of IDs to be used, to make the access to the stats
> faster even if some area of memory may not be used.

I think it's fine. Although this solution feels a bit uncomfortable,
after thinking back and forth I don't see any significantly better
option. Worth noting that since the main goal is to maintain uniqueness,
fixing the kind IDs could be accomplished in more than one way, with
varying amount of control over the list of custom IDs:

* One could say "let's keep it in the wiki and let the community organize
  itself somehow", and be done.

* Another way would be to keep it in wiki, and introduce some
  maintenance rules, e.g. once per release someone is going to cleanup
  the list from old unmaintained extensions, correct errors if needed,
  etc. Not sure if such cleanup would be needed, but it's not impossible
  to imagine.

* Even more closed option would be to keep the kind IDs in some separate
  git repository, and let committers add new records on demand,
  expressed via some request form.

As far as I understand the current proposal is about the first option,
on one side of the spectrum.

> - The fixed-numbered custom stats kinds are stored in an array in
> PgStat_Snapshot and PgStat_ShmemControl, so as we have something
> consistent with the built-in kinds.  This makes the tracking of the
> validity of the data in the snapshots split into parts of the
> structure for builtin and custom kinds.  Perhaps there are better
> ideas than that?  The built-in fixed-numbered kinds have no
> redirection.

Are you talking about this pattern?

   if (pgstat_is_kind_builtin(kind))
   ptr = // get something from a snapshot/shmem by offset
   else
   ptr = // get something from a custom_data by kind

Maybe it would be possible to hide it behind some macros or an inlinable
function with the offset and kind as arguments (and one of them will not
be used)?
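Something along those lines, perhaps -- a toy sketch (all structure and
function names below are made up, not the real pgstats layout) of a single
inline accessor that hides the builtin/custom redirection behind one call:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned PgStat_Kind;
#define PGSTAT_KIND_BUILTIN_MAX 32

/* hypothetical snapshot: built-in fixed-numbered kinds live at fixed
 * offsets in one array, custom kinds behind per-kind pointers */
typedef struct Snapshot
{
	char		builtin_fixed[PGSTAT_KIND_BUILTIN_MAX][64];
	void	   *custom_data[16];
} Snapshot;

static inline int
kind_is_builtin(PgStat_Kind kind)
{
	return kind < PGSTAT_KIND_BUILTIN_MAX;
}

/* single accessor: callers never see whether a kind is built-in (direct
 * offset, no redirection) or custom (one pointer indirection) */
static inline void *
snapshot_fixed_ptr(Snapshot *snap, PgStat_Kind kind)
{
	if (kind_is_builtin(kind))
		return snap->builtin_fixed[kind];
	return snap->custom_data[kind - PGSTAT_KIND_BUILTIN_MAX];
}
```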




Re: Flush pgstats file during checkpoints

2024-07-12 Thread Bertrand Drouvot
Hi,

On Fri, Jul 12, 2024 at 12:10:26PM +, Bertrand Drouvot wrote:
> Need to spend more time and thoughts on 0002+.

I think there is a corner case, say:

1. shutdown checkpoint at LSN1
2. startup->reads the stat file (contains LSN1)->all good->read stat file and
remove it
3. crash (no checkpoint occurred between 2. and 3.) 
4. startup (last checkpoint is still LSN1)->no stat file (as removed in 2.) 

In that case we start with empty stats.

Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Optimize mul_var() for var1ndigits >= 8

2024-07-12 Thread Dean Rasheed
On Sun, 7 Jul 2024 at 20:46, Joel Jacobson  wrote:
>
> This patch adds a mul_var_large() that is dispatched to from mul_var()
> for var1ndigits >= 8, regardless of rscale.
>
> -- var1ndigits == var2ndigits == 16384
> SELECT COUNT(*) FROM n_max WHERE product = var1 * var2;
> Time: 3191.145 ms (00:03.191) -- HEAD
> Time: 1836.404 ms (00:01.836) -- mul_var_large
>

That's pretty nice. For some reason though, this patch seems to
consistently make the numeric_big regression test a bit slower:

ok 224   - numeric_big   280 ms  [HEAD]
ok 224   - numeric_big   311 ms  [patch]

though I do get a lot of variability when I run that test.

I think this is related to this code:

+   res_ndigits = var1ndigits + var2ndigits + 1;
+   res_weight = var1->weight + var2->weight + 2 +
+((res_ndigits * 2) - (var1->ndigits + var2->ndigits + 1));
+   maxdigits = res_weight + 1 + (rscale + DEC_DIGITS - 1) / DEC_DIGITS +
+   MUL_GUARD_DIGITS;
+   res_ndigits = Min(res_ndigits, maxdigits);

In mul_var_large(), var1ndigits, var2ndigits, and res_ndigits are
counts of NBASE^2 digits, whereas maxdigits is a count of NBASE
digits, so it's not legit to compare them like that. I think it's
pretty confusing to use the same variable names as are used elsewhere
for different things.

I also don't like basically duplicating the whole of mul_var() in a
second function.

The fact that this is close to the current speed for numbers with
around 8 digits is encouraging though. My first thought was that if it
could be made just a little faster, maybe it could replace mul_var()
rather than duplicating it.

I had a go at that in the attached v2 patch, which now gives a very
nice speedup when running numeric_big:

ok 224   - numeric_big   159 ms  [v2 patch]

The v2 patch includes the following additional optimisations:

1). Use unsigned integers, rather than signed integers, as discussed
over in [1].

2). Attempt to fix the formulae incorporating maxdigits mentioned
above. This part really made my brain hurt, and I'm still not quite
sure that I've got it right. In particular, it needs double-checking
to ensure that it's not losing accuracy in the reduced-rscale case.

3). Instead of converting var1 to base NBASE^2 and storing it, just
compute each base-NBASE^2 digit at the point where it's needed, at the
start of the outer loop.

4). Instead of converting all of var2 to base NBASE^2, just convert
the part that actually contributes to the final result. That can make
a big difference when asked for a reduced-rscale result.

5). Use 32-bit integers instead of 64-bit integers to hold the
converted digits of var2.

6). Merge the final carry-propagation pass with the code to convert
the result back to base NBASE.

7). Mark mul_var_short() as pg_noinline. That seemed to make a
difference in some tests, where this patch seemed to cause the
compiler to generate slightly less efficient code for mul_var_short()
when it was inlined. In any case, it seems preferable to separate the
two, especially if mul_var_short() grows in the future.
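Optimisation 3 -- computing each base-NBASE^2 digit only at the point where
it's needed -- amounts to something like the following sketch (illustrative
only; it ignores the digit-alignment/weight-parity handling the real code
needs, and pair_digit() is a made-up name):

```c
#include <assert.h>
#include <stdint.h>

#define NBASE 10000

/* Combine two adjacent base-NBASE digits into one base-NBASE^2 digit on
 * demand, instead of converting and storing the whole operand up front.
 * Out-of-range positions read as zero. */
static inline uint32_t
pair_digit(const int16_t *digits, int ndigits, int i)
{
	uint32_t	hi = (2 * i < ndigits) ? (uint32_t) digits[2 * i] : 0;
	uint32_t	lo = (2 * i + 1 < ndigits) ? (uint32_t) digits[2 * i + 1] : 0;

	return hi * NBASE + lo;
}
```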

Overall, those optimisations made quite a big difference for large inputs:

-- var1ndigits1=16383, var2ndigits2=16383
call rate=23.991785  -- HEAD
call rate=35.19989   -- v1 patch
call rate=75.50344   -- v2 patch

which is now a 3.1x speedup relative to HEAD.

For small inputs (above mul_var_short()'s 4-digit threshold), it's
pretty-much performance-neutral:

-- var1ndigits1=5, var2ndigits2=5
call rate=6.045675e+06   -- HEAD
call rate=6.1231815e+06  -- v2 patch

which is pretty-much in the region of random noise. It starts to
become a more definite win for anything larger (in either input):

-- var1ndigits1=5, var2ndigits2=10
call rate=5.437945e+06   -- HEAD
call rate=5.6906255e+06  -- v2 patch

-- var1ndigits1=6, var2ndigits2=6
call rate=5.748427e+06   -- HEAD
call rate=5.953112e+06   -- v2 patch

-- var1ndigits1=7, var2ndigits2=7
call rate=5.3638645e+06  -- HEAD
call rate=5.700681e+06   -- v2 patch

What's less clear is whether there are any platforms for which this
makes it significantly slower.

I tried testing with SIMD disabled, which ought to not affect the
small-input cases, but actually seemed to favour the patch slightly:

-- var1ndigits1=5, var2ndigits2=5 [SIMD disabled]
call rate=6.0269715e+06  -- HEAD
call rate=6.2982245e+06  -- v2 patch

For large inputs, disabling SIMD makes everything much slower, but it
made the relative difference larger:

-- var1ndigits1=16383, var2ndigits2=16383 [SIMD disabled]
call rate=8.24175  -- HEAD
call rate=36.3987   -- v2 patch

which is now 4.3x faster with the patch.

Then I tried compiling with -m32, and unfortunately this made the
patch slower than HEAD for small inputs:

-- var1ndigits1=5, var2ndigits2=5 [-m32, SIMD disabled]
call rate=5.052332e+06  -- HEAD
call rate=3.883459e+06  -- v2 patch

-- var1ndigits1=6, var2ndigits2=6 [-m32, SIMD disabled]
call 

Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-12 Thread Masahiko Sawada
On Thu, Jul 11, 2024 at 8:14 PM Fujii Masao  wrote:
>
>
>
> On 2024/07/10 22:35, Masahiko Sawada wrote:
> > BTW the new regression tests don't check the table and index names.
> > Isn't it better to show table and index names for better
> > diagnosability?
>
> Sounds good to me. I've updated the patch as suggested.
> Please see the attached patch.
>

Thank you for updating the patch! LGTM.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Amcheck verification of GiST and GIN

2024-07-12 Thread Tomas Vondra
On 7/10/24 18:01, Tomas Vondra wrote:
> ...
>
> That's all for now. I'll add this to the stress-testing tests of my
> index build patches, and if that triggers more issues I'll report those.
> 

As mentioned a couple of days ago, I started using this patch to validate
the patches adding parallel builds to GIN and GiST indexes - I wrote
scripts to stress-test the builds, and I added the new amcheck functions as
another validation step.

For GIN indexes it didn't find anything new (in either this or my
patches), aside from the assert crash I already reported.

But for GiST it turned out to be very valuable - it did actually find an
issue in my patches, or rather confirm my hypothesis that the way the
patch generates fake LSN may not be quite right.

In particular, I've observed these two issues:

  ERROR:  heap tuple (13315,38) from table "planet_osm_roads" lacks
  matching index tuple within index "roads_7_1_idx"

  ERROR:  index "roads_7_7_idx" has inconsistent records on page 23723
  offset 113

And those consistency issues are real - I've managed to reproduce issues
with incorrect query results (by comparing the results to an index built
without parallelism).

So that's nice - it shows the value of this patch, and I like it.

One thing I've been wondering about is that currently amcheck (in
general, not just these new GIN/GiST functions) errors out on the first
issue, because it does ereport(ERROR). Which is good enough to decide if
there is some corruption, but a bit inconvenient if you need to assess
how much corruption there is. For example, when investigating the issue
in my patch it would have been great to know if there's just one broken
page, or if there are dozens/hundreds/thousands of them.

I'd imagine we could have a flag which says whether to fail on the first
issue, or keep looking at future pages. Essentially, whether to do
ereport(ERROR) or ereport(WARNING). But maybe that's a dead-end, and
once we find the first issue it's futile to inspect the rest of the
index, because it can be garbage. Not sure. In any case, it's not up to
this patch to invent that.

I don't have additional comments, the patch seems to be clean and likely
ready to go. There's a couple committers already involved in this
thread, I wonder if one of them already planned to take care of this?
Peter and Andres, either of you interested in this?

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Flush pgstats file during checkpoints

2024-07-12 Thread Bertrand Drouvot
Hi,

On Fri, Jul 12, 2024 at 03:42:21PM +0900, Michael Paquier wrote:
> On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:
> > On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
> >> I think those are two independent issues - knowing that the snapshot is
> >> from the last checkpoint, and knowing that it's correct (not corrupted).
> >> And yeah, we should be careful about fsync/durable_rename.
> > 
> > Yeah, that's bugging me as well.  I don't really get why we would not
> > want durability at shutdown for this data.  So how about switching the
> > end of pgstat_write_statsfile() to use durable_rename()?  Sounds like
> > an independent change, worth on its own.
> 
> Please find attached a rebased patch set with the durability point
> addressed in 0001.  There were also some conflicts.

Thanks!

Looking at 0001:

+   /* error logged already */

Maybe mention it's already logged by durable_rename() (like it's done in
InstallXLogFileSegment(), BaseBackup() for example).

Except this nit, 0001 LGTM.

Need to spend more time and thoughts on 0002+.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: race condition when writing pg_control

2024-07-12 Thread Noah Misch
On Sat, May 18, 2024 at 05:29:12PM +1200, Thomas Munro wrote:
> On Fri, May 17, 2024 at 4:46 PM Thomas Munro  wrote:
> > The specific problem here is that LocalProcessControlFile() runs in
> > every launched child for EXEC_BACKEND builds.  Windows uses
> > EXEC_BACKEND, and Windows' NTFS file system is one of the two file
> > systems known to this list to have the concurrent read/write data
> > mashing problem (the other being ext4).

> First idea I've come up with to avoid all of that: pass a copy of
> the "proto-controlfile", to coin a term for the one read early in
> postmaster startup by LocalProcessControlFile().  As far as I know,
> the only reason we need it is to suck some settings out of it that
> don't change while a cluster is running (mostly can't change after
> initdb, and checksums can only be {en,dis}abled while down).  Right?
> Children can just "import" that sucker instead of calling
> LocalProcessControlFile() to figure out the size of WAL segments yada
> yada, I think?  Later they will attach to the real one in shared
> memory for all future purposes, once normal interlocking is allowed.

I like that strategy, particularly because it recreates what !EXEC_BACKEND
backends inherit from the postmaster.  It might prevent future bugs that would
have been specific to EXEC_BACKEND.

> I dunno.  Draft patch attached.  Better plans welcome.  This passes CI
> on Linux systems afflicted by EXEC_BACKEND, and Windows.  Thoughts?

Looks reasonable.  I didn't check over every detail, given the draft status.




Partition-wise join with whole row vars

2024-07-12 Thread Alexander Pyhalov

Hello.

I was looking at enabling partition-wise join with whole-row vars. My
main motivation was to enable pushing down DML with partition-wise join
in postgres_fdw. The work is based on the earlier patches of Ashutosh
Bapat [1].

Partition-wise joins are currently disabled when whole-row vars of the
joined relations need to be selected. Using whole-row vars is not
frequent by itself, but it happens when an FDW is used in DML or SELECT
FOR UPDATE / FOR SHARE queries. I'd like to remove this restriction,
because it will make direct-modify DML joins possible in postgres_fdw.

Currently there's an issue in PostgreSQL that when we want to get a
whole-row var of a child table, we have to translate the child table's
record type to the parent table's record type. So additional
ConvertRowtypeExprs appear to cast the types, and different places in
the planner (and in postgres_fdw) now have to know about this.


Here's the series of patches, originally proposed by Ashutosh Bapat,
rebased and updated to work with the current master.


What was done:

1) Reverting the changes that forbid PWJ when whole-row vars are
required. This also restores the logic in setrefs.c necessary to deal
with ConvertRowtypeExprs.

2) Applying a modified patch from the original series to handle
ConvertRowtypeExprs in pull_var_clause(). Unlike the original patch,
the default pull_var_clause_walker() behavior is unchanged. However,
when the PVC_INCLUDE_CONVERTROWTYPES flag is specified, converted
whole-row references are returned as is and are not recursed into.
This is done to avoid modifying all consumers of the function.

3) Modified one of the original patches to handle ConvertRowtypeExpr
in find_computable_ec_member().

4) The next patch modifies the logic to search for ConvertRowtypeExpr
in search_indexed_tlist_for_non_var() - at least for rowid vars,
varnullingrels can be omitted, so we avoid looking at them when
comparing different ConvertRowtypeExprs. It seems to be working as
expected, but I'm not entirely sure about this. Perhaps we should also
consider context->nrm_match?

5) The next patch is the original one - pulling up ConvertRowtypeExpr
in build_tlist_to_deparse() and deparsing ConvertRowtypeExpr.

6) The rest is a fix for postgres_fdw to correctly deparse DML.
The first tricky part here is that arbitrary relations can appear in
the UPDATE SET clause, and currently there's no API to handle this,
so I've modified get_translated_update_targetlist() to do the proper
adjustments. The second is working with RETURNING lists - we should
set tableoids for the returned values. This is done by modifying the
returning filters to save this information and later set it in the
returned tuples.

What this gives us is the possibility of partition-wise joins for
foreign DML, like the following:


EXPLAIN (COSTS OFF, VERBOSE)
DELETE FROM fprt1 USING fprt2 WHERE fprt1.a = fprt2.b AND fprt2.a % 30 = 
29;


QUERY PLAN
--
 Delete on public.fprt1
   Foreign Delete on public.ftprt1_p1 fprt1_1
   Foreign Delete on public.ftprt1_p2 fprt1_2
->  Append
  ->  Foreign Delete
   Remote SQL: DELETE FROM public.fprt1_p1 r3 USING 
public.fprt2_p1 r5 WHERE ((r3.a = r5.b)) AND (((r5.a % 30) = 29))

  ->  Foreign Delete
   Remote SQL: DELETE FROM public.fprt1_p2 r4 USING 
public.fprt2_p2 r6 WHERE ((r4.a = r6.b)) AND (((r6.a % 30) = 29))



[1] 
https://www.postgresql.org/message-id/CAFjFpRc8ZoDm0%2Bzhx%2BMckwGyEqkOzWcpVqbvjaxwdGarZSNrmA%40mail.gmail.com

--
Best regards,
Alexander Pyhalov,
Postgres ProfessionalFrom e9695c968d2705b0c8986dfbb3fd2409e22acb63 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Wed, 27 Dec 2023 17:31:23 +0300
Subject: [PATCH 6/6] postgres_fdw: fix partition-wise DML

---
 contrib/postgres_fdw/deparse.c|  12 +-
 .../postgres_fdw/expected/postgres_fdw.out| 160 ++
 contrib/postgres_fdw/postgres_fdw.c   |  98 ++-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  38 +
 src/backend/optimizer/util/appendinfo.c   |  32 +++-
 src/include/optimizer/appendinfo.h|   1 +
 6 files changed, 323 insertions(+), 18 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index f879ff3100f..d4c657fa4d5 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -2309,7 +2309,7 @@ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
 
 	appendStringInfoString(buf, "UPDATE ");
 	deparseRelation(buf, rel);
-	if (foreignrel->reloptkind == RELOPT_JOINREL)
+	if (IS_JOIN_REL(foreignrel))
 		appendStringInfo(buf, " %s%d", REL_ALIAS_PREFIX, rtindex);
 	appendStringInfoString(buf, " SET ");
 
@@ -2336,7 +2336,7 @@ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
 
 	

Re: Removing unneeded self joins

2024-07-12 Thread Alexander Korotkov
Hi, Andrei!

On Fri, Jul 12, 2024 at 6:05 AM Andrei Lepikhov  wrote:
>
> On 7/11/24 14:43, jian he wrote:
> > On Tue, Jul 9, 2024 at 2:06 PM Andrei Lepikhov  wrote:
> >>
> >> On 7/2/24 07:25, jian he wrote:
> >>> to make sure it's correct, I have added a lot of tests,
> >>> Some of this may be contrived, maybe some of the tests are redundant.
> >> Thanks for your job!
> >> I passed through the patches and have some notes:
> >> 1. Patch 0001 has not been applied anymore since the previous week's
> >> changes in the core. Also, there is one place with trailing whitespace.
> >
> > thanks.
> > because the previous thread mentioned the EPQ problem.
> > in remove_useless_self_joins, i make it can only process CMD_SELECT query.
> I would like to oppose here: IMO, it is just a mishap which we made
> because of a long history of patch transformations. There we lost the
> case where RowMark exists for only one of candidate relations.
> Also, after review I think we don't need so many new tests. Specifically
> for DML we already have one:
>
> EXPLAIN (COSTS OFF)
> UPDATE sj sq SET b = 1 FROM sj as sz WHERE sq.a = sz.a;
>
> And we should just add something to elaborate it a bit.
> See the patch in attachment containing my proposal to improve v4-0001
> main SJE patch. I think it resolved the issue with EPQ assertion as well
> as problems with returning value.

I tried this.  I applied 0001 from [1] and 0002 from [2].  Then I
tried the concurrent test case [3].  It still fails with an assert for
me.  But the assert and related stuff is the least problem.  The big
problem, as described in [3], is a semantic change in the query.  When EPQ
is applied, we fetch the latest tuple of the target relation
regardless of the snapshot.  But for the self-joined relation we should still
use the snapshot-satisfying tuple.  I don't see even an attempt to
address this in your patch.  And as I pointed out before, this appears
quite complex.

Links.
1. 
https://www.postgresql.org/message-id/96250a42-20e3-40f0-9d45-f53ae852f8ed%40gmail.com
2. 
https://www.postgresql.org/message-id/5b49501c-9cb3-4c5d-9d56-49704ff08143%40gmail.com
3. 
https://www.postgresql.org/message-id/CAPpHfduM6X82ExT0r9UzFLJ12wOYPvRw5vT2Htq0gAPBgHhKeQ%40mail.gmail.com

--
Regards,
Alexander Korotkov
Supabase




Re: Flush pgstats file during checkpoints

2024-07-12 Thread Bertrand Drouvot
Hi,

On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:
> On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
> > I think those are two independent issues - knowing that the snapshot is
> > from the last checkpoint, and knowing that it's correct (not corrupted).
> > And yeah, we should be careful about fsync/durable_rename.
> 
> Yeah, that's bugging me as well.  I don't really get why we would not
> want durability at shutdown for this data.  So how about switching the
> end of pgstat_write_statsfile() to use durable_rename()?  Sounds like
> an independent change, worth on its own.
> 
> > Yeah, I was wondering about the same thing - can't this mean we fail to
> > start autovacuum? Let's say we delete a significant fraction of a huge
> > table, but then we crash before the next checkpoint. With this patch we
> > restore the last stats snapshot, which can easily have
> > 
> > n_dead_tup  | 0
> > n_mod_since_analyze | 0
> > 
> > for the table. And thus we fail to notice the table needs autovacuum.
> > AFAIK we run autovacuum on all tables with missing stats (or am I
> > wrong?). That's what's happening on replicas after switchover/failover
> > too, right?
> 
> That's the opposite, please look at relation_needs_vacanalyze().  If a
> table does not have any stats found in pgstats, like on HEAD after a
> crash, then autoanalyze is skipped and autovacuum happens only for the
> anti-wrap case.
> 
> For the database case, rebuild_database_list() uses
> pgstat_fetch_stat_dbentry() three times, discards entirely databases
> that have no stats.  Again, there should be a stats entry post a
> crash upon a reconnection.
> 
> So there's an argument in saying that the situation does not get worse
> here and that we actually may improve odds by keeping a trace of the
> previous stats, no? 

I agree, we could still get autoanalyze/autovacuum skipped due to
obsolete/stale stats being restored from the last checkpoint, but that's
better than having them always skipped (current HEAD).

> In most cases, there would be a stats entry
> post-crash as an INSERT or a scan would have re-created it,

Yeap.

> but the
> stats would reflect the state registered since the last crash recovery
> (even on HEAD, a bulk delete bumping the dead tuple counter would not
> be detected post-crash).

Right.

> The case of spiky workloads may impact the 
> decision-making, of course, but at least we'd allow autovacuum to take
> some decision over giving up entirely based on some previous state of
> the stats.  That sounds like a win for me with steady workloads, less
> for bulby workloads..

I agree, and it is not worse (though not ideal) than what we currently have on HEAD.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: MERGE/SPLIT partition commands should create new partitions in the parent's tablespace?

2024-07-12 Thread Junwang Zhao
On Wed, Jul 10, 2024 at 9:36 PM Masahiko Sawada  wrote:
>
> On Wed, Jul 10, 2024 at 5:14 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 10, 2024 at 4:14 PM Fujii Masao  
> > wrote:
> > >
> > >
> > >
> > > On 2024/07/10 12:13, Masahiko Sawada wrote:
> > > > On Sat, Jul 6, 2024 at 4:06 PM Fujii Masao 
> > > >  wrote:
> > > >>
> > > >> Hi,
> > > >>
> > > >> I noticed that ALTER TABLE MERGE PARTITIONS and SPLIT PARTITION 
> > > >> commands
> > > >> always create new partitions in the default tablespace, regardless of
> > > >> the parent's tablespace. However, the indexes of these new partitions 
> > > >> inherit
> > > >> the tablespaces of their parent indexes. This inconsistency seems odd.
> > > >> Is this an oversight or intentional?
> > > >>
> > > >> Here are the steps I used to test this:
> > > >>
> > > >> ---
> > > >> CREATE TABLESPACE tblspc LOCATION '/tmp/tblspc';
> > > >> CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE tblspc)
> > > >> PARTITION BY RANGE (i) TABLESPACE tblspc;
> > > >>
> > > >> CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> > > >> CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
> > > >>
> > > >> ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
> > > >>
> > > >> SELECT tablename, tablespace FROM pg_tables WHERE tablename IN ('t', 
> > > >> 'tp_0_2') ORDER BY tablename;
> > > >>tablename | tablespace
> > > >> ---+
> > > >>t | tblspc
> > > >>tp_0_2| (null)
> > > >> (2 rows)
> > > >>
> > > >> SELECT indexname, tablespace FROM pg_indexes WHERE tablename IN ('t', 
> > > >> 'tp_0_2') ORDER BY indexname;
> > > >> indexname  | tablespace
> > > >> -+
> > > >>t_pkey  | tblspc
> > > >>tp_0_2_pkey | tblspc
> > > >> ---
> > > >>
> > > >>
> > > >> If it's an oversight, I've attached a patch to ensure these commands 
> > > >> create
> > > >> new partitions in the parent's tablespace.
> > > >
> > > > +1
> > > >
> > > > Since creating a child table through the CREATE TABLE statement sets
> > > > its parent table's tablespace as the child table's tablespace, it is
> > > > logical to set the parent table's tablespace as the merged table's
> > > > tablespace.
>
> One expectation I had for MERGE PARTITION was that if all partition
> tables to be merged are in the same tablespace, the merged table is
> also created in the same tablespace. But it would be an exceptional
> case in a sense, and I agree with the proposed behavior as it's
> consistent. It might be a good idea that we can specify the tablespace
> for each merged/split table in the future.

I agree this is a good idea, so I tried to support this feature.

The attached patch v3-0001 is exactly the same as v2-0001, v3-0002 is
a patch for specifying tablespace for each merged/split table.

I'm not sure this addressed David's concern about the tablespace choice
in ca4103025 though.

>
> BTW the new regression tests don't check the table and index names.
> Isn't it better to show table and index names for better
> diagnosability?
>
> +-- Check the new partition inherits parent's tablespace
> +CREATE TABLE t (i int PRIMARY KEY USING INDEX TABLESPACE regress_tblspace)
> +  PARTITION BY RANGE (i) TABLESPACE regress_tblspace;
> +CREATE TABLE tp_0_1 PARTITION OF t FOR VALUES FROM (0) TO (1);
> +CREATE TABLE tp_1_2 PARTITION OF t FOR VALUES FROM (1) TO (2);
> +ALTER TABLE t MERGE PARTITIONS (tp_0_1, tp_1_2) INTO tp_0_2;
> +SELECT tablespace FROM pg_tables WHERE tablename IN ('t', 'tp_0_2')
> ORDER BY tablespace;
> +tablespace
> +--
> + regress_tblspace
> + regress_tblspace
> +(2 rows)
> +
> +SELECT tablespace FROM pg_indexes WHERE tablename IN ('t', 'tp_0_2')
> ORDER BY tablespace;
> +tablespace
> +--
> + regress_tblspace
> + regress_tblspace
> +(2 rows)
> +
> +DROP TABLE t;
>
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
>
>


-- 
Regards
Junwang Zhao


v3-0002-support-specify-tablespace-for-each-merged-split-.patch
Description: Binary data


v3-0001-Ensure-MERGE-SPLIT-partition-commands-create-new-.patch
Description: Binary data


RE: Showing applied extended statistics in explain Part 2

2024-07-12 Thread Masahiro.Ikeda
Hi,

Thanks for working on the feature. As a user, I find it useful, and I'd
like to use it in v18! Although I've just started looking into it, I have
a few questions.


(1)

Would it be better to make the order of the output consistent? For
example, even though the same three clauses are shown in the case below,
their order does not match.
* "Filter" shows that "id1" is first.
* "Ext Stats" shows that "id2" is first.

-- An example
DROP TABLE IF EXISTS test;
CREATE TABLE test (id1 int2, id2 int4, id3 int8, value varchar(32));
INSERT INTO test (SELECT i%11, i%103, i%1009, 'hello' FROM 
generate_series(1,100) s(i));
create statistics test_s1 on id1, id2 from test; analyze;

=# EXPLAIN (STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR id2 > 10);
  QUERY PLAN
   
---
 Gather  (cost=1000.00..23092.77 rows=84311 width=20)
   Workers Planned: 2
   ->  Parallel Seq Scan on test  (cost=0.00..13661.67 rows=35130 width=20)
 Filter: ((id1 = 1) AND ((id2 = 2) OR (id2 > 10)))  
  -- here
 Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND 
(id1 = 1))-- here
(5 rows)



(2)

Do we really need the schema names without the VERBOSE option? As in the
above case, "Ext Stats" shows the schema name "public", even though the
table name "test" isn't shown with its schema name.

Additionally, if the VERBOSE option is specified, should the column names
also be printed with their namespace?

=# EXPLAIN (VERBOSE, STATS) SELECT * FROM test WHERE id1 = 1 AND (id2 = 2 OR 
id2 > 10);
  QUERY PLAN
   
---
 Gather  (cost=1000.00..22947.37 rows=82857 width=20)
   Output: id1, id2, id3, value
   Workers Planned: 2
   ->  Parallel Seq Scan on public.test  (cost=0.00..13661.67 rows=34524 
width=20)
 Output: id1, id2, id3, value
 Filter: ((test.id1 = 1) AND ((test.id2 = 2) OR (test.id2 > 10)))
 Ext Stats: public.test_s1  Clauses: (((id2 = 2) OR (id2 > 10)) AND 
(id1 = 1))   -- here
(7 rows)



(3)

I might be misunderstanding something, but do we need the clauses? Is there any
case where users would want to know the clauses? For example, wouldn't the
following be sufficient?

> Ext Stats: id1, id2 using test_s1



(4)

Extended statistics with the "dependencies" or "ndistinct" options don't seem
to be shown in the EXPLAIN output. Am I missing something? (Is this expected?)

I tested the examples in the documentation. Although it might work with
the "mcv" option, I can't confirm that it works because an "unrecognized
node type" error occurred in my environment.
https://www.postgresql.org/docs/current/sql-createstatistics.html

(I might be wrong since I'm a beginner with the extended stats code.)
IIUC, the reason is that the patch only handles 
statext_mcv_clauselist_selectivity(),
and doesn't handle dependencies_clauselist_selectivity() and 
estimate_multivariate_ndistinct().


-- doesn't work with "dependencies" option?
=# \dX
List of extended statistics
 Schema |  Name   | Definition | Ndistinct | Dependencies |   MCV   
+-++---+--+-
 public | s1  | a, b FROM t1   | (null)| defined  | (null)
(2 rows)

=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t1 WHERE (a = 1) AND (b = 0);
QUERY PLAN  
   
---
 Gather  (cost=1000.00..11685.00 rows=100 width=8) (actual time=0.214..50.327 
rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t1  (cost=0.00..10675.00 rows=42 width=8) (actual 
time=30.300..46.610 rows=33 loops=3)
 Filter: ((a = 1) AND (b = 0))
 Rows Removed by Filter: 00
 Planning Time: 0.246 ms
 Execution Time: 50.361 ms
(8 rows)

-- doesn't work with "ndistinct"?
=# \dX
 List of extended statistics
 Schema | Name |Definition  
  | Ndistinct | Dependencies |  MCV   
+--+--+---+--+
 public | s3   | date_trunc('month'::text, a), date_trunc('day'::text, a) FROM 
t3 | defined   | (null)   | (null)
(1 row)

postgres(437635)=# EXPLAIN (STATS, ANALYZE) SELECT * FROM t3
  WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
QUERY PLAN  
  

Re: Missed opportunity for bsearch() in TransactionIdIsCurrentTransactionId()?

2024-07-12 Thread Antonin Houska
Nathan Bossart  wrote:

> On Wed, Jul 10, 2024 at 05:00:13PM +0200, Antonin Houska wrote:
> > I don't quite understand why TransactionIdIsCurrentTransactionId() 
> > implements
> > binary search in ParallelCurrentXids "from scratch" instead of using
> > bsearch().
> 
> I believe there are a number of open-coded binary searches in the tree.

Not sure if there are many, but I could find some:

* TransactionIdIsCurrentTransactionId()

* RelationHasSysCache()

* pg_dump.c:getRoleName()

> My concern with switching them to bsearch() would be the performance impact
> of using a function pointer for the comparisons.  Perhaps we could add a
> couple of inlined binary search implementations for common types to replace
> many of the open-coded ones.

bsearch() appears to be used widely, but o.k., I don't insist on using it to
replace the existing open-coded searches.

What actually bothers me more than the absence of bsearch() is that
TransactionIdIsCurrentTransactionId() implements the binary search from
scratch. Even w/o bsearch(), it can still call TransactionIdInArray(). I ran
into the problem when working on [1], which adds one more XID array.

Does the attached patch seem worth being applied separately, or at all?

[1] https://www.postgresql.org/message-id/82651.1720540558%40antos

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 470a66d0086bd8b656f88c86aec9f5861a763df7 Mon Sep 17 00:00:00 2001
From: Antonin Houska 
Date: Fri, 12 Jul 2024 10:06:32 +0200
Subject: [PATCH] Refactor search in a sorted XID array.

First, avoid duplicate definition of TransactionIdInArray().

Second, prefer open-coded implementation to bsearch(): it's probably cheaper
to compare two XIDs directly than via a bsearch() callback.
---
 src/backend/access/heap/heapam_visibility.c   | 10 --
 src/backend/access/transam/xact.c | 29 ++---
 .../replication/logical/reorderbuffer.c   | 11 ---
 src/include/access/transam.h  | 32 +++
 4 files changed, 35 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 9243feed01..d4653a28db 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1559,16 +1559,6 @@ HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple)
 	return true;
 }
 
-/*
- * check whether the transaction id 'xid' is in the pre-sorted array 'xip'.
- */
-static bool
-TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
-{
-	return num > 0 &&
-		bsearch(&xid, xip, num, sizeof(TransactionId), xidComparator) != NULL;
-}
-
 /*
  * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
  * obeys.
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d119ab909d..795c04a879 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -961,33 +961,10 @@ TransactionIdIsCurrentTransactionId(TransactionId xid)
 
 	/*
 	 * In parallel workers, the XIDs we must consider as current are stored in
-	 * ParallelCurrentXids rather than the transaction-state stack.  Note that
-	 * the XIDs in this array are sorted numerically rather than according to
-	 * transactionIdPrecedes order.
+	 * ParallelCurrentXids rather than the transaction-state stack.
 	 */
-	if (nParallelCurrentXids > 0)
-	{
-		int			low,
-	high;
-
-		low = 0;
-		high = nParallelCurrentXids - 1;
-		while (low <= high)
-		{
-			int			middle;
-			TransactionId probe;
-
-			middle = low + (high - low) / 2;
-			probe = ParallelCurrentXids[middle];
-			if (probe == xid)
-return true;
-			else if (probe < xid)
-low = middle + 1;
-			else
-high = middle - 1;
-		}
-		return false;
-	}
+	if (TransactionIdInArray(xid, ParallelCurrentXids, nParallelCurrentXids))
+		return true;
 
 	/*
 	 * We will return true for the Xid of the current subtransaction, any of
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 00a8327e77..bd3edd0b47 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -5136,17 +5136,6 @@ ApplyLogicalMappingFile(HTAB *tuplecid_data, Oid relid, const char *fname)
  errmsg("could not close file \"%s\": %m", path)));
 }
 
-
-/*
- * Check whether the TransactionId 'xid' is in the pre-sorted array 'xip'.
- */
-static bool
-TransactionIdInArray(TransactionId xid, TransactionId *xip, Size num)
-{
-	return bsearch(&xid, xip, num,
-   sizeof(TransactionId), xidComparator) != NULL;
-}
-
 /*
  * list_sort() comparator for sorting RewriteMappingFiles in LSN order.
  */
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd..78739b2490 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -370,6 +370,38 @@ FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
 	

Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2024-07-12 Thread Aleksander Alekseev
Hi,

> A comment about v3-0001.
>
> -* If true, use long segment filenames formed from lower 48 bits of 
> the
> -* segment number, e.g. pg_xact/1234. Otherwise, use short
> -* filenames formed from lower 16 bits of the segment number e.g.
> -* pg_xact/1234.
> +* If true, use long segment filenames. Use short filenames otherwise.
> +* See SlruFileName().
>
> We're losing some details here even if SlruFileName() has some
> explanations, because one would need to read through the snprintf's
> 04X to know that short file names include 4 characters.  I'm OK to
> mention SlruFileName() rather than duplicate the knowledge here, but
> SlruFileName() should also be updated to mention the same level of
> details with some examples of file names.

Fair enough. Here is the updated patchset.


--
Best regards,
Aleksander Alekseev
From 0b8d67ab6cc582b356e639175f85d30982f2fb2c Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Wed, 26 Jun 2024 12:59:15 +0300
Subject: [PATCH v4 1/2] Refactor the comments about formatting SLRU segment
 filenames

The comment for SlruCtlData.long_segment_names was just wrong, an oversight
of commit 4ed8f0913bfd.

While on it, additionally clarify what SlruFileName() does and how it uses
SlruCtlData.long_segment_names.

Aleksander Alekseev. Reported by Noah Misch. Reviewed by Michael Paquier.
---
 src/backend/access/transam/slru.c | 22 ++
 src/include/access/slru.h | 15 +++
 2 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 77b05cc0a7..168df26065 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -72,6 +72,28 @@
 #include "storage/shmem.h"
 #include "utils/guc_hooks.h"
 
+/*
+ * Converts segment number to the filename of the segment.
+ *
+ * path: should point to a buffer at least MAXPGPATH characters long.
+ *
+ * If ctl->long_segment_names is true, segno can be in 0 .. (2^60-1) range and
+ * the resulting filename will be like:
+ *
+ *   slru_dir/123456789ABCDEF (15 characters)
+ *
+ * This is the recommended way of formatting for all new code.
+ *
+ * If ctl->long_segment_names is false, segno can be in 0 .. (2^24-1) range.
+ * The resulting filename will be one of:
+ *
+ *  slru_dir/1234     if    0 <= segno < 2^16
+ *  slru_dir/12345    if 2^16 <= segno < 2^20
+ *  slru_dir/123456   if 2^20 <= segno < 2^24
+ *
+ * This is the legacy way of formatting that is kept for backward compatibility.
+ * New code shouldn't use it.
+ */
 static inline int
 SlruFileName(SlruCtl ctl, char *path, int64 segno)
 {
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 8a8d191873..c836fa8534 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -134,10 +134,17 @@ typedef struct SlruCtlData
 	bits16		bank_mask;
 
 	/*
-	 * If true, use long segment filenames formed from lower 48 bits of the
-	 * segment number, e.g. pg_xact/1234. Otherwise, use short
-	 * filenames formed from lower 16 bits of the segment number e.g.
-	 * pg_xact/1234.
+	 * If true, use long segment filenames. Use short filenames otherwise.
+	 *
+	 * Each segment contains SLRU_PAGES_PER_SEGMENT pages, i.e.:
+	 *
+	 * Segment #Contains pages
+	 * 00 .. (SLRU_PAGES_PER_SEGMENT-1)
+	 * 1SLRU_PAGES_PER_SEGMENT .. (SLRU_PAGES_PER_SEGMENT*2-1)
+	 * ... etc ..
+	 *
+	 * The filename of the segment is formed from its number. For more details
+	 * regarding exact rules see SlruFileName().
 	 */
 	bool		long_segment_names;
 
-- 
2.45.2

From 6811fa964cce2f2f308793ca8d627d7c6dc3f1b2 Mon Sep 17 00:00:00 2001
From: Aleksander Alekseev 
Date: Wed, 26 Jun 2024 14:02:42 +0300
Subject: [PATCH v4 2/2] Use int64 for page numbers in clog.c & async.c

Oversight of 4ed8f0913bfd

Aleksander Alekseev, Noah Misch, Michael Paquier
---
 src/backend/access/transam/clog.c |  4 ++--
 src/backend/commands/async.c  | 22 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 44c253246b..e6f79320e9 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -445,7 +445,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	PGPROC	   *proc = MyProc;
 	uint32		nextidx;
 	uint32		wakeidx;
-	int			prevpageno;
+	int64		prevpageno;
 	LWLock	   *prevlock = NULL;
 
 	/* We should definitely have an XID whose status needs to be updated. */
@@ -577,7 +577,7 @@ TransactionGroupUpdateXidStatus(TransactionId xid, XidStatus status,
 	while (nextidx != INVALID_PROC_NUMBER)
 	{
 		PGPROC	   *nextproc = &ProcGlobal->allProcs[nextidx];
-		int			thispageno = nextproc->clogGroupMemberPage;
+		int64		thispageno = nextproc->clogGroupMemberPage;
 
 		/*
 		 * If the page to update belongs to a different bank than the previous

Re: Restart pg_usleep when interrupted

2024-07-12 Thread Alvaro Herrera
On 2024-Jul-11, Nathan Bossart wrote:

> I'm imagining something like this:
> 
> struct timespec delay;
> TimestampTz end_time;
> 
> end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);
> 
> do
> {
> long    secs;
> int     microsecs;
> 
> TimestampDifference(GetCurrentTimestamp(), end_time,
> &secs, &microsecs);
> 
> delay.tv_sec = secs;
> delay.tv_nsec = microsecs * 1000;
> 
> } while (nanosleep(, NULL) == -1 && errno == EINTR);

This looks nicer.  We could deal with clock drift easily (in case the
sysadmin winds the clock back) by testing that tv_sec+tv_nsec is not
higher than the initial time to sleep.  I don't know how common this
situation is nowadays, but I remember debugging a system years ago where
autovacuum was sleeping for a very long time because of that.  I can't
remember now if we did anything in the code to cope, or just told
sysadmins not to do that anymore :-)

FWIW my (Linux's) nanosleep() manpage contains this note:

   If the interval specified in req is not an exact multiple of the
   granularity underlying the clock (see time(7)), then the interval will
   be rounded up to the next multiple.  Furthermore, after the sleep
   completes, there may still be a delay before the CPU becomes free to
   once again execute the calling thread.

It's not clear to me what happens if the time to sleep is zero, so maybe
there should be a "if tv_sec == 0 && tv_nsec == 0 then break" statement
at the bottom of the loop, to quit without sleeping one more time than
needed.

For Windows, this [1] looks like an interesting and possibly relevant
read (though maybe SleepEx already does what we want to do here.)

[1] 
https://randomascii.wordpress.com/2020/10/04/windows-timer-resolution-the-great-rule-change/

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Having your biases confirmed independently is how scientific progress is
made, and hence made our great society what it is today" (Mary Gardiner)




Re: Things I don't like about \du's "Attributes" column

2024-07-12 Thread Rafia Sabih
On Fri, 12 Jul 2024 at 09:06, Pavel Luzanov  wrote:
>
> On 11.07.2024 15:07, Rafia Sabih wrote:
>
> This looks much better than the current version.
>
> Thank you, for looking into this.
>
> Only thing is, I find
> the column name Valid until confusing. With that name I am in danger
> of taking it as the role's validity and not the passwords'.
> How about naming it to something like Password validity...?
>
> Yes, my first attempt was to name this column "Password expiration date"
> for the same reason.
>
> But then I decided that the column name should match the attribute name.
> Otherwise, you need to make some effort to understand which columns
> of the table correspond to which attributes of the roles.
>
> It is also worth considering translation into other languages.
> If the role attribute and the column have the same name, then they will
> probably be translated the same way. But the translation may be different
> for different terms, which will further confuse the situation.
>
> We can probably change the column name, but still the root of the confusion
> is caused by the attribute name, not the column name.
>
> What do you think?
Yes, you are right about this. I too am of the opinion that column
names should match the attribute names as much as possible.
So it is good that way.

Another thought came to my mind: should we have this column in \du+
instead, and maybe the connection limit too?
I know the current version has all of this in \du itself, but there it
is all clubbed into one column. Now that our table has grown wider, it
might be more aesthetic to move these to the extended version. Also,
usage-wise, they might not be the first thing one looks at for a
user/role.

What are your thoughts on that?
>
> --
> Pavel Luzanov
> Postgres Professional: https://postgrespro.com



-- 
Regards,
Rafia Sabih




Re: pg_rewind WAL segments deletion pitfall

2024-07-12 Thread Alexander Kukushkin
Hi Sutou,

Thank you for picking it up!

On Fri, 12 Jul 2024 at 09:24, Sutou Kouhei  wrote:

Here are my review comments:
>
> @@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr
> forkptr, int tliIndex,
> +   char    xlogfname[MAXFNAMELEN];
> +
> +   tli = xlogreader->seg.ws_tli;
> +   segno = xlogreader->seg.ws_segno;
> +
> +   snprintf(xlogfname, MAXPGPATH, XLOGDIR "/");
> +   XLogFileName(xlogfname + strlen(xlogfname),
> +xlogreader->seg.ws_tli,
> +xlogreader->seg.ws_segno,
> WalSegSz);
> +
> +   /*
> +* Make sure pg_rewind doesn't remove this file,
> because it is
> +* required for postgres to start after rewind.
> +*/
> +   insert_keepwalhash_entry(xlogfname);
>
> MAXFNAMELEN is 64 and MAXPGPATH is 1024. strlen(XLOGDIR "/")
> is 7 because XLOGDIR is "pg_wal". So xlogfname has enough
> size but snprintf(xlogfname, MAXPGPATH) is wrong usage.
> (And XLogFileName() uses snprintf(xlogfname, MAXFNAMELEN)
> internally.)
>

Nice catch!

I don't think we need another buffer here; we just need to use
MAXFNAMELEN, because strlen("pg_wal/$wal_filename") + 1 = 32 fits
perfectly into 64 bytes.

The new version is attached.

Regards,
--
Alexander Kukushkin
From 19e9a3472782bdc92d0811e27cc3dd0fdcbf965d Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin 
Date: Fri, 12 Jul 2024 10:50:22 +0200
Subject: [PATCH v10] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind a bit wiser in terms of creating the filemap:
preserve on the target all WAL segments that contain records between the
last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina 
---
 src/bin/pg_rewind/filemap.c   | 62 +-
 src/bin/pg_rewind/filemap.h   |  3 +
 src/bin/pg_rewind/meson.build |  1 +
 src/bin/pg_rewind/parsexlog.c | 24 +++
 src/bin/pg_rewind/pg_rewind.c |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 65 +++
 6 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 4458324c9d..b357c28338 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -63,6 +63,28 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int	final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t
+{
+	const char *path;
+	uint32		status;
+}			skipwal_t;
+
+#define SH_PREFIX		keepwalhash
+#define SH_ELEMENT_TYPE	skipwal_t
+#define SH_KEY_TYPE		const char *
+#define	SH_KEY			path
+#define SH_HASH_KEY(tb, key)	hash_string(key)
+#define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define	SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static keepwalhash_hash * keepwalhash = NULL;
+
+static bool keepwalhash_entry_exists(const char *path);
+
 /*
  * Definition of one element part of an exclusion list, used to exclude
  * contents when rewinding.  "name" is the name of the file or path to
@@ -206,6 +228,35 @@ lookup_filehash_entry(const char *path)
 	return filehash_lookup(filehash, path);
 }
 
+/* Initialize a hash table to store WAL file names that must be kept */
+void
+keepwalhash_init(void)
+{
+	keepwalhash = keepwalhash_create(FILEHASH_INITIAL_SIZE, NULL);
+}
+
+/* Prevent a given file deletion during rewind */
+void
+insert_keepwalhash_entry(const char *path)
+{
+	skipwal_t  *entry;
+	bool		found;
+
+	/* Should only be called with keepwalhash initialized */
+	Assert(keepwalhash);
+
+	entry = keepwalhash_insert(keepwalhash, path, &found);
+
+	if (!found)
+		entry->path = pg_strdup(path);
+}
+
+static bool
+keepwalhash_entry_exists(const char *path)
+{
+	return keepwalhash_lookup(keepwalhash, path) != NULL;
+}
+
 /*
  * Callback for processing source file list.
  *
@@ -685,7 +736,16 @@ decide_file_action(file_entry_t *entry)
 	}
 	else if (entry->target_exists && !entry->source_exists)
 	{
-		/* File exists in target, but not source. Remove it. */
+		/* File exists in target, but not source. */
+
+		if (keepwalhash_entry_exists(path))
+		{
+			/* This is a WAL file that should be kept. */
+			pg_log_debug("Not removing %s because it is required for recovery", path);
+			return FILE_ACTION_NONE;
+		}
+
+		/* Otherwise remove an unexpected file. */
 		return FILE_ACTION_REMOVE;
 	}
 	else if (!entry->target_exists && !entry->source_exists)
diff --git a/src/bin/pg_rewind/filemap.h b/src/bin/pg_rewind/filemap.h
index 007e0f17cf..0cb6fcae00 100644
--- 

Re: Restart pg_usleep when interrupted

2024-07-12 Thread Bertrand Drouvot
Hi,

On Thu, Jul 11, 2024 at 10:15:41AM -0500, Sami Imseih wrote:
> 
> > I did a few tests with the patch and did not see any "large" drifts like the
> > ones observed above.
> 
> Thanks for testing.
> 
> > I think it could be "simplified" by making use of instr_time instead of 
> > timespec
> > for current and absolute. Then I think it would be enough to compare their
> > ticks.
> 
> Correct. I attached a v2 of this patch that uses instr_time to check the 
> elapsed
> time and break out of the loop. It needs some more benchmarking.

Thanks!

Outside of Nathan's comment:

1 ===

+* However, since nanosleep is susceptible to time drift when 
interrupted
+* frequently, we add a safeguard to break out of the nanosleep 
whenever the

I'm not sure that "nanosleep is susceptible to time drift when interrupted 
frequently"
is a correct wording.

What about?

"
However, since the time between frequent interruptions and restarts of the
nanosleep calls can substantially lead to drift in the time when the sleep
finally completes, we add"

2 ===

+static void vacuum_sleep(double msec);

What about a more generic name that could be used outside of vacuum?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: pg_rewind WAL segments deletion pitfall

2024-07-12 Thread Sutou Kouhei
Hi,

I'm reviewing patches in Commitfest 2024-07 from top to bottom:
https://commitfest.postgresql.org/48/

This is the 1st patch:
https://commitfest.postgresql.org/48/3874/

The latest patch can't be applied on master:
https://www.postgresql.org/message-id/CAFh8B=nnjtm9ke4_1mhpwgz2pv9yoyf6hmnyh5xact0aa4v...@mail.gmail.com

I've rebased on master. See the attached patch.
Here are changes for it:

* Resolve conflict
* Update copyright year to 2024 from 2023
* Add an added test to meson.build
* Run pgindent

Here are my review comments:

@@ -217,6 +221,26 @@ findLastCheckpoint(const char *datadir, XLogRecPtr 
forkptr, int tliIndex,
+   char    xlogfname[MAXFNAMELEN];
+
+   tli = xlogreader->seg.ws_tli;
+   segno = xlogreader->seg.ws_segno;
+
+   snprintf(xlogfname, MAXPGPATH, XLOGDIR "/");
+   XLogFileName(xlogfname + strlen(xlogfname),
+xlogreader->seg.ws_tli,
+xlogreader->seg.ws_segno, 
WalSegSz);
+
+   /*
+* Make sure pg_rewind doesn't remove this file, 
because it is
+* required for postgres to start after rewind.
+*/
+   insert_keepwalhash_entry(xlogfname);

MAXFNAMELEN is 64 and MAXPGPATH is 1024. strlen(XLOGDIR "/")
is 7 because XLOGDIR is "pg_wal". So xlogfname has enough
size but snprintf(xlogfname, MAXPGPATH) is wrong usage.
(And XLogFileName() uses snprintf(xlogfname, MAXFNAMELEN)
internally.)

How about using one more buffer?


char    xlogpath[MAXPGPATH];
char    xlogfname[MAXFNAMELEN];

tli = xlogreader->seg.ws_tli;
segno = xlogreader->seg.ws_segno;

XLogFileName(xlogfname,
 xlogreader->seg.ws_tli,
 xlogreader->seg.ws_segno, WalSegSz);
snprintf(xlogpath, MAXPGPATH, "%s/%s", XLOGDIR, xlogfname);

/*
 * Make sure pg_rewind doesn't remove this file, because it is
 * required for postgres to start after rewind.
 */
insert_keepwalhash_entry(xlogpath);



Thanks,
-- 
kou

In 
  "Re: pg_rewind WAL segments deletion pitfall" on Tue, 23 Jan 2024 09:23:29 
+0100,
  Alexander Kukushkin  wrote:

> Hi Peter,
> 
> On Mon, 22 Jan 2024 at 00:38, Peter Smith  wrote:
> 
>> 2024-01 Commitfest.
>>
>> Hi, This patch has a CF status of "Ready for Committer", but it is
>> currently failing some CFbot tests [1]. Please have a look and post an
>> updated version..
>>
>> ==
>> [1]
>> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/3874
>>
>>
> From what I can see all failures are not related to this patch:
> 1. Windows build failed with
> [10:52:49.679] 126/281 postgresql:recovery / recovery/019_replslot_limit
> ERROR 185.84s (exit status 255 or signal 127 SIGinvalid)
> 2. FreeBSD build failed with
> [09:11:57.656] 190/285 postgresql:psql / psql/010_tab_completion ERROR
> 0.46s exit status 2
> [09:11:57.656] 220/285 postgresql:authentication /
> authentication/001_password ERROR 0.57s exit status 2
> 
> In fact, I don't even see this patch being applied for these builds and the
> introduced TAP test being executed.
> 
> Regards,
> --
> Alexander Kukushkin
From 38edc07c49970ad0ac1818284f2e59e26591b3a4 Mon Sep 17 00:00:00 2001
From: Alexander Kukushkin 
Date: Sun, 6 Aug 2023 15:56:39 +0100
Subject: [PATCH v9] Be more picky with WAL segment deletion in pg_rewind

Make pg_rewind a bit wiser in terms of creating the filemap:
preserve on the target all WAL segments that contain records between the
last common checkpoint and the point of divergence.

Co-authored-by: Polina Bungina 
---
 src/bin/pg_rewind/filemap.c   | 62 +-
 src/bin/pg_rewind/filemap.h   |  3 +
 src/bin/pg_rewind/meson.build |  1 +
 src/bin/pg_rewind/parsexlog.c | 24 +++
 src/bin/pg_rewind/pg_rewind.c |  3 +
 src/bin/pg_rewind/t/010_keep_recycled_wals.pl | 65 +++
 6 files changed, 157 insertions(+), 1 deletion(-)
 create mode 100644 src/bin/pg_rewind/t/010_keep_recycled_wals.pl

diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 4458324c9d8..b357c28338a 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -63,6 +63,28 @@ static file_entry_t *lookup_filehash_entry(const char *path);
 static int	final_filemap_cmp(const void *a, const void *b);
 static bool check_file_excluded(const char *path, bool is_source);
 
+typedef struct skipwal_t
+{
+	const char *path;
+	uint32		status;
+}			skipwal_t;
+
+#define SH_PREFIX		keepwalhash
+#define SH_ELEMENT_TYPE	skipwal_t
+#define SH_KEY_TYPE		const char *
+#define	SH_KEY			path
+#define SH_HASH_KEY(tb, key)	hash_string(key)
+#define SH_EQUAL(tb, a, b)		(strcmp(a, b) == 0)
+#define	SH_SCOPE		static inline
+#define SH_RAW_ALLOCATOR	pg_malloc0

Re: Check lateral references within PHVs for memoize cache keys

2024-07-12 Thread Richard Guo
On Fri, Jul 12, 2024 at 11:18 AM Andrei Lepikhov  wrote:
> I'm not sure about stability of output format of AVG aggregate across
> different platforms. Maybe better to return the result of comparison
> between the AVG() and expected value?

I don't think this is a problem.  AFAIK we use AVG() a lot in the
existing test cases.

Thanks
Richard




Re: Things I don't like about \du's "Attributes" column

2024-07-12 Thread Pavel Luzanov

On 11.07.2024 15:07, Rafia Sabih wrote:

This looks much better than the current version.


Thank you, for looking into this.


Only thing is, I find
the column name Valid until confusing. With that name I am in danger
of taking it as the role's validity and not the passwords'.
How about naming it to something like Password validity...?


Yes, my first attempt was to name this column "Password expiration date" for the same reason. But then I decided that the 
column name should match the attribute name. Otherwise, you need to make 
some effort to understand which columns of the table correspond to which 
attributes of the roles. It is also worth considering translation into 
other languages. If the role attribute and the column have the same 
name, then they will probably be translated the same way. But the 
translation may be different for different terms, which will further 
confuse the situation. We can probably change the column name, but still 
the root of the confusion is caused by the attribute name, not the 
column name. What do you think?


--
Pavel Luzanov
Postgres Professional: https://postgrespro.com


Re: Pgoutput not capturing the generated columns

2024-07-12 Thread Peter Smith
Hi Shubham.

Thanks for separating the new BMS 'columns' modification.

Here are my review comments for the latest patch v17-0001.

==

1. src/backend/replication/pgoutput/pgoutput.c

  /*
  * Columns included in the publication, or NULL if all columns are
  * included implicitly.  Note that the attnums in this bitmap are not
+ * publication and include_generated_columns option: other reasons should
+ * be checked at user side.  Note that the attnums in this bitmap are not
  * shifted by FirstLowInvalidHeapAttributeNumber.
  */
  Bitmapset  *columns;
With this latest 0001 there is now no change to the original
interpretation of RelationSyncEntry BMS 'columns'. So, I think this
field comment should remain unchanged; i.e. it should be the same as
the current HEAD comment.

==
src/test/subscription/t/011_generated.pl

nitpick - comment changes for 'tab2' and 'tab3' to make them more consistent.

==
99.
Please refer to the attached diff patch which implements any nitpicks
described above.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/test/subscription/t/011_generated.pl 
b/src/test/subscription/t/011_generated.pl
index f449969..fe32987 100644
--- a/src/test/subscription/t/011_generated.pl
+++ b/src/test/subscription/t/011_generated.pl
@@ -28,14 +28,16 @@ $node_subscriber->safe_psql('postgres',
"CREATE TABLE tab1 (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 
22) STORED, c int)"
 );
 
-# publisher-side tab2 has generated col 'b' but subscriber-side tab2 has 
NON-generated col 'b'.
+# tab2:
+# publisher-side tab2 has generated col 'b'.
+# subscriber-side tab2 has non-generated col 'b'.
 $node_publisher->safe_psql('postgres',
"CREATE TABLE tab2 (a int, b int GENERATED ALWAYS AS (a * 2) STORED)");
 $node_subscriber->safe_psql('postgres', "CREATE TABLE tab2 (a int, b int)");
 
 # tab3:
-# publisher-side tab3 has generated col 'b' but
-# subscriber-side tab3 has DIFFERENT COMPUTATION generated col 'b'.
+# publisher-side tab3 has generated col 'b'.
+# subscriber-side tab3 has generated col 'b', using a different computation.
 $node_publisher->safe_psql('postgres',
"CREATE TABLE tab3 (a int, b int GENERATED ALWAYS AS (a + 10) STORED)");
 $node_subscriber->safe_psql('postgres',
@@ -97,8 +99,11 @@ is( $result, qq(1|22|
 6|132|), 'generated columns replicated');
 
 #
-# TEST tab2: the publisher-side col 'b' is generated, and the subscriber-side
-# col 'b' is not generated, so confirm that col 'b' IS replicated.
+# TEST tab2:
+# publisher-side tab2 has generated col 'b'.
+# subscriber-side tab2 has non-generated col 'b'.
+#
+# Confirm that col 'b' is replicated.
 #
 $node_publisher->safe_psql('postgres', "INSERT INTO tab2 VALUES (4), (5)");
 $node_publisher->wait_for_catchup('sub2');
@@ -110,10 +115,13 @@ is( $result, qq(4|8
 );
 
 #
-# TEST tab3: the publisher-side col 'b' is generated, and the subscriber-side
-# col 'b' is also generated, so confirmed that col 'b' IS NOT replicated. We
-# can know this because the result value is the subscriber-side computation
-# (which is not the same as the publisher-side computation for col 'b').
+# TEST tab3:
+# publisher-side tab3 has generated col 'b'.
+# subscriber-side tab3 has generated col 'b', using a different computation.
+#
+# Confirm that col 'b' is NOT replicated. We can know this because the result
+# value is the subscriber-side computation (which is different from the
+# publisher-side computation for this column).
 #
 $node_publisher->safe_psql('postgres', "INSERT INTO tab3 VALUES (4), (5)");
 $node_publisher->wait_for_catchup('sub3');


Re: Flush pgstats file during checkpoints

2024-07-12 Thread Michael Paquier
On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:
> On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:
>> I think those are two independent issues - knowing that the snapshot is
>> from the last checkpoint, and knowing that it's correct (not corrupted).
>> And yeah, we should be careful about fsync/durable_rename.
> 
> Yeah, that's bugging me as well.  I don't really get why we would not
> want durability at shutdown for this data.  So how about switching the
> end of pgstat_write_statsfile() to use durable_rename()?  Sounds like
> an independent change, worth on its own.

Please find attached a rebased patch set with the durability point
addressed in 0001.  There were also some conflicts.

Note that I have applied the previous 0002 adding an assert in
pgstat_write_statsfile() as 734c057a8935, as I've managed to break
again this assumption while hacking more on this area..
--
Michael
From b03087f1746233bce0d1fbbb69789795296e779b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 12 Jul 2024 15:31:27 +0900
Subject: [PATCH v2 1/4] Make write of pgstats file durable

This switches the pgstats code to use durable_rename() rather than
rename(), to make sure that the stats file is not lost should the host
be unplugged after PostgreSQL has been stopped.
---
 src/backend/utils/activity/pgstat.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ed7baa6e04..66cda798fb 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1478,12 +1478,9 @@ pgstat_write_statsfile(void)
 		tmpfile)));
 		unlink(tmpfile);
 	}
-	else if (rename(tmpfile, statfile) < 0)
+	else if (durable_rename(tmpfile, statfile, LOG) < 0)
 	{
-		ereport(LOG,
-(errcode_for_file_access(),
- errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-		tmpfile, statfile)));
+		/* error logged already */
 		unlink(tmpfile);
 	}
 }
-- 
2.45.2

From 4340ebd842049b5c706ada324bba36a5c5a78bb9 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 12 Jul 2024 15:22:12 +0900
Subject: [PATCH v2 2/4] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h|  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 26 +++---
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * 
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72..f137551e8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5641,7 +5641,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 66cda798fb..2952604a51 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include 
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * --
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ 

Re: Optimize WindowAgg's use of tuplestores

2024-07-12 Thread Ashutosh Bapat
On Thu, Jul 11, 2024 at 5:39 PM David Rowley  wrote:
>
> On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
>  wrote:
> > Observations
> > 1. The numbers corresponding to 10 and 100 partitions are higher when
> > patched. That might be just noise. I don't see any reason why it would
> > impact negatively when there are a small number of partitions. The
> > lower partition cases also have a higher number of rows per partition,
> > so is the difference between MemoryContextDelete() vs
> > MemoryContextReset() making any difference here. May be worth
> > verifying those cases carefully. Otherwise upto 1000 partitions, it
> > doesn't show any differences.
>
> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower.  If you
> look at GenerationDelete() (as what is getting called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is going to always be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
>
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
>
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition.  This is only evaluated on the first time after a rescan,
> so putting that code at the end of the function makes more sense.  The
> attached v2 patch has it this way.  You can see the numbers look
> slightly better in the attached graph.

The change to all_first seems unrelated to the tuplestore
optimization. But it brings the results in line with master for
lower partition counts.

Thanks for the script. I have similar results on my laptop.
From master
Testing with 100 partitions
latency average = 505.738 ms
latency average = 509.407 ms
latency average = 522.461 ms
Testing with 10 partitions
latency average = 329.026 ms
latency average = 327.504 ms
latency average = 342.556 ms
Testing with 1 partitions
latency average = 299.496 ms
latency average = 298.266 ms
latency average = 306.773 ms
Testing with 1000 partitions
latency average = 299.006 ms
latency average = 302.188 ms
latency average = 301.701 ms
Testing with 100 partitions
latency average = 305.411 ms
latency average = 286.935 ms
latency average = 302.432 ms
Testing with 10 partitions
latency average = 288.091 ms
latency average = 294.506 ms
latency average = 305.082 ms
Testing with 1 partitions
latency average = 301.121 ms
latency average = 319.615 ms
latency average = 301.141 ms

Patched
Testing with 100 partitions
latency average = 351.683 ms
latency average = 352.516 ms
latency average = 352.086 ms
Testing with 10 partitions
latency average = 300.626 ms
latency average = 303.584 ms
latency average = 306.959 ms
Testing with 1 partitions
latency average = 289.560 ms
latency average = 302.248 ms
latency average = 297.423 ms
Testing with 1000 partitions
latency average = 308.600 ms
latency average = 299.215 ms
latency average = 289.681 ms
Testing with 100 partitions
latency average = 301.216 ms
latency average = 286.240 ms
latency average = 291.232 ms
Testing with 10 partitions
latency average = 305.260 ms
latency average = 296.707 ms
latency average = 300.266 ms
Testing with 1 partitions
latency average = 316.199 ms
latency average = 314.043 ms
latency average = 309.425 ms

Now that you are also seeing the slowdown with your earlier patch, I
am wondering whether adding unlikely() by itself is a good
optimization. There might be some other reason behind the perceived
slowdown. How do the numbers look when you just add unlikely() without
any other changes?

-- 
Best Wishes,
Ashutosh Bapat




Re: Optimize WindowAgg's use of tuplestores

2024-07-11 Thread Tatsuo Ishii
> @@ -2684,6 +2726,14 @@ ExecEndWindowAgg(WindowAggState *node)
>   PlanState  *outerPlan;
>   int i;
>  
> + if (node->buffer != NULL)
> + {
> + tuplestore_end(node->buffer);
> +
> + /* nullify so that release_partition skips the 
> tuplestore_clear() */
> + node->buffer = NULL;
> + }
> +
> 
> Is it possible that node->buffer == NULL in ExecEndWindowAgg()?  If
> not, probably making it an Assert() or just removing the if() should
> be fine.

Of course it is possible, for example when there's no row in a partition.
Sorry for the noise.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: speed up a logical replica setup

2024-07-11 Thread Amit Kapila
On Thu, Jul 11, 2024 at 7:04 PM Euler Taveira  wrote:
>
> On Thu, Jul 11, 2024, at 8:22 AM, Zhijie Hou (Fujitsu) wrote:
>
> The analysis and suggestion look reasonable to me.
> Here is a small patch which does the same.
>
>
> LGTM. However, I would add a comment above the INSERT you moved around to 
> remind
> you that a transaction cannot be added there because it will break this test.
>

Pushed after adding a comment.

-- 
With Regards,
Amit Kapila.




Re: Allow non-superuser to cancel superuser tasks.

2024-07-11 Thread Michael Paquier
On Thu, Jul 11, 2024 at 08:50:57PM -0500, Nathan Bossart wrote:
> I'm not following how this is guaranteed to trigger an autovacuum quickly.
> Shouldn't we set autovacuum_vacuum_insert_threshold to 1 so that it is
> eligible for autovacuum?

You are right, this is not going to influence a faster startup of a
worker, so we could remove that entirely.  On closer look, the main
bottleneck is that the test is spending a lot of time waiting on the
naptime even if we are using the minimum value of 1s, as the scan of
pg_stat_activity checking for workers keeps looping.

[ ..thinks.. ]

I have one trick in my sleeve for this one to make the launcher more
responsive in checking the timestamps of the database list.  That's
not perfect, but it reduces the wait behind the worker startups by
400ms (1s previously as of the naptime, 600ms now) with a reload to
set the launcher's latch after the injection point has been attached.
The difference in runtime is noticeable.

>> +# Wait for the autovacuum worker to exit before scanning the logs.
>> +$node->poll_query_until('postgres',
>> +"SELECT count(*) = 0 FROM pg_stat_activity "
>> +. "WHERE pid = $av_pid AND backend_type = 'autovacuum worker';");
> 
> WFM.  Even if the PID is quickly reused, this should work.  We just might
> end up waiting a little longer.
> 
> Is it worth testing cancellation, too?

The point is to check after pg_signal_backend, so I am not sure it is
worth the extra cycles for the cancellation.

Attaching the idea, with a fix for the comment you have mentioned and
appending "regress_" the role names for the warnings generated by
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS, while on it.

What do you think?
--
Michael
From 8a827a8a59da4d3c2f2115151e1241cf663c11a1 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 12 Jul 2024 14:19:51 +0900
Subject: [PATCH v3] Add tap test for pg_signal_autovacuum role

---
 src/backend/postmaster/autovacuum.c   |  7 ++
 .../test_misc/t/006_signal_autovacuum.pl  | 94 +++
 2 files changed, 101 insertions(+)
 create mode 100644 src/test/modules/test_misc/t/006_signal_autovacuum.pl

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 928754b51c..4e4a0ccbef 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -100,6 +100,7 @@
 #include "utils/fmgroids.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/injection_point.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
@@ -1902,6 +1903,12 @@ do_autovacuum(void)
 	/* Start a transaction so our commands have one to play into. */
 	StartTransactionCommand();
 
+	/*
+	 * This injection point is put in a transaction block to work with a wait
+	 * that uses a condition variable.
+	 */
+	INJECTION_POINT("autovacuum-worker-start");
+
 	/*
 	 * Compute the multixact age for which freezing is urgent.  This is
 	 * normally autovacuum_multixact_freeze_max_age, but may be less if we are
diff --git a/src/test/modules/test_misc/t/006_signal_autovacuum.pl b/src/test/modules/test_misc/t/006_signal_autovacuum.pl
new file mode 100644
index 00..ded42b1be9
--- /dev/null
+++ b/src/test/modules/test_misc/t/006_signal_autovacuum.pl
@@ -0,0 +1,94 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test signaling autovacuum worker with pg_signal_autovacuum_worker.
+#
+# Only roles with privileges of pg_signal_autovacuum_worker are allowed to
+# signal autovacuum workers.  This test uses an injection point located
+# at the beginning of the autovacuum worker startup.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+# Initialize postgres
+my $psql_err = '';
+my $psql_out = '';
+my $node = PostgreSQL::Test::Cluster->new('node');
+$node->init;
+
+# This ensures a quick worker spawn.
+$node->append_conf(
+	'postgresql.conf', 'autovacuum_naptime = 1');
+$node->start;
+$node->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+$node->safe_psql(
+	'postgres', qq(
+CREATE ROLE regress_regular_role;
+CREATE ROLE regress_worker_role;
+GRANT pg_signal_autovacuum_worker TO regress_worker_role;
+));
+
+# From this point, autovacuum worker will wait at startup.
+$node->safe_psql('postgres',
+	"SELECT injection_points_attach('autovacuum-worker-start', 'wait');");
+
+$node->reload();
+
+# Wait until an autovacuum worker starts.
+$node->wait_for_event('autovacuum worker', 'autovacuum-worker-start');
+
+my $av_pid = $node->safe_psql(
+	'postgres', qq(
+SELECT pid FROM pg_stat_activity WHERE backend_type = 'autovacuum worker';
+));
+
+# Regular role cannot terminate autovacuum worker.
+my $terminate_with_no_pg_signal_av = $node->psql(
+	'postgres', qq(
+SET ROLE regress_regular_role;
+SELECT 

RE: Adding skip scan (including MDAM style range skip scan) to nbtree

2024-07-11 Thread Masahiro.Ikeda
Hi,

Since I'd like to understand skip scan in order to improve the EXPLAIN output
for multicolumn B-Tree indexes[1], I began trying the skip scan with some
queries and looking into the source code.

I have some feedback and comments.

(1)

At first, I was surprised by your benchmark result because skip scan can
improve performance so much. I agree that many users will be happy with the
feature, especially for OLAP use cases. I look forward to it in v18.


(2)

I found that the cost is estimated to be much higher if the number of skipped
attributes is more than two. Is this expected behavior?

# Test result. The attached file is the detail of tests.

-- Index Scan
-- The actual time is low since the skip scan works well
-- But the cost is higher than that of the seq scan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
  QUERY PLAN
   
---
 Index Scan using idx_id1_id2_id3 on public.test  (cost=0.42..26562.77 rows=984 
width=20) (actual time=0.051..15.533 rows=991 loops=1)
   Output: id1, id2, id3, value
   Index Cond: (test.id3 = 101)
   Buffers: shared hit=4402
 Planning:
   Buffers: shared hit=7
 Planning Time: 0.234 ms
 Execution Time: 15.711 ms
(8 rows)

-- Seq Scan
-- The actual time is high, but the cost is lower than that of the above Index Scan.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT * FROM test WHERE id3 = 101;
  QUERY PLAN
   
---
 Gather  (cost=1000.00..12676.73 rows=984 width=20) (actual time=0.856..113.861 
rows=991 loops=1)
   Output: id1, id2, id3, value
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=6370
   ->  Parallel Seq Scan on public.test  (cost=0.00..11578.33 rows=410 
width=20) (actual time=0.061..102.016 rows=330 loops=3)
 Output: id1, id2, id3, value
 Filter: (test.id3 = 101)
 Rows Removed by Filter: 333003
 Buffers: shared hit=6370
 Worker 0:  actual time=0.099..98.014 rows=315 loops=1
   Buffers: shared hit=2066
 Worker 1:  actual time=0.054..97.162 rows=299 loops=1
   Buffers: shared hit=1858
 Planning:
   Buffers: shared hit=19
 Planning Time: 0.194 ms
 Execution Time: 114.129 ms
(18 rows)


I looked at btcostestimate() to find the reason and found that the bound quals
and cost.num_sa_scans are different from my expectations.

My assumption is
* bound quals is id3=XXX (and id1 and id2 are skipped attributes)
* cost.num_sa_scans = 100 (=10*10 because assuming 10 primitive index scans
  per skipped attribute)

But that was wrong. For the above index scan, the result is
* bound quals is NULL
* cost.num_sa_scans = 1


I know you said the below, but I'd like to know whether the above is expected or not.

> That approach seems far more practicable than preempting the problem
> during planning or during nbtree preprocessing. It seems like it'd be
> very hard to model the costs statistically. We need revisions to
> btcostestimate, of course, but the less we can rely on btcostestimate
> the better. As I said, there are no new index paths generated by the
> optimizer for any of this.

I couldn't quite understand why the below logic is there.

  btcostestimate()
  (...omit...)
        if (indexcol != iclause->indexcol)
        {
            /* no quals at all for indexcol */
            found_skip = true;
            if (index->pages < 100)
                break;
            num_sa_scans += 10 * (indexcol - iclause->indexcol); // why add minus value?
            continue;                                            // why skip to add bound quals?
        }


(3)

Currently, there is an assumption that "there will be 10 primitive index scans
per skipped attribute". Is there any chance to use pg_stats.n_distinct?

[1] Improve EXPLAIN output for multicolumn B-Tree Index
https://www.postgresql.org/message-id/flat/TYWPR01MB1098260B694D27758FE2BA46FB1C92%40TYWPR01MB10982.jpnprd01.prod.outlook.com

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION



compare_cost.sql
Description: compare_cost.sql


compare_cost_with_v2_patch.out
Description: compare_cost_with_v2_patch.out


Re: Test to dump and restore objects left behind by regression

2024-07-11 Thread Ashutosh Bapat
On Tue, Jul 9, 2024 at 1:07 PM Michael Paquier  wrote:
>
> On Mon, Jul 08, 2024 at 03:59:30PM +0530, Ashutosh Bapat wrote:
> > Before submitting the patch, I looked for all the places which mention
> > AdjustUpgrade or AdjustUpgrade.pm to find places where the new module needs
> > to be mentioned. But I didn't find any. AdjustUpgrade is not mentioned
> > in src/test/perl/Makefile or src/test/perl/meson.build. Do we want to also
> > add AdjustUpgrade.pm in those files?
>
> Good question.  This has not been mentioned on the thread that added
> the module:
> https://www.postgresql.org/message-id/891521.1673657296%40sss.pgh.pa.us
>
> And I could see it as being useful if installed.  The same applies to
> Kerberos.pm, actually.  I'll ping that on a new thread.

For now, it may be better to maintain the status quo. If we see a need to
use these modules in future by say extensions or tests outside core
tree, we will add them to meson and make files.

>
> > We could forget 0002. I am fine with that.  But I can change the code such
> > that formats other than "plain" are tested when PG_TEST_EXTRAS contains
> > "regress_dump_formats". Would that be acceptable?
>
> Interesting idea.  That may be acceptable, under the same arguments as
> the xid_wraparound one.

Done. Added a new entry in PG_TEST_EXTRA documentation.

I have merged the two patches now.


-- 
Best Wishes,
Ashutosh Bapat
From aff0939b8aa8566daecf1ac35b9c7fce9fa851ca Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 27 Jun 2024 10:03:53 +0530
Subject: [PATCH 2/2] Test pg_dump/restore of regression objects

002_pg_upgrade.pl tests pg_upgrade on the regression database left
behind by regression run. Modify it to test dump and restore of the
regression database as well.

Regression database created by regression run contains almost all the
database objects supported by PostgreSQL in various states. The test
thus covers wider dump and restore scenarios.

When PG_TEST_EXTRA has 'regress_dump_formats' in it, test dump and
restore in all supported formats. Otherwise test only "plain" format so
that the test finishes quickly.

Author: Ashutosh Bapat
Reviewed by: Michael Paquier
Discussion: https://www.postgresql.org/message-id/CAExHW5uF5V=Cjecx3_Z=7xfh4rg2wf61pt+hfquzjbqourz...@mail.gmail.com
---
 doc/src/sgml/regress.sgml   |  13 ++
 src/bin/pg_upgrade/t/002_pg_upgrade.pl  | 173 ++--
 src/test/perl/PostgreSQL/Test/AdjustDump.pm | 109 
 3 files changed, 277 insertions(+), 18 deletions(-)
 create mode 100644 src/test/perl/PostgreSQL/Test/AdjustDump.pm

diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index d1042e0222..8c1a9ddc40 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -336,6 +336,19 @@ make check-world PG_TEST_EXTRA='kerberos ldap ssl load_balance libpq_encryption'
   
  
 
+
+
+ regress_dump_formats
+ 
+  
+   When enabled,
+   src/bin/pg_upgrade/t/002_pg_upgrade.pl tests dump
+   and restore of regression database using all dump formats. Otherwise
+   tests only plain format. Not enabled by default
+   because it is resource intensive.
+  
+ 
+

 
Tests for features that are not supported by the current build
diff --git a/src/bin/pg_upgrade/t/002_pg_upgrade.pl b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
index 17af2ce61e..613512ffe7 100644
--- a/src/bin/pg_upgrade/t/002_pg_upgrade.pl
+++ b/src/bin/pg_upgrade/t/002_pg_upgrade.pl
@@ -13,6 +13,7 @@ use File::Path qw(rmtree);
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use PostgreSQL::Test::AdjustUpgrade;
+use PostgreSQL::Test::AdjustDump;
 use Test::More;
 
 # Can be changed to test the other modes.
@@ -36,9 +37,9 @@ sub generate_db
 		"created database with ASCII characters from $from_char to $to_char");
 }
 
-# Filter the contents of a dump before its use in a content comparison.
-# This returns the path to the filtered dump.
-sub filter_dump
+# Filter the contents of a dump before its use in a content comparison for
+# upgrade testing. This returns the path to the filtered dump.
+sub filter_dump_for_upgrade
 {
 	my ($is_old, $old_version, $dump_file) = @_;
 	my $dump_contents = slurp_file($dump_file);
@@ -61,6 +62,44 @@ sub filter_dump
 	return $dump_file_filtered;
 }
 
+# Test that the given two files match.  The files usually contain pg_dump
+# output in "plain" format. Output the difference if any.
+sub compare_dumps
+{
+	my ($dump1, $dump2, $testname) = @_;
+
+	my $compare_res = compare($dump1, $dump2);
+	is($compare_res, 0, $testname);
+
+	# Provide more context if the dumps do not match.
+	if ($compare_res != 0)
+	{
+		my ($stdout, $stderr) =
+		  run_command([ 'diff', '-u', $dump1, $dump2 ]);
+		print "=== diff of $dump1 and $dump2\n";
+		print "=== stdout ===\n";
+		print $stdout;
+		print "=== stderr ===\n";
+		print $stderr;
+		print "=== EOF ===\n";
+	}
+}
+
+# Adjust the contents of a dump before its use in a 

Re: Clang function pointer type warnings in v14, v15

2024-07-11 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Jul 12, 2024 at 2:26 PM Tom Lane  wrote:
>> Thomas Munro  writes:
>>> I don't see any on clang 16 in the 12 and 13 branches.  Where are you
>>> seeing them?

>> In the buildfarm.  "adder" and a bunch of other machines are throwing
>> -Wcast-function-type in about two dozen places in v13, and "jay" is
>> emitting several hundred -Wdeprecated-non-prototype warnings.

> Ah, I see.  A few animals running with -Wextra.

Oh, got it.

> I guess we have to decide if it's a matter for the tree, or for the
> people who add -Wextra, ie to decide if they want to filter that down
> a bit with some -Wno-XXX.

I think it's reasonable to say that if you add -Wextra then it's
up to you to deal with the results.  If we were contemplating
enabling -Wextra as standard, then it'd be our problem --- but
nobody has proposed that AFAIR.  In any case we'd surely not
add it now to near-dead branches.

regards, tom lane




Re: Clang function pointer type warnings in v14, v15

2024-07-11 Thread Thomas Munro
On Fri, Jul 12, 2024 at 2:26 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Sat, Jul 6, 2024 at 2:35 PM Tom Lane  wrote:
> >> I see that there are a boatload of related warnings in older
> >> branches too; do we want to try to do anything about that?  (I doubt
> >> code changes would be in-scope, but maybe adding a -Wno-foo switch
> >> to the build flags would be appropriate.)
>
> > I don't see any on clang 16 in the 12 and 13 branches.  Where are you
> > seeing them?
>
> In the buildfarm.  "adder" and a bunch of other machines are throwing
> -Wcast-function-type in about two dozen places in v13, and "jay" is
> emitting several hundred -Wdeprecated-non-prototype warnings.

Ah, I see.  A few animals running with -Wextra.  Whereas in v14+ we
have -Wcast-function-type in the actual tree, which affects people's
workflows more directly.  Like my regular machine, or CI, when a
couple of the OSes' house compilers eventually reach clang 16.

I guess we have to decide if it's a matter for the tree, or for the
people who add -Wextra, ie to decide if they want to filter that down
a bit with some -Wno-XXX.  Adder already has some of those:

 'CFLAGS' => '-O1 -ggdb -g3
-fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter
-Wno-sign-compare -Wno-missing-field-initializers -O0',




Re: Optimize WindowAgg's use of tuplestores

2024-07-11 Thread Tatsuo Ishii
Hi David,

Thank you for the patch.

> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower.  If you
> look at GenerationDelete() (as what is getting called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is going to always be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
> 
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
> 
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition.  This is only evaluated on the first time after a rescan,
> so putting that code at the end of the function makes more sense.  The
> attached v2 patch has it this way.  You can see the numbers look
> slightly better in the attached graph.
> 
> Thanks for having a look at this.
> 
> David

@@ -2684,6 +2726,14 @@ ExecEndWindowAgg(WindowAggState *node)
PlanState  *outerPlan;
int i;
 
+   if (node->buffer != NULL)
+   {
+   tuplestore_end(node->buffer);
+
+   /* nullify so that release_partition skips the tuplestore_clear() */
+   node->buffer = NULL;
+   }
+

Is it possible that node->buffer == NULL in ExecEndWindowAgg()?  If
not, probably making it an Assert() or just removing the if() should
be fine.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Check lateral references within PHVs for memoize cache keys

2024-07-11 Thread Andrei Lepikhov

On 7/11/24 16:18, Richard Guo wrote:

On Fri, Jun 28, 2024 at 10:14 PM Andrei Lepikhov  wrote:

I got the point about Memoize over join, but as a join still calls
replace_nestloop_params to replace parameters in its clauses, why not to
invent something similar to find Memoize keys inside specific JoinPath
node? It is not the issue of this patch, though - but is it doable?


I don't think it's impossible to do, but I'm skeptical that there's an
easy way to identify all the cache keys for joinrels, without having
available ppi_clauses and lateral_vars.

Ok




IMO, the code:
if (bms_nonempty_difference(outerPlan->chgParam, node->keyparamids))
cache_purge_all(node);

is a good place to check an assertion: is it really the parent query
parameters that make a difference between memoize keys and node list of
parameters?


I don't think we have enough info available here to identify which
params within outerPlan->chgParam are from outer levels.  Maybe we can
store root->outer_params in the MemoizeState node to help with this
assertion, but I'm not sure if it's worth the trouble.

Got it


Attached is an updated version of this patch.
I'm not sure about the stability of the output format of the AVG aggregate
across different platforms. Maybe it would be better to return the result of
a comparison between AVG() and the expected value?


--
regards, Andrei Lepikhov





Re: Removing unneeded self joins

2024-07-11 Thread Andrei Lepikhov

On 7/11/24 14:43, jian he wrote:

On Tue, Jul 9, 2024 at 2:06 PM Andrei Lepikhov  wrote:


On 7/2/24 07:25, jian he wrote:

to make sure it's correct, I have added a lot of tests,
Some of this may be contrived, maybe some of the tests are redundant.

Thanks for your job!
I passed through the patches and have some notes:
1. Patch 0001 has not been applied anymore since the previous week's
changes in the core. Also, there is one place with trailing whitespace.


thanks.
because the previous thread mentioned the EPQ problem.
in remove_useless_self_joins, i make it can only process CMD_SELECT query.
I would like to object here: IMO, it is just a mishap we made because of a
long history of patch transformations. Along the way we lost the case where a
RowMark exists for only one of the candidate relations.
Also, after review I think we don't need so many new tests. Specifically 
for DML we already have one:


EXPLAIN (COSTS OFF)
UPDATE sj sq SET b = 1 FROM sj as sz WHERE sq.a = sz.a;

And we should just add something to elaborate it a bit.
See the attached patch containing my proposal to improve the v4-0001
main SJE patch. I think it resolves the issue with the EPQ assertion as well
as the problems with the RETURNING value.


--
regards, Andrei Lepikhov
From c04add30999ecd64c51bde7db56a6e5637c16c74 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Tue, 9 Jul 2024 12:25:23 +0700
Subject: [PATCH] Apply SJE to DML queries: Just don't include result relation
 to the set of SJE candidates.

Also, fix the subtle bug with RowMarks.
---
 src/backend/optimizer/plan/analyzejoins.c | 24 +++--
 src/test/regress/expected/join.out| 61 +++
 src/test/regress/sql/join.sql | 17 ++-
 3 files changed, 84 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index bb14597762..d2b9ba7c08 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -1860,10 +1860,6 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark,
 	/* restore the rangetblref in a proper order. */
 	restore_rangetblref((Node *) root->parse, toKeep->relid, toRemove->relid, 0, 0);
 
-	/* See remove_self_joins_one_group() */
-	Assert(root->parse->resultRelation != toRemove->relid);
-	Assert(root->parse->resultRelation != toKeep->relid);
-
 	/* Replace links in the planner info */
 	remove_rel_from_query(root, toRemove, toKeep->relid, NULL, NULL);
 
@@ -2046,14 +2042,6 @@ remove_self_joins_one_group(PlannerInfo *root, Relids relids)
 	{
 		RelOptInfo *inner = root->simple_rel_array[r];
 
-		/*
-		 * We don't accept result relation as either source or target relation
-		 * of SJE, because result relation has different behavior in
-		 * EvalPlanQual() and RETURNING clause.
-		 */
-		if (root->parse->resultRelation == r)
-			continue;
-
 		k = r;
 
 		while ((k = bms_next_member(relids, k)) > 0)
@@ -2069,9 +2057,6 @@ remove_self_joins_one_group(PlannerInfo *root, Relids relids)
 			PlanRowMark *imark = NULL;
 			List	   *uclauses = NIL;
 
-			if (root->parse->resultRelation == k)
-continue;
-
 			/* A sanity check: the relations have the same Oid. */
 			Assert(root->simple_rte_array[k]->relid ==
    root->simple_rte_array[r]->relid);
@@ -2121,7 +2106,8 @@ remove_self_joins_one_group(PlannerInfo *root, Relids relids)
 if (omark && imark)
 	break;
 			}
-			if (omark && imark && omark->markType != imark->markType)
+			if (((omark == NULL) ^ (imark == NULL)) ||
+(omark && omark->markType != imark->markType))
 continue;
 
 			/*
@@ -2231,7 +2217,8 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove)
 			 */
 			if (rte->rtekind == RTE_RELATION &&
 rte->relkind == RELKIND_RELATION &&
-rte->tablesample == NULL)
+rte->tablesample == NULL &&
+varno != root->parse->resultRelation)
 			{
 Assert(!bms_is_member(varno, relids));
 relids = bms_add_member(relids, varno);
@@ -2300,6 +2287,9 @@ remove_self_joins_recurse(PlannerInfo *root, List *joinlist, Relids toRemove)
 
 relids = bms_del_members(relids, group);
 
+/* Don't apply SJE to result relation */
+Assert(!bms_is_member(root->parse->resultRelation, group));
+
 /*
  * Try to remove self-joins from a group of identical entries.
  * Make the next attempt iteratively - if something is deleted
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 4e4cec633a..78dfcd4866 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -7068,6 +7068,18 @@ UPDATE sj sq SET b = 1 FROM sj as sz WHERE sq.a = sz.a;
->  Seq Scan on sj sz
 (6 rows)
 
+EXPLAIN (COSTS OFF)
+UPDATE sj sq SET b = sz.b FROM sj as sz WHERE sq.a = sz.a;
+ QUERY PLAN  
+-
+ Update on sj sq
+   ->  Nested Loop
+ Join Filter: (sq.a 

Re: Logging which local address was connected to in log_line_prefix

2024-07-11 Thread David Steele

On 7/11/24 23:09, Greg Sabino Mullane wrote:
Thanks for the review. Please find attached a new version with proper 
tabs and indenting.


This looks good to me now. +1 overall for the feature.

Regards,
-David




Re: Logical Replication of sequences

2024-07-11 Thread Peter Smith
Hi Vignesh. Here are the rest of my comments for patch v20240705-0003.

(Apologies for the length of this post, but it was unavoidable due to
this being the first review of a very large 1700-line patch.)

==
src/backend/catalog/pg_subscription.c

1. GetSubscriptionSequences

+/*
+ * Get the sequences for the subscription.
+ *
+ * The returned list is palloc'ed in the current memory context.
+ */

Is that comment right? The palloc seems to be done in
CacheMemoryContext, not in the current context.

~

2.
The code is very similar to the other function
GetSubscriptionRelations(). In fact I did not understand how the 2
functions know what they are returning:

E.g. how does GetSubscriptionRelations not return sequences too?
E.g. how does GetSubscriptionSequences not return relations too?

==
src/backend/commands/subscriptioncmds.c

CreateSubscription:
nitpick - put the sequence logic *after* the relations logic because
that is the order that seems used everywhere else.

~~~

3. AlterSubscription_refresh

- logicalrep_worker_stop(sub->oid, relid);
+ /* Stop the worker if relation kind is not sequence*/
+ if (relkind != RELKIND_SEQUENCE)
+ logicalrep_worker_stop(sub->oid, relid);

Can you give more reasons in the comment for why we skip the stop for a sequence worker?

~

nitpick - period and space in the comment

~~~

4.
  for (off = 0; off < remove_rel_len; off++)
  {
  if (sub_remove_rels[off].state != SUBREL_STATE_READY &&
- sub_remove_rels[off].state != SUBREL_STATE_SYNCDONE)
+ sub_remove_rels[off].state != SUBREL_STATE_SYNCDONE &&
+ get_rel_relkind(sub_remove_rels[off].relid) != RELKIND_SEQUENCE)
  {
Would this new logic perhaps be better written as:

if (get_rel_relkind(sub_remove_rels[off].relid) == RELKIND_SEQUENCE)
  continue;

~~~

AlterSubscription_refreshsequences:
nitpick - rename AlterSubscription_refresh_sequences

~
5.
There is significant code overlap between the existing
AlterSubscription_refresh and the new function
AlterSubscription_refreshsequences. I wonder if it is better to try to
combine the logic and just pass another parameter to
AlterSubscription_refresh saying to update the existing sequences if
necessary. Particularly since the AlterSubscription_refresh is already
tweaked to work for sequences. Of course, the resulting combined
function would be large and complex, but maybe that would still be
better than having giant slabs of nearly identical cut/paste code.
Thoughts?

~~~

check_publications_origin:
nitpick - move variable declarations
~~~

fetch_sequence_list:
nitpick - change /tablelist/seqlist/
nitpick - tweak the spaces of the SQL for alignment (similar to
fetch_table_list)

~

6.
+" WHERE s.pubname IN (");
+ first = true;
+ foreach_ptr(String, pubname, publications)
+ {
+ if (first)
+ first = false;
+ else
+ appendStringInfoString(, ", ");
+
+ appendStringInfoString(, quote_literal_cstr(pubname->sval));
+ }
+ appendStringInfoChar(, ')');

IMO this can be written much better by using get_publications_str()
function to do all this list work.

==
src/backend/replication/logical/launcher.c

7. logicalrep_worker_find

/*
 * Walks the workers array and searches for one that matches given
 * subscription id and relid.
 *
 * We are only interested in the leader apply worker or table sync worker.
 */

The above function comment (not in the patch 0003) is stale because
this AFAICT this is also going to return sequence workers if it finds
one.

~~~

8. logicalrep_sequence_sync_worker_find

+/*
+ * Walks the workers array and searches for one that matches given
+ * subscription id.
+ *
+ * We are only interested in the sequence sync worker.
+ */
+LogicalRepWorker *
+logicalrep_sequence_sync_worker_find(Oid subid, bool only_running)

There are other similar functions for walking the workers array to
search for a worker. Instead of having different functions for
different cases, wouldn't it be cleaner to combine these into a single
function, where you pass a parameter (e.g. a mask of worker types that
you are interested in finding)?

~

nitpick - declare a for loop variable 'i'

~~~

9. logicalrep_apply_worker_find

+static LogicalRepWorker *
+logicalrep_apply_worker_find(Oid subid, bool only_running)

All the other find* functions assume the lock is already held
(Assert(LWLockHeldByMe(LogicalRepWorkerLock));). But this one is
different. IMO it might be better to acquire the lock in the caller to
make all the find* functions look the same. Anyway, that will help to
combine everything into 1 "find" worker as suggested in the previous
review comment #8.

~

nitpick - declare a for loop variable 'i'
nitpick - removed unnecessary parens in condition.

~~~

10. logicalrep_worker_launch

/*--
* Sanity checks:
* - must be valid worker type
* - tablesync workers are only ones to have relid
* - parallel apply worker is the only kind of subworker
*/

The above code-comment (not in the 0003 patch) seems stale. This
should now also mention sequence sync workers, right?

~~~

11.
- 

Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2024-07-11 Thread Tender Wang
Richard Guo  于2024年7月12日周五 10:30写道:

> On Sat, Jul 6, 2024 at 5:32 PM Richard Guo  wrote:
> > Here is a new rebase.
> >
> > I'm planning to push it next week, barring any objections.
>
> Pushed.
>
> Thanks
> Richard
>

Thanks for pushing.

-- 
Tender Wang


Re: Should consider materializing the cheapest inner path in consider_parallel_nestloop()

2024-07-11 Thread Richard Guo
On Sat, Jul 6, 2024 at 5:32 PM Richard Guo  wrote:
> Here is a new rebase.
>
> I'm planning to push it next week, barring any objections.

Pushed.

Thanks
Richard




Re: Redundant syscache access in get_rel_sync_entry()

2024-07-11 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Jul 11, 2024 at 07:10:58PM +0530, Ashutosh Bapat wrote:
>> I think it's just convenient. We do that at multiple places; not exactly
>> these functions but functions which fetch relation attributes from cached
>> tuples. Given that the tuple is cached and local to the backend, it's not
>> too expensive.  But if there are multiple places which do something like
>> this, we may want to create more function get_rel_* function which return
>> multiple properties in one function call. I see get_rel_namspace() and
>> get_rel_name() called together at many places.

> That's not worth the complications based on the level of caching.
> This code is fine as-is.

I could get behind creating such functions if there were a
demonstrable performance win, but in places that are not hot-spots
that's unlikely to be demonstrable.

regards, tom lane




Re: Clang function pointer type warnings in v14, v15

2024-07-11 Thread Tom Lane
Thomas Munro  writes:
> On Sat, Jul 6, 2024 at 2:35 PM Tom Lane  wrote:
>> I see that there are a boatload of related warnings in older
>> branches too; do we want to try to do anything about that?  (I doubt
>> code changes would be in-scope, but maybe adding a -Wno-foo switch
>> to the build flags would be appropriate.)

> I don't see any on clang 16 in the 12 and 13 branches.  Where are you
> seeing them?

In the buildfarm.  "adder" and a bunch of other machines are throwing
-Wcast-function-type in about two dozen places in v13, and "jay" is
emitting several hundred -Wdeprecated-non-prototype warnings.

> Has anyone thought about the -Wclobbered warnings from eg tamandua?

I decided long ago that gcc's algorithm for emitting that warning
has no detectable connection to real problems.  Maybe it's worth
silencing them on general principles, but I've seen no sign that
it would actually fix any bugs.

regards, tom lane




Re: improve performance of pg_dump with many sequences

2024-07-11 Thread Nathan Bossart
On Wed, Jul 10, 2024 at 11:52:33PM -0300, Euler Taveira wrote:
> On Wed, Jul 10, 2024, at 7:05 PM, Nathan Bossart wrote:
>> Unfortunately, I think we have to keep this workaround since older minor
>> releases of PostgreSQL don't have the fix.
> 
> Hmm. Right.

On second thought, maybe we should just limit this improvement to the minor
releases with the fix so that we _can_ get rid of the workaround.  Or we
could use the hacky workaround only for versions with the bug.

-- 
nathan




Re: Allow non-superuser to cancel superuser tasks.

2024-07-11 Thread Nathan Bossart
On Wed, Jul 10, 2024 at 02:14:55PM +0900, Michael Paquier wrote:
> +# Only non-superuser roles granted pg_signal_autovacuum_worker are allowed
> +# to signal autovacuum workers.  This test uses an injection point located
> +# at the beginning of the autovacuum worker startup.

nitpick: Superuser roles are also allowed to signal autovacuum workers.
Maybe this should read "Only roles with privileges of..."

> +# Create some content and set an aggressive autovacuum.
> +$node->safe_psql(
> + 'postgres', qq(
> +CREATE TABLE tab_int(i int);
> +ALTER TABLE tab_int SET (autovacuum_vacuum_cost_limit = 1);
> +ALTER TABLE tab_int SET (autovacuum_vacuum_cost_delay = 100);
> +));
> +
> +$node->safe_psql(
> + 'postgres', qq(
> +INSERT INTO tab_int VALUES(1);
> +));
> +
> +# Wait until an autovacuum worker starts.
> +$node->wait_for_event('autovacuum worker', 'autovacuum-worker-start');

I'm not following how this is guaranteed to trigger an autovacuum quickly.
Shouldn't we set autovacuum_vacuum_insert_threshold to 1 so that it is
eligible for autovacuum?

> +# Wait for the autovacuum worker to exit before scanning the logs.
> +$node->poll_query_until('postgres',
> + "SELECT count(*) = 0 FROM pg_stat_activity "
> + . "WHERE pid = $av_pid AND backend_type = 'autovacuum worker';");

WFM.  Even if the PID is quickly reused, this should work.  We just might
end up waiting a little longer.

Is it worth testing cancellation, too?

-- 
nathan




Re: Redundant syscache access in get_rel_sync_entry()

2024-07-11 Thread Michael Paquier
On Thu, Jul 11, 2024 at 07:10:58PM +0530, Ashutosh Bapat wrote:
> I think it's just convenient. We do that at multiple places; not exactly
> these functions but functions which fetch relation attributes from cached
> tuples. Given that the tuple is cached and local to the backend, it's not
> too expensive.  But if there are multiple places which do something like
> this, we may want to create more get_rel_* functions which return
> multiple properties in one function call. I see get_rel_namespace() and
> get_rel_name() called together in many places.

That's not worth the complications based on the level of caching.
This code is fine as-is.
--
Michael


signature.asc
Description: PGP signature


Re: Clang function pointer type warnings in v14, v15

2024-07-11 Thread Thomas Munro
On Sat, Jul 6, 2024 at 2:35 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > So I propose to back-patch 101c37cd to those two branches.
>
> +1.

Done.

> I see that there are a boatload of related warnings in older
> branches too; do we want to try to do anything about that?  (I doubt
> code changes would be in-scope, but maybe adding a -Wno-foo switch
> to the build flags would be appropriate.)

I don't see any on clang 16 in the 12 and 13 branches.  Where are you
seeing them?

Has anyone thought about the -Wclobbered warnings from eg tamandua?




Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)

2024-07-11 Thread Michael Paquier
On Thu, Jul 11, 2024 at 01:11:05PM +0300, Aleksander Alekseev wrote:
> Thanks, Michael. I prepared a corrected patchset.

A comment about v3-0001.

-* If true, use long segment filenames formed from lower 48 bits of the
-* segment number, e.g. pg_xact/1234. Otherwise, use short
-* filenames formed from lower 16 bits of the segment number e.g.
-* pg_xact/1234.
+* If true, use long segment filenames. Use short filenames otherwise.
+* See SlruFileName().

We're losing some details here even if SlruFileName() has some
explanations, because one would need to read through the snprintf's
04X to know that short file names include 4 characters.  I'm OK to
mention SlruFileName() rather than duplicate the knowledge here, but
SlruFileName() should also be updated to mention the same level of
details with some examples of file names.
--
Michael




Re: Pluggable cumulative statistics

2024-07-11 Thread Michael Paquier
On Thu, Jul 11, 2024 at 01:29:08PM +, Bertrand Drouvot wrote:
> Please find attached a patch to do so (attached as .txt to not perturb the
> cfbot).

+ * Reads in existing statistics file into the shared stats hash (for non fixed
+ * amount stats) or into the fixed amount stats.

Thanks.  I have applied a simplified version of that, not mentioning
the details of what happens depending on the kinds of stats dealt
with.
--
Michael




Re: CREATE INDEX CONCURRENTLY on partitioned index

2024-07-11 Thread Michael Paquier
On Thu, Jul 11, 2024 at 09:35:24PM +0100, Ilya Gladyshev wrote:
> It is broken in master, I just didn’t want to create a separate
> thread, but it can be fixed independently. As I remember, the
> problem is that progress is tracked for each table in the hierarchy
> as if the table is processed separately, without ever setting
> partitions_total and partitions_done counters.

Please let's move this point to its own thread and deal with it with
an independent patch.  Hiding that in a thread that's already quite
long is not a good idea.  This needs proper review, and a separate
thread with a good subject to describe the problem will attract a
better audience to deal with the problem you are seeing.

I was not paying much attention until you mentioned that this was
an issue with HEAD.
--
Michael




Re: Use read streams in CREATE DATABASE command when the strategy is wal_log

2024-07-11 Thread Noah Misch
On Tue, Apr 16, 2024 at 02:12:19PM +0300, Nazir Bilal Yavuz wrote:
> I am working on using read streams in the CREATE DATABASE command when the
> strategy is wal_log. RelationCopyStorageUsingBuffer() function is used in
> this context. This function reads source buffers then copies them to the

Please rebase.  I applied this to 40126ac for review purposes.

> a. Timings:
> b. strace:
> c. perf:

Thanks for including those details.  That's valuable confirmation.

> Subject: [PATCH v1 1/3] Refactor PinBufferForBlock() to remove if checks about
>  persistence
> 
> There are if checks in PinBufferForBlock() function to set persistence
> of the relation and this function is called for the each block in the
> relation. Instead of that, set persistence of the relation before
> PinBufferForBlock() function.

I tried with the following additional patch to see if PinBufferForBlock() ever
gets invalid smgr_relpersistence:


--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1098,6 +1098,11 @@ PinBufferForBlock(Relation rel,
 
Assert(blockNum != P_NEW);
 
+   if (!(smgr_persistence == RELPERSISTENCE_TEMP ||
+ smgr_persistence == RELPERSISTENCE_PERMANENT ||
+ smgr_persistence == RELPERSISTENCE_UNLOGGED))
+   elog(WARNING, "unexpected relpersistence %d", smgr_persistence);
+
if (smgr_persistence == RELPERSISTENCE_TEMP)
{
io_context = IOCONTEXT_NORMAL;


That still gets relpersistence==0 in various src/test/regress cases.  I think
the intent was to prevent that.  If not, please add a comment about when
relpersistence==0 is still allowed.

> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -549,7 +549,7 @@ read_stream_begin_relation(int flags,
>   {
>   stream->ios[i].op.rel = rel;
>   stream->ios[i].op.smgr = RelationGetSmgr(rel);
> - stream->ios[i].op.smgr_persistence = 0;
> + stream->ios[i].op.smgr_persistence = 
> rel->rd_rel->relpersistence;

Does the following comment in ReadBuffersOperation need an update?

/*
 * The following members should be set by the caller.  If only smgr is
 * provided without rel, then smgr_persistence can be set to override 
the
 * default assumption of RELPERSISTENCE_PERMANENT.
 */

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> +/*
> + * Helper struct for read stream object used in
> + * RelationCopyStorageUsingBuffer() function.
> + */
> +struct copy_storage_using_buffer_read_stream_private
> +{
> + BlockNumber blocknum;
> + int64   last_block;
> +};

Why is last_block an int64, not a BlockNumber?

> @@ -4667,19 +4698,31 @@ RelationCopyStorageUsingBuffer(RelFileLocator 
> srclocator,

>   /* Iterate over each block of the source relation file. */
>   for (blkno = 0; blkno < nblocks; blkno++)
>   {
>   CHECK_FOR_INTERRUPTS();
>  
>   /* Read block from source relation. */
> - srcBuf = ReadBufferWithoutRelcache(srclocator, forkNum, blkno,
> - 
>RBM_NORMAL, bstrategy_src,
> - 
>permanent);
> + srcBuf = read_stream_next_buffer(src_stream, NULL);
>   LockBuffer(srcBuf, BUFFER_LOCK_SHARE);

I think this should check for read_stream_next_buffer() returning
InvalidBuffer.  pg_prewarm doesn't, but the other callers do, and I think the
other callers are a better model.  LockBuffer() doesn't check the
InvalidBuffer case, so let's avoid the style of using a
read_stream_next_buffer() return value without checking.

Thanks,
nm




Re: speed up a logical replica setup

2024-07-11 Thread Euler Taveira
On Thu, Jul 11, 2024, at 2:00 PM, Alexander Lakhin wrote:
> Hello Amit and Hou-San,
> 
> 11.07.2024 13:21, Amit Kapila wrote:
> >
> > We don't wait for the xmin to catch up corresponding to this insert
> > and I don't know if there is a way to do that. So, we should move this
> > Insert to after the call to pg_sync_replication_slots(). It won't
> > impact the general test of pg_createsubscriber.
> >
> > Thanks to Hou-San for helping me in the analysis of this BF failure.
> 
> Thank you for investigating that issue!
> 
> May I ask you to look at another failure of the test occurred today [1]?
> 

Thanks for the report!

You are observing the same issue that Amit explained in [1]. The
pg_create_logical_replication_slot returns the EndRecPtr (see
slot->data.confirmed_flush in DecodingContextFindStartpoint()). EndRecPtr points
to the next record, which is a future position on an idle server. That's why
the recovery takes some time to finish: it is waiting for activity to
increase the LSN position. Since you modified LOG_SNAPSHOT_INTERVAL_MS to create
additional WAL records soon, the EndRecPtr position is reached rapidly and the
recovery ends quickly.

Hayato proposes a patch [2] to create an additional WAL record that has the same
effect as your little hack: increase the LSN position to allow the recovery to
finish soon. I don't like the solution although it seems simple to implement.
As Amit said if we know the ReadRecPtr, we could use it as consistent LSN. The
problem is that it is used by logical decoding but it is not exposed. [reading
the code...] When the logical replication slot is created, restart_lsn points to
the lastReplayedEndRecPtr (see ReplicationSlotReserveWal()) that is the last
record replayed. Since the replication slots aren't in use, we could use the
restart_lsn from the last replication slot as a consistent LSN.

I'm attaching a patch that implements it. It runs in 6s instead of 26s.


[1] 
https://www.postgresql.org/message-id/CAA4eK1%2Bp%2B7Ag6nqdFRdqowK1EmJ6bG-MtZQ_54dnFBi%3D_OO5RQ%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/OSBPR01MB25521B15BF950D2523BBE143F5D32%40OSBPR01MB2552.jpnprd01.prod.outlook.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/
From a3be65c9bea08d3c7ffd575342a879d37b28b9a8 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Thu, 11 Jul 2024 19:59:56 -0300
Subject: [PATCH] pg_createsubscriber: fix slow recovery

If the primary server is idle when you are running pg_createsubscriber,
it used to take some time during recovery. The reason is that it was
using the LSN returned by pg_create_logical_replication_slot as
recovery_target_lsn. This LSN points to the "end + 1" record, which might not
be available in the WAL yet; hence, the recovery routine waits until some
activity from auxiliary processes writes WAL records and the
recovery_target_lsn position is reached.

Instead, use restart_lsn from the last replication slot. It points to
the last WAL record that was replayed. Hence, the recovery finishes
soon.

create_logical_replication_slot() no longer needs to return an LSN, so change
its signature.

Discussion: https://www.postgresql.org/message-id/2377319.1719766794%40sss.pgh.pa.us
Discussion: https://www.postgresql.org/message-id/68de6498-0449-a113-dd03-e198dded0bac%40gmail.com
---
 src/bin/pg_basebackup/pg_createsubscriber.c | 56 -
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 21dd50f8089..5ef3f751a5b 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -86,8 +86,8 @@ static void setup_recovery(const struct LogicalRepInfo *dbinfo, const char *data
 static void drop_primary_replication_slot(struct LogicalRepInfo *dbinfo,
 		  const char *slotname);
 static void drop_failover_replication_slots(struct LogicalRepInfo *dbinfo);
-static char *create_logical_replication_slot(PGconn *conn,
-			 struct LogicalRepInfo *dbinfo);
+static bool create_logical_replication_slot(PGconn *conn,
+			struct LogicalRepInfo *dbinfo);
 static void drop_replication_slot(PGconn *conn, struct LogicalRepInfo *dbinfo,
   const char *slot_name);
 static void pg_ctl_status(const char *pg_ctl_cmd, int rc);
@@ -769,15 +769,47 @@ setup_publisher(struct LogicalRepInfo *dbinfo)
 		create_publication(conn, &dbinfo[i]);
 
 		/* Create replication slot on publisher */
-		if (lsn)
-			pg_free(lsn);
-		lsn = create_logical_replication_slot(conn, &dbinfo[i]);
-		if (lsn != NULL || dry_run)
+		if (create_logical_replication_slot(conn, &dbinfo[i]) || dry_run)
 			pg_log_info("create replication slot \"%s\" on publisher",
 		dbinfo[i].replslotname);
 		else
 			exit(1);
 
+		/*
+		 * Get restart_lsn from the last replication slot. It points to the
+		 * last record replayed. The LSN returned by
+		 * pg_create_logical_replication_slot() should not be used because it
+		 * points to a future 

Re: Eager aggregation, take 3

2024-07-11 Thread Paul George
Hey Richard,

Looking more closely at this example

>select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by t2.a
having t2.a is null;

I wonder if the inability to exploit eager aggregation is more based on the
fact that COUNT(*) cannot be decomposed into an aggregation of PARTIAL
COUNT(*)s (apologies if my terminology is off/made up...I'm new to the
codebase). In other words, is it the case that a given aggregate function
already has built-in protection against the error case you correctly
pointed out?

To highlight this, in the simple example below we don't see aggregate
pushdown even with an INNER JOIN when the agg function is COUNT(*) but we
do when it's COUNT(t2.*):

-- same setup
drop table if exists t;
create table t(a int, b int, c int);
insert into t select i % 100, i % 10, i from generate_series(1, 1000) i;
analyze t;

-- query 1: COUNT(*) --> no pushdown

set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(*) from t t1 join t t2 on
t1.a=t2.a group by t1.a;

QUERY PLAN
---
 HashAggregate
   Output: t1.a, count(*)
   Group Key: t1.a
   ->  Hash Join
 Output: t1.a
 Hash Cond: (t1.a = t2.a)
 ->  Seq Scan on public.t t1
   Output: t1.a, t1.b, t1.c
 ->  Hash
   Output: t2.a
   ->  Seq Scan on public.t t2
 Output: t2.a
(12 rows)


-- query 2: COUNT(t2.*) --> agg pushdown

set enable_eager_aggregate=on;
explain (verbose, costs off) select t1.a, count(t2.*) from t t1 join t t2
on t1.a=t2.a group by t1.a;

  QUERY PLAN
---
 Finalize HashAggregate
   Output: t1.a, count(t2.*)
   Group Key: t1.a
   ->  Hash Join
 Output: t1.a, (PARTIAL count(t2.*))
 Hash Cond: (t1.a = t2.a)
 ->  Seq Scan on public.t t1
   Output: t1.a, t1.b, t1.c
 ->  Hash
   Output: t2.a, (PARTIAL count(t2.*))
   ->  Partial HashAggregate
 Output: t2.a, PARTIAL count(t2.*)
 Group Key: t2.a
 ->  Seq Scan on public.t t2
   Output: t2.*, t2.a
(15 rows)

...while it might be true that COUNT(*) ... INNER JOIN should allow eager
agg pushdown (I haven't thought deeply about it, TBH), I did find this
result pretty interesting.


-Paul

On Wed, Jul 10, 2024 at 1:27 AM Richard Guo  wrote:

> On Sun, Jul 7, 2024 at 10:45 AM Paul George 
> wrote:
> > Thanks for reviving this patch and for all of your work on it! Eager
> aggregation pushdown will be beneficial for my work and I'm hoping to see
> it land.
>
> Thanks for looking at this patch!
>
> > The output of both the original query and this one match (and the plans
> with eager aggregation and the subquery are nearly identical if you restore
> the LEFT JOIN to a JOIN). I admittedly may be missing a subtlety, but does
> this mean that there are conditions under which eager aggregation can be
> pushed down to the nullable side?
>
> I think it's a very risky thing to push a partial aggregation down to
> the nullable side of an outer join, because the NULL-extended rows
> produced by the outer join would not be available when we perform the
> partial aggregation, while with a non-eager-aggregation plan these
> rows are available for the top-level aggregation.  This may put the
> rows into groups in a different way than expected, or get wrong values
> from the aggregate functions.  I've managed to compose an example:
>
> create table t (a int, b int);
> insert into t select 1, 1;
>
> select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
> t2.a having t2.a is null;
>  a | count
> ---+---
>| 1
> (1 row)
>
> This is the expected result, because after the outer join we have got
> a NULL-extended row.
>
> But if we somehow push down the partial aggregation to the nullable
> side of this outer join, we would get a wrong result.
>
> explain (costs off)
> select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
> t2.a having t2.a is null;
> QUERY PLAN
> ---
>  Finalize HashAggregate
>Group Key: t2.a
>->  Nested Loop Left Join
>  Filter: (t2.a IS NULL)
>  ->  Seq Scan on t t1
>  ->  Materialize
>->  Partial HashAggregate
>  Group Key: t2.a
>  ->  Seq Scan on t t2
>Filter: (b > 1)
> (10 rows)
>
> select t2.a, count(*) from t t1 left join t t2 on t2.b > 1 group by
> t2.a having t2.a is null;
>  a | count
> ---+---
>| 0
> (1 row)
>
> I believe there are cases where pushing a partial aggregation down to
> the nullable side of an outer join can be safe, but I doubt that there
> is an easy way to identify these cases and do the push-down for them.
> So for now I think we'd better refrain from doing 

Re: Optimize WindowAgg's use of tuplestores

2024-07-11 Thread Ranier Vilela
On Thu, Jul 11, 2024 at 09:09, David Rowley 
wrote:

> On Wed, 10 Jul 2024 at 02:42, Ashutosh Bapat
>  wrote:
> > Observations
> > 1. The numbers corresponding to 10 and 100 partitions are higher when
> > patched. That might be just noise. I don't see any reason why it would
> > impact negatively when there are a small number of partitions. The
> > lower partition cases also have a higher number of rows per partition,
> > so is the difference between MemoryContextDelete() vs
> > MemoryContextReset() making any difference here. May be worth
> > verifying those cases carefully. Otherwise upto 1000 partitions, it
> > doesn't show any differences.
>
> I think this might just be noise as a result of rearranging code. In
> terms of C code, I don't see any reason for it to be slower.  If you
> look at GenerationDelete() (as what is getting called from
> MemoryContextDelete()), it just calls GenerationReset(). So resetting
> is going to always be less work than deleting the context, especially
> given we don't need to create the context again when we reset it.
>
> I wrote the attached script to see if I can also see the slowdown and
> I do see the patched code come out slightly slower (within noise
> levels) in lower partition counts.
>
> To get my compiler to produce code in a more optimal order for the
> common case, I added unlikely() to the "if (winstate->all_first)"
> condition.  This is only evaluated on the first time after a rescan,
> so putting that code at the end of the function makes more sense.  The
> attached v2 patch has it this way.

Not that it's going to make any difference,
but since you're going to touch this code, why not?

Function *begin_partition*:
1. You can reduce the scope for variable *outerPlan*,
it is not needed anywhere else.
+ /*
+ * If this is the very first partition, we need to fetch the first input
+ * row to store in first_part_slot.
+ */
+ if (TupIsNull(winstate->first_part_slot))
+ {
+ PlanState *outerPlan = outerPlanState(winstate);
+ TupleTableSlot *outerslot = ExecProcNode(outerPlan);

2. There is only one reference to the variable numfuncs.
You can reduce its scope.

+ /* reset mark and seek positions for each real window function */
+ for (int i = 0; i < winstate->numfuncs; i++)

Function *prepare_tuplestore*:
1. There is only one reference to the variable numfuncs.
You can reduce its scope.
  /* create mark and read pointers for each real window function */
- for (i = 0; i < numfuncs; i++)
+ for (int i = 0; i < winstate->numfuncs; i++)

2. You can safely initialize the default value of
*winstate->grouptail_ptr*
in the *else* part.

  if ((frameOptions & (FRAMEOPTION_EXCLUDE_GROUP |
  FRAMEOPTION_EXCLUDE_TIES)) &&
  node->ordNumCols != 0)
  winstate->grouptail_ptr =
  tuplestore_alloc_read_pointer(winstate->buffer, 0);
  }
+ else
+ winstate->grouptail_ptr = -1;

best regards,
Ranier Vilela


Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-11 Thread Tom Lane
Nathan Bossart  writes:
> On Wed, Jul 10, 2024 at 01:41:41PM -0500, Nathan Bossart wrote:
>> On Wed, Jul 10, 2024 at 11:11:13AM -0400, Tom Lane wrote:
>>> We went through a ton of permutations of that kind of
>>> idea years ago, when it first became totally clear that cross-checks
>>> between GUCs do not work nicely if implemented in check_hooks.

>> Do you remember the general timeframe or any of the GUCs involved?  I spent
>> some time searching through the commit log and mailing lists, but I've thus
>> far only found allusions to past bad experiences with such hooks.

> Could it be the effective_cache_size work from 2013-2014?

Yeah, that's what I was remembering.  It looks like my memory was
slightly faulty, in that what ee1e5662d tried to do was make the
default value of one GUC depend on the actual value of another one,
not implement a consistency check per se.  But the underlying problem
is the same: a check_hook can't assume it is seeing the appropriate
value of some other GUC, since a change of that one may be pending.

regards, tom lane




Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-07-11 Thread Peter Eisentraut

On 07.05.24 11:36, Daniel Gustafsson wrote:

Yeah, that depends on which versions you expect your application to
work on.  Still it seems to me that there's value in mentioning that
if your application does not care about anything older than OpenSSL
1.1.0, like PG 18 assuming that this patch is merged, then these calls
are pointless for HEAD.  The routine definitions would be around only
for the .so compatibility.


Fair enough.  I've taken a stab at documenting that the functions are
deprecated, while at the same time documenting when and how they can be used
(and be useful).  The attached also removes one additional comment in the
testcode which is now obsolete (since removing 1.0.1 support really), and fixes
the spurious whitespace you detected upthread.


The 0001 patch removes the functions pgtls_init_library() and 
pgtls_init() but keeps the declarations in libpq-int.h.  This should be 
fixed.






Re: Add support to TLS 1.3 cipher suites and curves lists

2024-07-11 Thread Peter Eisentraut

On 03.07.24 17:20, Daniel Gustafsson wrote:

After fiddling a bit with the code and documentation I came up with the
attached version which also makes the testsuite use the list syntax in order to
test it.  It's essentially just polish and adding comments with the functional
changes that a) it parses the entire list of curves so all errors can be
reported instead of giving up at the first error; b) leaving the cipher suite
GUC blank will set the suites to the OpenSSL default value.


It would be worth checking the discussion at 
 
about strtok()/strtok_r() issues.  First, for list parsing, it sometimes 
gives the wrong semantics, which I think might apply here.  Maybe it's 
worth comparing this with the semantics that OpenSSL provides natively. 
And second, strtok_r() is not available on Windows without the 
workaround provided in that thread.


I'm doubtful that it's worth replicating all this list parsing logic 
instead of just letting OpenSSL do it.  This is a very marginal feature 
after all.






Re: CFbot failed on Windows platform

2024-07-11 Thread Tatsuo Ishii
> On 2024-07-11 Th 5:34 AM, Tatsuo Ishii wrote:
 The differences are that the result has an extra space at the end of
 line.  I am not familiar with Windows and have no idea why this could
 happen (I haven't changed the patch set since May 24. The last
 succeeded test was on July 9 14:58:44 (I am not sure about time zone).
 Also I see exactly the same test failures in some other tests (for
 example http://cfbot.cputube.org/highlights/all.html#4337)

 Any idea?
>>> I think It is related to the '628c1d1f2c' commit. This commit changed
>>> the output of the regress test in Windows.
>> Yeah, that seems to explain it. I see a few buildfarm Windows animals
>> complaining too.
>>
> 
> I have partially reverted that patch. Thanks for the report.

Thank you for taking care of this!
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-11 Thread Nathan Bossart
On Wed, Jul 10, 2024 at 01:41:41PM -0500, Nathan Bossart wrote:
> On Wed, Jul 10, 2024 at 11:11:13AM -0400, Tom Lane wrote:
>> We went through a ton of permutations of that kind of
>> idea years ago, when it first became totally clear that cross-checks
>> between GUCs do not work nicely if implemented in check_hooks.
>> (You can find all the things we tried in the commit log, although
>> I don't recall exactly when.)
> 
> Do you remember the general timeframe or any of the GUCs involved?  I spent
> some time searching through the commit log and mailing lists, but I've thus
> far only found allusions to past bad experiences with such hooks.

Could it be the effective_cache_size work from 2013-2014?


https://www.postgresql.org/message-id/flat/CAHyXU0weDECnab1pypNum-dWGwjso_XMTY8-NvvzRphzM2Hv5A%40mail.gmail.com

https://www.postgresql.org/message-id/flat/CAMkU%3D1zTMNZsnUV6L7aMvfJZfzjKbzAtuML3N35wyYaia9MJAw%40mail.gmail.com

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ee1e5662d8d8330726eaef7d3110cb7add24d058

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2850896961994aa0993b9e2ed79a209750181b8a

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=af930e606a3217db3909029c6c3f8d003ba70920

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a16d421ca4fc639929bc964b2585e8382cf16e33;hp=08c8e8962f56c23c6799178d52d3b31350a0708f

-- 
nathan




Re: CREATE INDEX CONCURRENTLY on partitioned index

2024-07-11 Thread Ilya Gladyshev
It is broken in master, I just didn’t want to create a separate thread, but it 
can be fixed independently. As I remember, the problem is that progress is 
tracked for each table in the hierarchy as if the table is processed 
separately, without ever setting partitions_total and partitions_done counters.

> On Jul 11, 2024, at 13:31, Justin Pryzby  wrote:
> 
> On Sat, Jun 15, 2024 at 07:56:38PM +0100, Ilya Gladyshev wrote:
>> In addition, I noticed that progress tracking is once again broken for
>> partitioned tables, while looking at REINDEX implementation, attaching the
>> second patch to fix it.
> 
> Thanks for the fixes, I started reviewing them but need some more time
> to digest.
> 
> Do you mean that progress reporting is broken in master, for REINDEX, or
> just with this patch ?
> 
> -- 
> Justin





Converting tab-complete.c's else-if chain to a switch

2024-07-11 Thread Tom Lane
Per discussion elsewhere [1], I've been looking at $SUBJECT.
I have a very crude Perl hack (too ugly to show yet ;-)) that
produces output along the lines of


Existing code:

/* CREATE STATISTICS  */
else if (Matches("CREATE", "STATISTICS", MatchAny))
COMPLETE_WITH("(", "ON");
else if (Matches("CREATE", "STATISTICS", MatchAny, "("))
COMPLETE_WITH("ndistinct", "dependencies", "mcv");
else if (Matches("CREATE", "STATISTICS", MatchAny, "(*)"))
COMPLETE_WITH("ON");
else if (HeadMatches("CREATE", "STATISTICS", MatchAny) &&
 TailMatches("FROM"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables);


Generated tables:

enum TCPatternID
{
...
M_CREATE_STATISTICS_ANY,
M_CREATE_STATISTICS_ANY_LPAREN,
M_CREATE_STATISTICS_ANY_PARENSTARPAREN,
HM_CREATE_STATISTICS_ANY,
...
};

enum TCPatternKind
{
Match,
MatchCS,
HeadMatch,
HeadMatchCS,
TailMatch,
TailMatchCS,
};

typedef struct
{
enum TCPatternID id;
enum TCPatternKind kind;
int nwords;
const char *const *words;
} TCPattern;

#define TCPAT(id, kind, ...) \
{ (id), (kind), VA_ARGS_NARGS(__VA_ARGS__), \
  (const char * const []) { __VA_ARGS__ } }

static const TCPattern tcpatterns[] =
{
...
TCPAT(M_CREATE_STATISTICS_ANY,
  Match, "CREATE", "STATISTICS", MatchAny),
TCPAT(M_CREATE_STATISTICS_ANY_LPAREN,
  Match, "CREATE", "STATISTICS", MatchAny, "("),
TCPAT(M_CREATE_STATISTICS_ANY_PARENSTARPAREN,
  Match, "CREATE", "STATISTICS", MatchAny, "(*)"),
TCPAT(HM_CREATE_STATISTICS_ANY,
  HeadMatch, "CREATE", "STATISTICS", MatchAny),
...
};


Converted code:

/* CREATE STATISTICS  */
case M_CREATE_STATISTICS_ANY:
COMPLETE_WITH("(", "ON");
break;
case M_CREATE_STATISTICS_ANY_LPAREN:
COMPLETE_WITH("ndistinct", "dependencies", "mcv");
break;
case M_CREATE_STATISTICS_ANY_PARENSTARPAREN:
COMPLETE_WITH("ON");
break;
case HM_CREATE_STATISTICS_ANY:
if (TailMatches("FROM"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables);
break;


The idea (I've not actually written this part yet) is that an
outer loop would iterate through the table entries and invoke
the appropriate switch case for any successful match.  As soon
as the switch code sets "matches" non-NULL (it might not, as in
the last case in the example), we can exit.

While this clearly can be made to work, it doesn't seem like the
result will be very maintainable.  You have to invent a single-use
enum label, and the actual definition of the primary match pattern
is far away from the code using it, and we have two fundamentally
different ways of writing the same pattern test (since we'll still
need some instances of direct calls to TailMatches and friends,
as in the last example case).

What I'm thinking about doing instead of pursuing this exact
implementation idea is that we should create a preprocessor
to deal with building the table.  I'm envisioning that the
new source code will look like


/* CREATE STATISTICS  */
case Matches("CREATE", "STATISTICS", MatchAny):
COMPLETE_WITH("(", "ON");
break;
case Matches("CREATE", "STATISTICS", MatchAny, "("):
COMPLETE_WITH("ndistinct", "dependencies", "mcv");
break;
case Matches("CREATE", "STATISTICS", MatchAny, "(*)"):
COMPLETE_WITH("ON");
break;
case HeadMatches("CREATE", "STATISTICS", MatchAny):
if (TailMatches("FROM"))
COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_tables);
break;


Of course this isn't legal C, since the case labels are not
constant expressions.  The preprocessing script would look
for case labels that look like this, generate the appropriate
table entries, and replace the case labels with mechanically-
generated ID codes that don't need to be particularly human
readable.  On the downside, YA preprocessor and mini-language
isn't much fun; but at least this is a lot closer to real C
than some of the other things we've invented.

(Further down the line, we can look into improvements such as
avoiding duplicate string comparisons; but that can be done behind
the scenes, without messing with this source-code notation.)

Does anyone particularly hate this plan, or have a better idea?

BTW, we have quite a few instances of code like the aforementioned

else if (HeadMatches("CREATE", "STATISTICS", MatchAny) &&
 TailMatches("FROM"))

I'm thinking we should invent a Matches() option "MatchAnyN",
which could appear at most once and would represent an automatic
match to zero or more words appearing between the head part and
the tail part.  Then this could be transformed to

else if (Matches("CREATE", "STATISTICS", MatchAny, MatchAnyN, "FROM"))

the advantage being that more of the pattern can be embedded in the
table and we need fewer bits of ad-hoc logic.  Maybe this'd be worth
doing 

Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-07-11 Thread Robert Haas
On Wed, Jul 10, 2024 at 9:16 PM David Rowley  wrote:
> Melih and I talked about this in a meeting yesterday evening. I think
> I'm about on the fence about having the IDs in leaf-to-root or
> root-to-leaf.  My main concern about which order is chosen is around
> how easy it is to write hierarchical queries. I think I'd feel better
> about having it in root-to-leaf order if "level" was 1-based rather
> than 0-based. That would allow querying CacheMemoryContext and all of
> its descendants with:
>
> WITH c AS (SELECT * FROM pg_backend_memory_contexts)
> SELECT c1.*
> FROM c c1, c c2
> WHERE c2.name = 'CacheMemoryContext'
> AND c1.path[c2.level] = c2.path[c2.level];

I don't object to making it 1-based.

> Ideally, no CTE would be needed here, but unfortunately, there's no
> way to know the CacheMemoryContext's ID beforehand.  We could make the
> ID more stable if we did a breadth-first traversal of the context.
> i.e., assign IDs in level order.  This would stop TopMemoryContext's
> 2nd child getting a different ID if its first child became a parent
> itself.

Do we ever have contexts with the same name at the same level? Could
we just make the path an array of strings, so that you could then say
something like this...

SELECT * FROM pg_backend_memory_contexts where path[2] = 'CacheMemoryContext'

...and get all the things with that in the path?

> select * from pg_backend_memory_contexts;
> -- Observe that CacheMemoryContext has ID=22 and level=2. Get the
> total of that and all of its descendants.
> select sum(total_bytes) from pg_backend_memory_contexts where path[2] = 22;
> -- or just it and direct children
> select sum(total_bytes) from pg_backend_memory_contexts where path[2]
> = 22 and level <= 3;

I'm doubtful about this because nothing prevents the set of memory
contexts from changing between one query and the next. We should try
to make it so that it's easy to get what you want in a single query.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Restart pg_usleep when interrupted

2024-07-11 Thread Nathan Bossart
On Thu, Jul 11, 2024 at 01:10:25PM -0500, Sami Imseih wrote:
>> I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
>> the bottom of the while loop to avoid needing this extra check.  
> 
> Can you elaborate further? I am not sure how this will work since delay is
> a timespec and elapsed time is an instr_time.
> 
> Also, in every iteration of the loop, the delay must be set to the
> remaining time. The purpose of the elapsed_time is to make sure that we
> don't surpass the requested time delay, as an additional safeguard.

I'm imagining something like this:

struct timespec delay;
TimestampTz end_time;

end_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(), msec);

do
{
long        secs;
int         microsecs;

TimestampDifference(GetCurrentTimestamp(), end_time,
                    &secs, &microsecs);

delay.tv_sec = secs;
delay.tv_nsec = microsecs * 1000;

} while (nanosleep(&delay, NULL) == -1 && errno == EINTR);

-- 
nathan




Re: Restart pg_usleep when interrupted

2024-07-11 Thread Sami Imseih


> 
> I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
> the bottom of the while loop to avoid needing this extra check.  

Can you elaborate further? I am not sure how this will work since delay is a
timespec and elapsed time is an instr_time.

Also, in every iteration of the loop, the delay must be set to the remaining
time. The purpose of the elapsed_time is to make sure that we don't surpass
the requested time delay, as an additional safeguard.

> Also, I
> think we need some commentary about why we want to retry after an interrupt
> in this case.

I will elaborate further in the comments for the next revision.


Regards,

Sami 





Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-11 Thread Robert Haas
On Thu, Jul 11, 2024 at 6:51 AM Fujii Masao  wrote:
> So far, I haven't found any other issues with the patch.

Here is a new version that removes the hunks you highlighted and also
adds a test case.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v4-0001-Do-not-summarize-WAL-if-generated-with-wal_level-.patch
Description: Binary data


Re: Improve the connection failure error messages

2024-07-11 Thread Tom Lane
Nisha Moond  writes:
> Attached the patch v6 with suggested improvements.
> - Removed unnecessary translator comments.
> - Added appropriate identifier names where missing.

I think this is generally OK, although one could argue that it's
violating our message style guideline that primary error messages
should be short [1].  The text itself isn't that bad, but once you
tack on a libpq connection failure message it's hard to claim that
the result "fits on one line".

Another way we could address this that'd reduce that problem is to
leave the primary messages as-is and add an errdetail or errcontext
message to show what's throwing the error.  However, I'm not convinced
that's better.  The main argument against it is that detail/context
lines can get lost, eg if you're running the server in terse logging
mode.  

On balance I think it's OK, so I pushed it.  I did take out a couple
of uses of "logical replication" that seemed unnecessary and were
making the length problem worse.

regards, tom lane

[1] 
https://www.postgresql.org/docs/devel/error-style-guide.html#ERROR-STYLE-GUIDE-WHAT-GOES-WHERE




Re: speed up a logical replica setup

2024-07-11 Thread Alexander Lakhin

Hello Amit and Hou-San,

11.07.2024 13:21, Amit Kapila wrote:


We don't wait for the xmin to catch up corresponding to this insert
and I don't know if there is a way to do that. So, we should move this
Insert to after the call to pg_sync_replication_slots(). It won't
impact the general test of pg_createsubscriber.

Thanks to Hou-San for helping me in the analysis of this BF failure.


Thank you for investigating that issue!

May I ask you to look at another failure of the test occurred today [1]?
The failure log contains:
recovery_target_lsn = '0/30098D0'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 07:40:10.837 UTC [2948531][postmaster][:0] LOG:  received fast 
shutdown request

Though what I'm seeing after a successful run is:
recovery_target_lsn = '0/3009860'
pg_createsubscriber: starting the subscriber
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
2024-07-11 15:19:40.733 UTC [207517] 040_pg_createsubscriber.pl LOG:  
statement: SELECT pg_catalog.pg_is_in_recovery()
2024-07-11 15:19:41.635 UTC [207514] LOG:  recovery stopping after WAL location (LSN) 
"0/3009860"
2024-07-11 15:19:41.635 UTC [207514] LOG:  redo done at 0/3009860 system usage: CPU: user: 0.00 s, system: 0.00 s, 
elapsed: 21.00 s


I've managed to reproduce the failure locally. With the bgwriter mod:
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 150

and wal_debug=on, when running 5 test instances in parallel, I get the
failure with the following log:
recovery_target_lsn = '0/3009A20'
pg_createsubscriber: starting the subscriber

2024-07-11 14:40:04.551 UTC [205589:72][startup][33/0:0] LOG:  REDO @ 0/30099E8; LSN 0/3009A20: prev 0/30099B0; xid 0; 
len 24 - Standby/RUNNING_XACTS: nextXid 747 latestCompletedXid 746 oldestRunningXid 747

# ^^^ the last REDO record in the log
...
pg_createsubscriber: server was started
pg_createsubscriber: waiting for the target server to reach the consistent state
...
pg_createsubscriber: server was stopped
pg_createsubscriber: error: recovery timed out
...
[14:43:06.011](181.800s) not ok 30 - run pg_createsubscriber on node S
[14:43:06.012](0.000s)
[14:43:06.012](0.000s) #   Failed test 'run pg_createsubscriber on node S'
#   at t/040_pg_createsubscriber.pl line 356.

$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_s_data/pgdata/pg_wal/ 
00010003 00010003 | tail -2
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747


Whilst
$ pg_waldump -p src/bin/pg_basebackup_1/tmp_check/t_040_pg_createsubscriber_node_p_data/pgdata/pg_wal/ 
00010003 00010003 | tail
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099B0, prev 0/03009948, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/030099E8, prev 0/030099B0, desc: RUNNING_XACTS 
nextXid 747 latestCompletedXid 746 oldestRunningXid 747
rmgr: Heap2   len (rec/tot): 60/    60, tx:    747, lsn: 0/03009A20, prev 0/030099E8, desc: NEW_CID rel: 
1663/16384/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    747, lsn: 0/03009A60, prev 0/03009A20, desc: DELETE xmax: 
747, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16384/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    747, lsn: 0/03009A98, prev 0/03009A60, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    747, lsn: 0/03009AE8, prev 0/03009A98, desc: COMMIT 
2024-07-11 14:43:05.994561 UTC; relcache init file inval dbid 16384 tsid 1663; inval msgs: catcache 49 catcache 46 
relcache 0
rmgr: Heap2   len (rec/tot): 60/    60, tx:    748, lsn: 0/03009B50, prev 0/03009AE8, desc: NEW_CID rel: 
1663/16385/6104, tid: 0/1, cmin: 4294967295, cmax: 0, combo: 4294967295
rmgr: Heap    len (rec/tot): 54/    54, tx:    748, lsn: 0/03009B90, prev 0/03009B50, desc: DELETE xmax: 
748, off: 1, infobits: [KEYS_UPDATED], flags: 0x00, blkref #0: rel 1663/16385/6104 blk 0
rmgr: Transaction len (rec/tot): 78/    78, tx:    748, lsn: 0/03009BC8, prev 0/03009B90, desc: INVALIDATION ; 
inval msgs: catcache 49 catcache 46 relcache 0
rmgr: Transaction len (rec/tot): 98/    98, tx:    748, lsn: 0/03009C18, prev 0/03009BC8, desc: COMMIT 
2024-07-11 

Re: POC, WIP: OR-clause support for indexes

2024-07-11 Thread Alena Rybakina
Sorry for repeating, but I have noticed that this message displays 
incorrectly and just in case I'll duplicate it.


On 11.07.2024 19:17, Alena Rybakina wrote:
The error was caused by the specifics of storing the "OR" clauses in the
RestrictInfo structure. Scanning the orclauses list of the RestrictInfo
variable, we could face not only items with the RestrictInfo type, but also
the BoolExpr type.


For example, when we have both or clauses and "AND" clauses together, 
like x = 1 and (y =1 or y=2 or y=3 and z = 1). The structure looks like:


RestrictInfo->orclauses = [RestrictInfo [x=1],
RestrictInfo->orclauses = [RestrictInfo[y=1],
                    RestrictInfo [y=2],
                    BoolExpr = [Restrictinfo [y=3], RestrictInfo [z=1]
                   ]
                                     ]

It's working fine now.

The error was caused by the specifics of storing the "OR" clauses in the 
RestrictInfo structure. When viewing the list of OR clauses, we could 
encounter not only the RestrictInfo type, but also the BoolExpr type. 
It's working fine now.


For example, when we have both or clauses and "AND" clauses together, 
like x = 1 and (y =1 or y=2 or y=3 and z = 1). The structure looks like:


RestrictInfo->orclauses = [RestrictInfo [x=1],
RestrictInfo->orclauses = [RestrictInfo[y=1],
RestrictInfo [y=2],
BoolExpr = [Restrictinfo [y=3], RestrictInfo [z=1]
]
 ]

It's working fine now.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


Re: CFbot failed on Windows platform

2024-07-11 Thread Andrew Dunstan



On 2024-07-11 Th 9:50 AM, Tom Lane wrote:

Tatsuo Ishii  writes:

I think it is related to the '628c1d1f2c' commit. This commit changed
the output of the regress test on Windows.

Yeah, it seems that explains it. I see a few buildfarm Windows animals
complaining too.

I think that the contents of
src/test/regress/expected/collate.windows.win1252.out are actually
wrong, and we'd not noticed because it was only checked with diff -w.
psql does put an extra trailing space in some lines of table output,
but that space isn't there in collate.windows.win1252.out.





Yeah, makes sense. I will produce an updated output file and then 
re-enable --strip-trailing-cr (after a full test run).



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: POC, WIP: OR-clause support for indexes

2024-07-11 Thread Alena Rybakina

Hi, again!

I have finished the patch and processed almost all of your suggestions (from
[0], [1], [2]). It remains only to add tests where the conversion should
work, but I will add this in the next version.


[0] 
https://www.postgresql.org/message-id/3381819.e9J7NaK4W3%40thinkpad-pgpro


[1] 
https://www.postgresql.org/message-id/9736220.CDJkKcVGEf%40thinkpad-pgpro


[2] 
https://www.postgresql.org/message-id/2193851.QkHrqEjB74%40thinkpad-pgpro


On 09.07.2024 04:57, Alena Rybakina wrote:


Hi! Thank you for your review! Sorry for the delay in responding.

I rewrote the patch as you requested, but now I'm faced with the problem of
processing the elements of the or_entries list. For some reason, the pointer
to the list is cleared and I couldn't find the place where it happened.
Maybe I'm missing something simple in view of the heavy workload right now,
but maybe you'll see a problem? I have displayed part of the stack below.


#5  0x5b0f6d9f6a6a in ExceptionalCondition (conditionName=0x5b0f6dbb74f7 "IsPointerList(list)", fileName=0x5b0f6dbb7418 "list.c", lineNumber=341) at assert.c:66
#6  0x5b0f6d5dc3ba in lappend (list=0x5b0f6eec5ca0, datum=0x5b0f6eec0d90) at list.c:341
#7  0x5b0f6d69230c in transform_or_to_any (root=0x5b0f6eeb13c8, orlist=0x5b0f6eec57c0) at initsplan.c:2818
#8  0x5b0f6d692958 in add_base_clause_to_rel (root=0x5b0f6eeb13c8, relid=1, restrictinfo=0x5b0f6eec5990) at initsplan.c:2982
#9  0x5b0f6d692e5f in distribute_restrictinfo_to_rels (root=0x5b0f6eeb13c8, restrictinfo=0x5b0f6eec5990) at initsplan.c:3175
#10 0x5b0f6d691bf2 in distribute_qual_to_rels (root=0x5b0f6eeb13c8, clause=0x5b0f6eec0fc0, jtitem=0x5b0f6eec4330, sjinfo=0x0, security_level=0, qualscope=0x5b0f6eec4730, ojscope=0x0, outerjoin_nonnullable=0x0, incompatible_relids=0x0, allow_equivalence=true, has_clone=false, is_clone=false, postponed_oj_qual_list=0x0) at initsplan.c:2576
#11 0x5b0f6d69146f in distribute_quals_to_rels (root=0x5b0f6eeb13c8, clauses=0x5b0f6eec0bb0, jtitem=0x5b0f6eec4330, sjinfo=0x0, security_level=0, qualscope=0x5b0f6eec4730, ojscope=0x0, outerjoin_nonnullable=0x0, incompatible_relids=0x0, allow_equivalence=true, has_clone=false, is_clone=false, postponed_oj_qual_list=0x0) at initsplan.c:2144


This is still the first iteration of the fixes you have proposed, so I have
attached the patch in diff format. I rewrote it, as you suggested in the
first letter [0]. I created a separate function that tries to form an
OrClauseGroup node, but if it fails in this, it returns false; otherwise it
processes the generated element according to what it found - either adds it
to the list as new, or adds a constant to an existing one.

I also divided the one general list of elements suitable for conversion and
unsuitable ones into two different lists: appropriate_entries and
or_entries. Now we are only looking in the list of suitable elements to form
the ANY expr.

This helps us to get rid of the repetitions in the code you mentioned.
Please write if this is not the logic that you have seen before.


[0] 
https://www.postgresql.org/message-id/3381819.e9J7NaK4W3%40thinkpad-pgpro


The error was caused by the specifics of storing the "OR" clauses in the
RestrictInfo structure. Scanning the orclauses list of the RestrictInfo
variable, we could face not only items with the RestrictInfo type, but also
the BoolExpr type.


For example, when we have both or clauses and "AND" clauses together, 
like x = 1 and (y =1 or y=2 or y=3 and z = 1). The structure looks like:


RestrictInfo->orclauses = [RestrictInfo [x=1],
RestrictInfo->orclauses = [RestrictInfo[y=1],
                    RestrictInfo [y=2],
                    BoolExpr = [Restrictinfo [y=3], RestrictInfo [z=1]
                   ]
                                     ]

It's working fine now.

--
Regards,
Alena Rybakina
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company
From 26eca98229749b20ad0c82a9fa55f4f37fd34d29 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Thu, 11 Jul 2024 19:01:10 +0300
Subject: [PATCH 1/2] Transform OR clauses to ANY expression

Replace (expr op C1) OR (expr op C2) ... with expr op ANY(ARRAY[C1, C2, ...])
during adding restrictinfo's to the base relation.

Here Cn is a n-th constant or parameters expression, 'expr' is non-constant
expression, 'op' is an operator which returns boolean result and has a commuter
(for the case of reverse order of constant and non-constant parts of the
expression, like 'Cn op expr').

Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru
Author: Alena Rybakina 
Author: Andrey Lepikhov 
Reviewed-by: Peter Geoghegan 
Reviewed-by: Ranier Vilela 
Reviewed-by: Alexander Korotkov 
Reviewed-by: Robert Haas 
Reviewed-by: Jian He 
Reviewed-by: Tom Lane 
Reviewed-by: Nikolay Shaplov 
---
 src/backend/optimizer/plan/initsplan.c| 335 ++
 src/include/nodes/pathnodes.h |  31 ++
 src/test/regress/expected/create_index.out|  32 +-
 src/test/regress/expected/partition_prune.out |  46 +--
 src/test/regress/expected/stats_ext.out   |  12 +-
 src/test/regress/expected/tidscan.out |   6 +-
 

Re: Add a GUC check hook to ensure summarize_wal cannot be enabled when wal_level is minimal

2024-07-11 Thread Robert Haas
On Thu, Jul 11, 2024 at 6:51 AM Fujii Masao  wrote:
> It looks like the fast_forward field in WalSummarizerData is no longer 
> necessary.
>
> So far, I haven't found any other issues with the patch.

Thanks for reviewing. Regarding fast_forward, I think I had the idea
in mind that perhaps it should be exposed by
pg_get_wal_summarizer_state(), but I didn't actually implement that.
Thinking about it again, I think maybe it's fine to just remove it
from the shared memory state, as this should be a rare scenario in
practice. What is your opinion?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Logging which local address was connected to in log_line_prefix

2024-07-11 Thread Greg Sabino Mullane
Thanks for the review. Please find attached a new version with proper tabs
and indenting.

Cheers,
Greg


0002-Add-local-address-to-log_line_prefix.patch
Description: Binary data


Re: MIN/MAX functions for a record

2024-07-11 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Jul 08, 2024 at 12:20:30PM +0300, Aleksander Alekseev wrote:
> -and arrays of any of these types.
> +and also arrays and records of any of these types.

> This update of the docs is incorrect, no?  Records could include much
> more types than the ones currently supported for min()/max().

Yeah, actually the contained data types could be anything with
btree sort support.  This is true for arrays too, so the text
was wrong already.  I changed it to

+and also arrays and composite types containing sortable data types.

(Using "composite type" not "record" is a judgment call here, but
I don't see anyplace else in func.sgml preferring "record" for this
meaning.)

> I am not sure to get the concerns of upthread regarding the type
> caching in the context of an aggregate, which is the business with
> lookup_type_cache(), especially since there is a btree operator
> relying on record_cmp().  Tom, what were your concerns here?

Re-reading, I was just mentioning that as something to check,
not a major problem.  It isn't, because array min/max are already
relying on the ability to use fcinfo->flinfo->fn_extra as cache space
in an aggregate.  (Indeed, the array aggregate code is almost
identical to where we ended up.)

AFAICS this is good to go.  I made a couple of tiny cosmetic
adjustments, added a catversion bump, and pushed.

regards, tom lane




Re: tests fail on windows with default git settings

2024-07-11 Thread Dave Page
On Wed, 10 Jul 2024 at 10:35, Dave Page  wrote:

> > The other failures I see are the following, which I'm just starting to
>>> dig
>>> > into:
>>> >
>>> >  26/298 postgresql:recovery / recovery/019_replslot_limit
>>> > ERROR43.05s   exit status 2
>>> >  44/298 postgresql:recovery / recovery/027_stream_regress
>>> > ERROR   383.08s   exit status 1
>>> >  50/298 postgresql:recovery / recovery/035_standby_logical_decoding
>>> > ERROR   138.06s   exit status 25
>>> >  68/298 postgresql:recovery / recovery/040_standby_failover_slots_sync
>>> >  ERROR   132.87s   exit status 25
>>> > 170/298 postgresql:pg_dump / pg_dump/002_pg_dump
>>> >  ERROR93.45s   exit status 2
>>> > 233/298 postgresql:bloom / bloom/001_wal
>>> >  ERROR54.47s   exit status 2
>>> > 236/298 postgresql:subscription / subscription/001_rep_changes
>>> >  ERROR46.46s   exit status 2
>>> > 246/298 postgresql:subscription / subscription/010_truncate
>>> > ERROR47.69s   exit status 2
>>> > 253/298 postgresql:subscription / subscription/013_partition
>>> >  ERROR   125.63s   exit status 25
>>> > 255/298 postgresql:subscription / subscription/022_twophase_cascade
>>> > ERROR58.13s   exit status 2
>>> > 257/298 postgresql:subscription / subscription/015_stream
>>> > ERROR   128.32s   exit status 2
>>> > 262/298 postgresql:subscription / subscription/028_row_filter
>>> > ERROR43.14s   exit status 2
>>> > 263/298 postgresql:subscription / subscription/027_nosuperuser
>>> >  ERROR   102.02s   exit status 2
>>> > 269/298 postgresql:subscription / subscription/031_column_list
>>> >  ERROR   123.16s   exit status 2
>>> > 271/298 postgresql:subscription / subscription/032_subscribe_use_index
>>> >  ERROR   139.33s   exit status 2
>>>
>>> Hm, it'd be good to see some of errors behind that ([1]).
>>>
>>> I suspect it might be related to conflicting ports. I had to use
>>> PG_TEST_USE_UNIX_SOCKETS to avoid random tests from failing:
>>>
>>>   # use unix socket to prevent port conflicts
>>>   $env:PG_TEST_USE_UNIX_SOCKETS = 1;
>>>   # otherwise pg_regress insists on creating the directory and
>>> does it
>>>   # in a non-existing place, this needs to be fixed :(
>>>   mkdir d:/sockets
>>>   $env:PG_REGRESS_SOCK_DIR = "d:/sockets/"
>>>
>>
>> No, it all seems to be fallout from GSSAPI being included in the build.
>> If I build without that, everything passes. Most of the tests are failing
>> with a "too many clients already" error, but a handful do seem to include
>> GSSAPI auth related errors as well. For example, this is from
>>
>
>
> ... this is from subscription/001_rep_changes:
>
> [14:46:57.723](2.318s) ok 11 - check rows on subscriber after table drop
> from publication
> connection error: 'psql: error: connection to server at "127.0.0.1", port
> 58059 failed: could not initiate GSSAPI security context: No credentials
> were supplied, or the credentials were unavailable or inaccessible:
> Credential cache is empty
> connection to server at "127.0.0.1", port 58059 failed: FATAL:  sorry, too
> many clients already'
> while running 'psql -XAtq -d port=58059 host=127.0.0.1 dbname='postgres'
> -f - -v ON_ERROR_STOP=1' at
> C:/Users/dpage/git/postgresql/src/test/perl/PostgreSQL/Test/Cluster.pm line
> 2129.
> # Postmaster PID for node "publisher" is 14488
> ### Stopping node "publisher" using mode immediate
> # Running: pg_ctl -D
> C:\Users\dpage\git\postgresql\build/testrun/subscription/001_rep_changes\data/t_001_rep_changes_publisher_data/pgdata
> -m immediate stop
> waiting for server to shut down done
> server stopped
> # No postmaster PID for node "publisher"
> # Postmaster PID for node "subscriber" is 15012
> ### Stopping node "subscriber" using mode immediate
> # Running: pg_ctl -D
> C:\Users\dpage\git\postgresql\build/testrun/subscription/001_rep_changes\data/t_001_rep_changes_subscriber_data/pgdata
> -m immediate stop
> waiting for server to shut down done
> server stopped
> # No postmaster PID for node "subscriber"
> [14:46:59.068](1.346s) # Tests were run but no plan was declared and
> done_testing() was not seen.
> [14:46:59.069](0.000s) # Looks like your test exited with 2 just after 11.
>

So I received an off-list tip to checkout [1], a discussion around GSSAPI
causing test failures on windows that Alexander Lakhin was looking at.
Thomas Munro's v2 patch to try to address the issue brought me down to just
a single test failure with GSSAPI enabled on 17b2 (with a second, simple
fix for the OpenSSL/Kerberos/x509 issue): pg_dump/002_pg_dump. The
relevant section from the log looks like this:

[15:28:42.692](0.006s) not ok 2 - connecting to a non-existent database:
matches

Re: Restart pg_usleep when interrupted

2024-07-11 Thread Nathan Bossart
+	/*
+	 * We allow nanosleep to handle interrupts and retry with the remaining
+	 * time. However, since nanosleep is susceptible to time drift when
+	 * interrupted frequently, we add a safeguard to break out of the
+	 * nanosleep whenever the total time of the sleep exceeds the requested
+	 * sleep time. Using nanosleep is a more portable approach than
+	 * clock_nanosleep.
+	 */

I'm curious why we wouldn't just subtract "elapsed_time" from "delay" at
the bottom of the while loop to avoid needing this extra check.  Also, I
think we need some commentary about why we want to retry after an interrupt
in this case.

-- 
nathan



