Re: Add new error_action COPY ON_ERROR "log"

2024-03-15 Thread Michael Paquier
On Wed, Mar 13, 2024 at 07:32:37PM +0530, Bharath Rupireddy wrote:
> I'm attaching the v8 patch set implementing the above idea. With this,
> [1] is sent to the client, [2] is sent to the server log. This
> approach not only reduces the duplicate info in the NOTICE and CONTEXT
> messages, but also makes it easy for users to get all the necessary
> info in the NOTICE message without having to set extra parameters to
> get CONTEXT message.

In terms of eliminating the information duplication while allowing
clients to know the value, the attribute and the line involved in the
conversion failure, the approach of tweaking the context information
has merits, I guess.

How do others feel about this kind of tuning?
--
Michael




Re: type cache cleanup improvements

2024-03-15 Thread Teodor Sigaev

Okay, I've applied this piece for now.  Not sure I'll have much room
to look at the rest.


Thank you very much!

Rest of patches, rebased.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/

From b30af080d7768c2fdb6198e2e40ef93419f60732 Mon Sep 17 00:00:00 2001
From: Teodor Sigaev 
Date: Fri, 15 Mar 2024 13:55:10 +0300
Subject: [PATCH 3/3] usage of hash_search_with_hash_value

---
 src/backend/utils/cache/attoptcache.c | 39 
 src/backend/utils/cache/typcache.c| 52 +++
 2 files changed, 70 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c
index af978ccd4b1..3a18b2e9a77 100644
--- a/src/backend/utils/cache/attoptcache.c
+++ b/src/backend/utils/cache/attoptcache.c
@@ -44,12 +44,10 @@ typedef struct
 
 /*
  * InvalidateAttoptCacheCallback
- *		Flush all cache entries when pg_attribute is updated.
+ *		Flush cache entry (or entries) when pg_attribute is updated.
  *
  * When pg_attribute is updated, we must flush the cache entry at least
- * for that attribute.  Currently, we just flush them all.  Since attribute
- * options are not currently used in performance-critical paths (such as
- * query execution), this seems OK.
+ * for that attribute.
  */
 static void
 InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
@@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	HASH_SEQ_STATUS status;
 	AttoptCacheEntry *attopt;
 
-	hash_seq_init(, AttoptCacheHash);
+	/*
+	 * By convention, a zero hash value is passed to the callback as a sign
+	 * that the whole cache must be invalidated. See sinval.c, inval.c and
+	 * InvalidateSystemCachesExtended().
+	 */
+	if (hashvalue == 0)
+		hash_seq_init(, AttoptCacheHash);
+	else
+		hash_seq_init_with_hash_value(, AttoptCacheHash, hashvalue);
+
 	while ((attopt = (AttoptCacheEntry *) hash_seq_search()) != NULL)
 	{
 		if (attopt->opts)
@@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	}
 }
 
+/*
+ * Hash function compatible with two-arg system cache hash function.
+ */
+static uint32
+relatt_cache_syshash(const void *key, Size keysize)
+{
+	const AttoptCacheKey* ckey = key;
+
+	Assert(keysize == sizeof(*ckey));
+	return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum);
+}
+
 /*
  * InitializeAttoptCache
  *		Initialize the attribute options cache.
@@ -82,9 +101,17 @@ InitializeAttoptCache(void)
 	/* Initialize the hash table. */
 	ctl.keysize = sizeof(AttoptCacheKey);
 	ctl.entrysize = sizeof(AttoptCacheEntry);
+
+	/*
+	 * AttoptCacheEntry takes its hash value from the system cache. For
+	 * AttoptCacheHash we use the same hash in order to speed up searches
+	 * by hash value. This is used by hash_seq_init_with_hash_value().
+	 */
+	ctl.hash = relatt_cache_syshash;
+
 	AttoptCacheHash =
 		hash_create("Attopt cache", 256, ,
-	HASH_ELEM | HASH_BLOBS);
+	HASH_ELEM | HASH_FUNCTION);
 
 	/* Make sure we've initialized CacheMemoryContext. */
 	if (!CacheMemoryContext)
diff --git a/src/backend/utils/cache/typcache.c b/src/backend/utils/cache/typcache.c
index 7936c3b46d0..9145088f44d 100644
--- a/src/backend/utils/cache/typcache.c
+++ b/src/backend/utils/cache/typcache.c
@@ -339,6 +339,15 @@ static TupleDesc find_or_make_matching_shared_tupledesc(TupleDesc tupdesc);
 static dsa_pointer share_tupledesc(dsa_area *area, TupleDesc tupdesc,
    uint32 typmod);
 
+/*
+ * Hash function compatible with one-arg system cache hash function.
+ */
+static uint32
+type_cache_syshash(const void *key, Size keysize)
+{
+	Assert(keysize == sizeof(Oid));
+	return GetSysCacheHashValue1(TYPEOID, ObjectIdGetDatum(*(const Oid*)key));
+}
 
 /*
  * lookup_type_cache
@@ -364,8 +373,15 @@ lookup_type_cache(Oid type_id, int flags)
 
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(TypeCacheEntry);
+		/*
+		 * TypeCacheEntry takes its hash value from the system cache. For
+		 * TypeCacheHash we use the same hash in order to speed up searches
+		 * by hash value. This is used by hash_seq_init_with_hash_value().
+		 */
+		ctl.hash = type_cache_syshash;
+
 		TypeCacheHash = hash_create("Type information cache", 64,
-	, HASH_ELEM | HASH_BLOBS);
+	, HASH_ELEM | HASH_FUNCTION);
 
 		ctl.keysize = sizeof(Oid);
 		ctl.entrysize = sizeof(mapRelTypeEntry);
@@ -421,8 +437,7 @@ lookup_type_cache(Oid type_id, int flags)
 
 		/* These fields can never change, by definition */
 		typentry->type_id = type_id;
-		typentry->type_id_hash = GetSysCacheHashValue1(TYPEOID,
-	   ObjectIdGetDatum(type_id));
+		typentry->type_id_hash = get_hash_value(TypeCacheHash, _id);
 
 		/* Keep this part in sync with the code below */
 		typentry->typlen = typtup->typlen;
@@ -2429,20 +2444,27 @@ TypeCacheTypCallback(Datum arg, int cacheid, uint32 hashvalue)
 	TypeCacheEntry *typentry;
 
 

Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning

2024-03-15 Thread Amit Langote
Hi Ashutosh,

On Mon, Feb 19, 2024 at 10:01 PM Ashutosh Bapat
 wrote:
> On Sun, Feb 18, 2024 at 10:55 PM Tomas Vondra
>  wrote:
> >
> > Hi,
> >
> > I took a quick look at this patch today. I certainly agree with the
> > intent to reduce the amount of memory during planning, assuming it's not
> > overly disruptive. And I think this patch is fairly localized and looks
> > sensible.
>
> Thanks for looking at the patch-set.
> >
> > That being said I'm not a big fan of using a local variable on stack and
> > filling it. I'd probably go with the usual palloc/pfree, because that
> > makes it much easier to use - the callers would not be responsible for
> > allocating the SpecialJoinInfo struct. Sure, it's a little bit of
> > overhead, but with the AllocSet caching I doubt it's measurable.
>
> You are suggesting that instead of declaring a local variable of type
> SpecialJoinInfo, build_child_join_sjinfo() should palloc() and return
> SpecialJoinInfo which will be freed by free_child_sjinfo()
> (free_child_sjinfo_members in the patch). I am fine with that.

Agree with Tomas about using makeNode() / pfree().  Having the pfree()
kind of makes it extra-evident that those SpecialJoinInfos are
transitory.
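
(For illustration, a rough sketch of the allocate/free pattern being
agreed on; this is not the patch's actual code, and the surrounding
helper functions may differ:

	/* per child join, instead of filling a stack variable: */
	SpecialJoinInfo *child_sjinfo = makeNode(SpecialJoinInfo);

	/* ... translate the parent SpecialJoinInfo's fields for the child ... */

	/* ... use child_sjinfo to build the child join paths ... */

	pfree(child_sjinfo);	/* explicitly freed: clearly transitory */
)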

> > I did put this through check-world on amd64/arm64, with valgrind,
> > without any issue. I also tried the scripts shared by Ashutosh in his
> > initial message (with some minor fixes, adding MEMORY to explain etc).
> >
> > The results with the 20240130 patches are like this:
> >
> >     tables    master    patched
> >    ------------------------------
> >          2      40.8       39.9
> >          3     151.7      142.6
> >          4     464.0      418.5
> >          5    1663.9     1419.5

Could you please post the numbers with the palloc() / pfree() version?

-- 
Thanks, Amit Langote




Re: Add the ability to limit the amount of memory that can be allocated to backends.

2024-03-15 Thread Alexander Lakhin

Hello Anton,

14.03.2024 23:36, Anton A. Melnikov wrote:

On 13.03.2024 10:41, Anton A. Melnikov wrote:


Here is a version updated for the current master.



During patch updating I mistakenly added double counting of deallocated
blocks.
That's why the tests in the patch tester failed.
Fixed it and squashed fix 0002 with 0001.
Here is the fixed version.


Please try the following with the patches applied:
echo "shared_buffers = '1MB'
max_total_backend_memory = '10MB'" > /tmp/extra.config

CPPFLAGS="-Og" ./configure --enable-tap-tests --enable-debug --enable-cassert 
...
TEMP_CONFIG=/tmp/extra.config make check

It fails for me as follows:
...
# postmaster did not respond within 60 seconds, examine 
".../src/test/regress/log/postmaster.log" for the reason
...
src/test/regress/log/postmaster.log contains:
...
TRAP: failed Assert("ret != NULL"), File: "mcxt.c", Line: 1327, PID: 4109270
TRAP: failed Assert("ret != NULL"), File: "mcxt.c", Line: 1327, PID: 4109271
postgres: autovacuum launcher (ExceptionalCondition+0x69)[0x55ce441fcc6e]
postgres: autovacuum launcher (palloc0+0x0)[0x55ce4422eb67]
postgres: logical replication launcher 
(ExceptionalCondition+0x69)[0x55ce441fcc6e]
postgres: autovacuum launcher (InitDeadLockChecking+0xa6)[0x55ce4408a6f0]
postgres: logical replication launcher (palloc0+0x0)[0x55ce4422eb67]
postgres: logical replication launcher 
(InitDeadLockChecking+0x45)[0x55ce4408a68f]
postgres: autovacuum launcher (InitProcess+0x600)[0x55ce4409c6f2]
postgres: logical replication launcher (InitProcess+0x600)[0x55ce4409c6f2]
postgres: autovacuum launcher (+0x44b4e2)[0x55ce43ff24e2]
...
grep TRAP src/test/regress/log/postmaster.log | wc -l
445

Best regards,
Alexander




Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 09:39, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:

I can see that some tests would want to be able to inject code
cluster-wide, but I bet that's going to be a small minority.
I suggest that we invent a notion of "global" vs "local"
injection points, where a "local" one only fires in the DB it
was defined in.  Then only tests that require a global injection
point need to be NO_INSTALLCHECK.


Attached is a POC of what could be done.  I have extended the
injection_points module so that it is possible to register what I'm
calling a "condition", which can be defined with a new SQL function.

The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks.  With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so as
injection_points_attach() takes an optional argument with a database
name.  Or this could use a different, new SQL function, say a
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, making the whole happening
while holding once the spinlock of the shmem state for the module.


For the gin test, a single "SELECT injection_points_attach_local()" at 
the top of the test file would be most convenient.


If I have to do "SELECT 
injection_points_condition('gin-finish-incomplete-split', :'datname');" 
for every injection point in the test, I will surely forget it sometimes.


In the 'gin' test, they could actually be scoped to the same backend.


Wrt. the spinlock and shared memory handling, I think this would be 
simpler if you could pass some payload in the InjectionPointAttach() 
call, which would be passed back to the callback function:


 void
 InjectionPointAttach(const char *name,
                      const char *library,
-                     const char *function)
+                     const char *function,
+                     uint64 payload)

In this case, the payload would be the "slot index" in shared memory.

Or perhaps always allocate, say, 1024 bytes of working area for every 
attached injection point that the test module can use any way it wants. 
Like for storing extra conditions, or for the wakeup counter stuff in 
injection_wait(). A fixed size working area is a little crude, but would 
be very handy in practice.
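
(A rough sketch of the database-scoped check being discussed, purely
for illustration; the condition-lookup function here is hypothetical,
not the POC's actual code:

	static void
	injection_points_callback(const char *name)
	{
		/* hypothetical: fetch the database the condition was registered for */
		Oid		condition_dboid = injection_points_get_condition(name);

		/* a "local" point fires only in the database it was registered in */
		if (OidIsValid(condition_dboid) && condition_dboid != MyDatabaseId)
			return;

		/* ... run the injected behavior (error, wait, notice, ...) ... */
	}
)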



By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.


Oops.

It would be nice to automatically detach all the injection points on 
process exit. You wouldn't always want that, but I think most tests hold 
a session open throughout the test, and for those it would be handy.


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: remaining sql/json patches

2024-03-15 Thread jian he
On Mon, Mar 11, 2024 at 11:30 AM jian he  wrote:
>
> On Sun, Mar 10, 2024 at 10:57 PM jian he  wrote:
> >
> > one more issue.
>
> Hi
> one more documentation issue.
> after applying V42, 0001 to 0003,
> there are 11 appearances of `FORMAT JSON` in functions-json.html,
> but still not a single place explains what it is for.
>
> json_query ( context_item, path_expression [ PASSING { value AS
> varname } [, ...]] [ RETURNING data_type [ FORMAT JSON [ ENCODING UTF8
> ] ] ] [ { WITHOUT | WITH { CONDITIONAL | [UNCONDITIONAL] } } [ ARRAY ]
> WRAPPER ] [ { KEEP | OMIT } QUOTES [ ON SCALAR STRING ] ] [ { ERROR |
> NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression } ON EMPTY ]
> [ { ERROR | NULL | EMPTY { [ ARRAY ] | OBJECT } | DEFAULT expression }
> ON ERROR ])
>
> FORMAT JSON seems to be just syntactic sugar, or for compatibility, in json_query.
> but it returns an error when the returning type category is not
> TYPCATEGORY_STRING.
>
> for example, even the following will return an error.
> `
> CREATE TYPE regtest_comptype AS (b text);
> SELECT JSON_QUERY(jsonb '{"a":{"b":"c"}}', '$.a' RETURNING
> regtest_comptype format json);
> `
>
> seems only types in [0] will not generate an error when specifying
> FORMAT JSON in JSON_QUERY.
>
> so it actually does something, and is not just syntactic sugar?
>

SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 text format json PATH '$' omit quotes));
SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 text format json PATH '$' keep quotes));
SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 text PATH '$' keep quotes)); -- JSON_QUERY_OP
SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 text PATH '$' omit quotes)); -- JSON_QUERY_OP
SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 text PATH '$')); -- JSON_VALUE_OP
SELECT * FROM JSON_TABLE(jsonb'[{"aaa": 123}]', 'lax $[*]' COLUMNS
(js2 json PATH '$')); -- JSON_QUERY_OP
Comparing these queries, I think the main usage of 'FORMAT JSON' is in json_table.

CREATE TYPE regtest_comptype AS (b text);
SELECT JSON_QUERY(jsonb '{"a":{"b":"c"}}', '$.a' RETURNING
regtest_comptype format json);
ERROR:  cannot use JSON format with non-string output types
LINE 1: ..."a":{"b":"c"}}', '$.a' RETURNING regtest_comptype format jso...
 ^
the error message is not good, but that's a minor issue. We can pursue it later.
-
SELECT JSON_QUERY(jsonb 'true', '$' RETURNING int KEEP QUOTES );
SELECT JSON_QUERY(jsonb 'true', '$' RETURNING int omit QUOTES );
SELECT JSON_VALUE(jsonb 'true', '$' RETURNING int);
the third query returns integer 1, not sure this is the desired behavior.
it obviously has an implication for json_table.
-
in jsonb_get_element, we have something like:
if (jbvp->type == jbvBinary)
{
container = jbvp->val.binary.data;
have_object = JsonContainerIsObject(container);
have_array = JsonContainerIsArray(container);
Assert(!JsonContainerIsScalar(container));
}

+ res = JsonValueListHead();
+ if (res->type == jbvBinary && JsonContainerIsScalar(res->val.binary.data))
+ JsonbExtractScalar(res->val.binary.data, res);
So in JsonPathValue, the above (res->type == jbvBinary) is unreachable?
also see the comment in jbvBinary.

maybe we can just simply do:
if (res->type == jbvBinary)
Assert(!JsonContainerIsScalar(res->val.binary.data));
-
+
+JSON_TABLE (
+  context_item, path_expression [ AS json_path_name ]
+    [ PASSING { value AS varname } [, ...] ]
+  COLUMNS ( json_table_column [, ...] )
+  [ { ERROR | EMPTY } ON ERROR ]
+  [ PLAN ( json_table_plan ) |
+    PLAN DEFAULT ( { INNER | OUTER } [ , { CROSS | UNION } ]
+                 | { CROSS | UNION } [ , { INNER | OUTER } ] ) ]
+)

based on the synopsis
the following query should not be allowed?
SELECT * FROM (VALUES ('"11"'), ('"err"')) vals(js)
LEFT OUTER JOIN JSON_TABLE(vals.js::jsonb, '$' COLUMNS (a int PATH
'$') default '11' ON ERROR) jt ON true;

also, the synopsis needs to reflect cases like:
SELECT * FROM (VALUES ('"11"'), ('"err"')) vals(js)
LEFT OUTER JOIN JSON_TABLE(vals.js::jsonb, '$' COLUMNS (a int PATH
'$') NULL ON ERROR) jt ON true;




Re: Infinite loop in XLogPageRead() on standby

2024-03-15 Thread Ants Aasma
On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi  wrote:
>
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi 
>  wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to reproduce the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part).
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainty
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.


This sounds very similar to the problem described in [1]. And I think
both will be resolved by that change.

[1] 
https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com




Re: MERGE ... RETURNING

2024-03-15 Thread Dean Rasheed
On Fri, 15 Mar 2024 at 11:06, Dean Rasheed  wrote:
>
> Updated patch attached.
>

I have gone over this patch again in detail, and I believe that the
code is in good shape. All review comments have been addressed, and
the only thing remaining is the syntax question.

To recap, this adds support for a single RETURNING list at the end of
a MERGE command, and a special MERGE_ACTION() function that may be
used in the RETURNING list to return the action command string
('INSERT', 'UPDATE', or 'DELETE') that was executed.
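
For illustration, a minimal example of the proposed syntax (table and
column names are hypothetical):

    MERGE INTO target t
    USING source s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET val = s.val
    WHEN NOT MATCHED THEN INSERT (id, val) VALUES (s.id, s.val)
    RETURNING MERGE_ACTION(), t.*;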

Looking for similar precedents in other databases, SQL Server uses a
slightly different (non-standard) syntax for MERGE, and uses "OUTPUT"
instead of "RETURNING" to return rows. But it does allow "$action" in
the output list, which is functionally equivalent to MERGE_ACTION():

https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-ver16#output_clause

In the future, we may choose to support the SQL standard syntax for
returning rows modified by INSERT, UPDATE, DELETE, and MERGE commands,
but I don't think that this patch needs to do that.

What this patch does is to make MERGE more consistent with INSERT,
UPDATE, and DELETE, by allowing RETURNING. And if the patch to add
support for returning OLD/NEW values [1] makes it in too, it will be
more powerful than the SQL standard syntax, since it will allow both
old and new values to be returned at the same time, in arbitrary
expressions.

So barring any further objections, I'd like to go ahead and get this
patch committed.

Regards,
Dean

[1] 
https://www.postgresql.org/message-id/flat/CAEZATCWx0J0-v=Qjc6gXzR=KtsdvAE7Ow=D=mu50agoe+pv...@mail.gmail.com




Re: Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-03-15 Thread Nazir Bilal Yavuz
Hi,

On Thu, 7 Mar 2024 at 15:26, Cédric Villemain
 wrote:
>
> On 07/03/2024 12:19, Nazir Bilal Yavuz wrote:
> >
> > I did not test read performance but I am planning to do that soon.

I did not have the time to check other things you mentioned but I
tested the read performance. The table size is 5.5GB, I did 20 runs in
total.

When the OS cache is cleared:

Master -> Median: 2266.293ms, Avg: 2265.5038ms
Patched -> Median: 2166.493ms, Avg: 2183.6208ms

When the buffers are in the OS cache:

Master -> Median: 585.719ms, Avg: 583.5032ms
Patched -> Median: 533.071ms, Avg: 532.7119ms

The patched version is better on both: ~4% faster when the OS cache is
cleared, ~9% faster when the buffers are in the OS cache.
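
(For context, the modes being exercised here are pg_prewarm's standard
ones, e.g., with 'my_table' as a placeholder:

    CREATE EXTENSION pg_prewarm;
    SELECT pg_prewarm('my_table', 'prefetch');  -- posix_fadvise(WILLNEED)
    SELECT pg_prewarm('my_table', 'read');      -- synchronous reads
)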

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Adding comments to help understand psql hidden queries

2024-03-15 Thread Jim Jones
Hi Greg, hi David

On 01.02.24 23:39, David Christensen wrote:
> On Thu, Feb 1, 2024 at 4:34 PM Greg Sabino Mullane  wrote:
>> The --echo-hidden flag in psql is used to show people the way 
>> psql performs its magic for its backslash commands. None of them has more 
>> magic than "\d relation", but it suffers from needing a lot of separate 
>> queries to gather all of the information it needs. Unfortunately, those 
>> queries can get overwhelming, and it can be hard to figure out which one 
>> does what, especially for those not already very familiar with the system catalogs. 
>> Attached is a patch to add a small SQL comment to the top of each SELECT 
>> query inside describeOneTableDetail. All other functions use a single query, 
>> and thus need no additional context. But "\d mytable" has the potential to 
>> run over a dozen SQL queries! The new format looks like this:
>>
>> /**************** QUERY *********************/
>> /* Get information about row-level policies */
>> SELECT pol.polname, pol.polpermissive,
>>   CASE WHEN pol.polroles = '{0}' THEN NULL ELSE 
>> pg_catalog.array_to_string(array(select rolname from pg_catalog.pg_roles 
>> where oid = any (pol.polroles) order by 1),',') END,
>>   pg_catalog.pg_get_expr(pol.polqual, pol.polrelid),
>>   pg_catalog.pg_get_expr(pol.polwithcheck, pol.polrelid),
>>   CASE pol.polcmd
>> WHEN 'r' THEN 'SELECT'
>> WHEN 'a' THEN 'INSERT'
>> WHEN 'w' THEN 'UPDATE'
>> WHEN 'd' THEN 'DELETE'
>> END AS cmd
>> FROM pg_catalog.pg_policy pol
>> WHERE pol.polrelid = '134384' ORDER BY 1;
>> /*********************************************/
>>
>> Cheers,
>> Greg
> Thanks, this looks like some helpful information. In the same vein,
> I'm including a patch which adds information about the command that
> generates the given query as well (atop your commit).  This will
> modify the query line to include the command itself:
>
> /**************** QUERY (\dRs) *********************/
>
> Best,
>
> David

Having this kind of information in each query would have saved me a lot
of time in the past :) +1
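
(For anyone following along: the hidden queries under discussion are
the ones psql prints with the --echo-hidden flag, e.g., with 'mytable'
as a placeholder:

    psql -E -c '\d mytable'    # -E is shorthand for --echo-hidden
)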

There is a tiny little issue in the last patch (qualifiers):

command.c:312:16: warning: assignment discards ‘const’ qualifier from
pointer target type [-Wdiscarded-qualifiers]
  312 | curcmd = cmd;

Thanks

-- 
Jim





Re: Extensible storage manager API - SMGR hook Redux

2024-03-15 Thread Aleksander Alekseev
Hi,

> Thought I would show off what is possible with this patchset.
>
> [...]

Just wanted to let you know that cfbot doesn't seem to be entirely
happy with the patch [1]. Please consider submitting an updated
version.

Best regards,
Aleksander Alekseev (wearing co-CFM hat)

[1]: http://cfbot.cputube.org/




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Tom Lane
Daniel Gustafsson  writes:
> I can't see how refusing to free memory owned and controlled by someone else,
> and throwing an error if attempted, wouldn't be a sound defensive programming
> measure.

I think the argument is about what "refusal" looks like.
An Assert seems OK to me, but anything based on elog is
likely to be problematic, because it'll involve exit()
somewhere.

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
 wrote:
> I've decided to put my hands on this patch.
>
> On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila  wrote:
> > +1 for the second one not only because it avoids new words in grammar
> > but also sounds to convey the meaning. I think you can explain in docs
> > how this feature can be used basically how will one get the correct
> > LSN value to specify.
>
> I picked the second option and left only the AFTER clause for the
> BEGIN statement.  I think this should be enough for the beginning.
>
> > As suggested previously also pick one of the approaches (I would
> > advocate the second one) and keep an option for the second one by
> > mentioning it in the commit message. I hope to see more
> > reviews/discussions or usage like how will users get the LSN value to
> > be specified on the core logic of the feature at this stage. IF
> > possible, state, how real-world applications could leverage this
> > feature.
>
> I've added a paragraph to the docs about the usage.  After you made
> some changes on primary, you run pg_current_wal_insert_lsn().  Then
> connect to replica and run 'BEGIN AFTER lsn' with the just obtained
> LSN.  Now you're guaranteed to see the changes made to the primary.
>
> Also, I've significantly reworked other aspects of the patch.  The
> most significant changes are:
> 1) Waiters are now stored in the array sorted by LSN.  This saves us
> from scanning the whole per-backend array.
> 2) Waiters are removed from the array immediately once their LSNs are
> replayed.  Otherwise, the WAL replayer will keep scanning the shared
> memory array till waiters wake up.
> 3) To clean up after errors, we now call WaitLSNCleanup() on backend
> shmem exit.  I think this is preferable over the previous approach to
> remove from the queue before ProcessInterrupts().
> 4) There is now a condition to recheck if the LSN has been replayed
> after adding to the shared memory array.  This should save us from
> race conditions.
> 5) I've renamed too generic names for functions and files.
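
(To illustrate the usage described in the quoted paragraph above, under
the patch's proposed syntax; the LSN shown is a made-up example:

    -- on the primary, after making changes:
    SELECT pg_current_wal_insert_lsn();   -- say it returns '0/3000628'

    -- on the standby:
    BEGIN AFTER '0/3000628';
    SELECT ...;   -- guaranteed to see the primary's changes
    COMMIT;
)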

I went through this patch another time, and made some minor
adjustments.  Now it looks good, I'm going to push it if no
objections.

--
Regards,
Alexander Korotkov


v9-0001-Implement-AFTER-clause-for-BEGIN-command.patch


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 12:24:00PM +0530, Amit Kapila wrote:
> On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy
>  wrote:
>> On Wed, Mar 13, 2024 at 9:21 AM Amit Kapila  wrote:
>> > > So, how about we turn conflict_reason to only report the reasons that
>> > > actually cause conflict with recovery for logical slots, something
>> > > like below, and then have invalidation_cause as a generic column for
>> > > all sorts of invalidation reasons for both logical and physical slots?
>> >
>> > If our above understanding is correct then coflict_reason will be a
>> > subset of invalidation_reason. If so, whatever way we arrange this
>> > information, there will be some sort of duplicity unless we just have
>> > one column 'invalidation_reason' and update the docs to interpret it
>> > correctly for conflicts.
>>
>> Yes, there will be some sort of duplicity if we emit conflict_reason
>> as a text field. However, I still think the better way is to turn
>> conflict_reason text to conflict boolean and set it to true only on
>> rows_removed and wal_level_insufficient invalidations. When conflict
>> boolean is true, one (including all the tests that we've added
>> recently) can look for invalidation_reason text field for the reason.
>> This sounds reasonable to me as opposed to we just mentioning in the
>> docs that "if invalidation_reason is rows_removed or
>> wal_level_insufficient it's the reason for conflict with recovery".
> 
> Fair point. I think we can go either way. Bertrand, Nathan, and
> others, do you have an opinion on this matter?

WFM

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: CF entries for 17 to be reviewed

2024-03-15 Thread Aleksander Alekseev
Hi,

> > Aleksander, I would greatly appreciate if you join me in managing CF. 
> > Together we can move more stuff :)
> > Currently, I'm going through "SQL Commands". And so far I had not come to 
> > "Performance" and "Server Features" at all... So if you can handle updating 
> > statuses of that sections - that would be great.
>
> Server Features:
>
> [...]
>
> * (to be continued)

Server Features:

* Custom storage managers (SMGR), redux
  The patch needs a rebase. I notified the authors.
* pg_tracing
  Ditto. The patch IMO is promising. I encourage anyone interested in
  the topic to take a look.
* Support run-time partition pruning for hash join
  The patch needs (more) review. It doesn't look extremely complex.
* Support prepared statement invalidation when result or argument types change
  I changed the status to "Waiting on Author". The patch needs a
  rebase since January.
* Allow INSTEAD OF DELETE triggers to modify the tuple for RETURNING
  Needs rebase.

--
Best regards,
Aleksander Alekseev




Re: [HACKERS] Allow INSTEAD OF DELETE triggers to modify the tuple for RETURNING

2024-03-15 Thread Aleksander Alekseev
Hi,

> it took me a while to figure out why the doc build fails.
>
> [...]
>
> Anyway, based on your patch, I modified it, also added a slightly more 
> complicated test.

Thank you for keeping the patch up-to-date. Unfortunately, it seems to
be needing another rebase, according to cfbot.

Best regards,
Aleksander Alekseev (wearing co-CFM hat)




Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 16:00, Tom Lane wrote:

Heikki Linnakangas  writes:

On 15/03/2024 13:09, Heikki Linnakangas wrote:

I committed a patch to do that, to put out the fire.



That's turning the buildfarm quite red. Many, but not all animals are
failing like this:


It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention.  Why would this test have done that,
if the module failed to load?


The gin_incomplete_split test inserts rows until it hits the injection
point, at page split. There is a backstop: it should give up after a
fixed number of iterations, but that was broken. Fixed that, thanks for
the report!


Hmm, don't we have any timeout that would kill tests if they get stuck?

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 14:10, Heikki Linnakangas wrote:

On 15/03/2024 13:09, Heikki Linnakangas wrote:

I committed a patch to do that, to put out the fire.


That's turning the buildfarm quite red. Many, but not all animals are
failing like this:


--- 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out
 2024-03-15 12:41:16.363286975 +0100
+++ 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out
  2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
  CREATE EXTENSION injection_points;
+ERROR:  extension "injection_points" is not available
+DETAIL:  Could not open extension control file 
"/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control":
 No such file or directory.
+HINT:  The extension must first be installed on the system where PostgreSQL is 
running.
...


Looks like adding NO_INSTALLCHECK somehow affected how the modules are
installed in tmp_install. I'll investigate..


I think this is a bug in the buildfarm client. In the make_misc_check 
step, it does this (reduced to just the interesting parts):



# run the modules that can't be run with installcheck
sub make_misc_check
{
	...
	my @dirs = glob("$pgsql/src/test/modules/* $pgsql/contrib/*");
	foreach my $dir (@dirs)
	{
		next unless -e "$dir/Makefile";
		my $makefile = file_contents("$dir/Makefile");
		next unless $makefile =~ /^NO_INSTALLCHECK/m;
		my $test = basename($dir);

		# skip redundant TAP tests which are called elsewhere
		my @out = run_log("cd $dir && $make $instflags TAP_TESTS= check");
		...
	}


So it scans src/test/modules, and runs "make check" for all 
subdirectories that have NO_INSTALLCHECK in the makefile. But the 
injection fault tests are also conditional on the 
enable_injection_points in the parent Makefile:



ifeq ($(enable_injection_points),yes)
SUBDIRS += injection_points gin
else
ALWAYS_SUBDIRS += injection_points gin
endif


The buildfarm client doesn't pay any attention to that, and runs the 
test anyway.


I committed an ugly hack to the subdirectory Makefiles, to turn "make 
check" into a no-op if injection points are disabled. Normally when you 
run "make check" at the parent level, it doesn't even recurse to the 
directories, but this works around the buildfarm script. I hope...


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Weird test mixup

2024-03-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 15/03/2024 13:09, Heikki Linnakangas wrote:
>> I committed a patch to do that, to put out the fire.

> That's turning the buildfarm quite red. Many, but not all animals are 
> failing like this:

It may be even worse than it appears from the buildfarm status page.
My animals were stuck in infinite loops that required a manual "kill"
to get out of, and it's reasonable to assume there are others that
will require owner intervention.  Why would this test have done that,
if the module failed to load?

regards, tom lane




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-15 Thread Masahiko Sawada
On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  
> > > > wrote:
> > > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > > general and put the validation down into C as well. First we save the
> > > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > > found values in another array. So the static items structure would
> > > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > > the iteration output is private to check_set_block_offsets(). Then
> > > > > sort as needed and check they are all the same.
> > > >
> > > > That's a promising idea. We can use the same mechanism for randomized
> > > > tests too. If you're going to work on this, I'll do other tests on my
> > > > environment in the meantime.
> > >
> > > Some progress on this in v72 -- I tried first without using SQL to
> > > save the blocks, just using the unique blocks from the verification
> > > array. It seems to work fine.
> >
> > Thanks!
>
> Seems I forgot the attachment last time...there's more stuff now
> anyway, based on discussion.

Thank you for updating the patches!

The idea of using three TID arrays for the lookup test and iteration
test looks good to me. I think we can add random-TIDs tests on top of
it.

>
> > > - Since there are now three arrays we should reduce max bytes to
> > > something smaller.
> >
> > Agreed.
>
> I went further than this, see below.
>
> > > - Further on that, I'm not sure if the "is full" test is telling us
> > > much. It seems we could make max bytes a static variable and set it to
> > > the size of the empty store. I'm guessing it wouldn't take much to add
> > > enough tids so that the contexts need to allocate some blocks, and
> > > then it would appear full and we can test that. I've made it so all
> > > arrays repalloc when needed, just in case.
> >
> > How about using work_mem as max_bytes instead of having it as a static
> > variable? In test_tidstore.sql we set work_mem before creating the
> > tidstore. It would make the tidstore more controllable by SQL queries.
>
> My complaint is that the "is full" test is trivial, and also strange
> in that max_bytes is used for two unrelated things:
>
> - the initial size of the verification arrays, which was always larger
> than necessary, and now there are three of them
> - the hint to TidStoreCreate to calculate its max block size / the
> threshold for being "full"
>
> To make the "is_full" test slightly less trivial, my idea is to save
> the empty store size and later add enough tids so that it has to
> allocate new blocks/DSA segments, which is not that many, and then it
> will appear full. I've done this and also separated the purpose of
> various sizes in v72-0009/10.

I see your point and the changes look good to me.

> Using actual work_mem seems a bit more difficult to make this work.

Agreed.

>
>
> > ---
> > +   if (TidStoreIsShared(ts))
> > +   found = shared_rt_set(ts->tree.shared, blkno, page);
> > +   else
> > +   found = local_rt_set(ts->tree.local, blkno, page);
> > +
> > +   Assert(!found);
> >
> > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > overwrite) the value, I think we should not expect that found is
> > always false.
>
> I find that a puzzling statement, since 1) it was designed for
> insert-only workloads, not actual overwrite IIRC and 2) the tests will
> now fail if the same block is set twice, since we just switched the
> tests to use a remnant of vacuum's old array. Having said that, I
> don't object to removing artificial barriers to using it for purposes
> not yet imagined, as long as test_tidstore.sql warns against that.

I think that if it supports only an insert-only workload and expects
each block to be set only once, it should raise an error rather than an
assertion. It's odd to me that the function fails only in
assertion-enabled builds even though it actually works fine in that
case.

As for test_tidstore, you're right that the test code doesn't handle
the case of setting the same block twice. I think that there is no
problem in the fixed-TIDs tests, but we would need something for
random-TIDs tests so that we don't set the same block twice. I guess
it could be trivial since we can use SQL queries to generate TIDs. I'm
not sure how the random-TIDs tests would be like, but I think we can
use SELECT DISTINCT to eliminate the duplicates of block numbers to
use.
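
(For example, something along these lines could generate duplicate-free
random block numbers; a sketch only, not the actual test:

    SELECT DISTINCT (random() * 1000)::int AS blkno
    FROM generate_series(1, 100);
)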

>
> Given the above two things, I think this function's comment needs
> stronger language about its limitations. Perhaps even mention that
> it's intended for, and optimized for, vacuum. You and I have long
> 

Re: recovery modules

2024-03-15 Thread Nathan Bossart
rebased for cfbot

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d3f497906daf1c405059b2c292f1eeb5cfeb742b Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 14:28:53 -0800
Subject: [PATCH v19 1/5] introduce routine for checking mutually exclusive
 string GUCs

---
 src/backend/postmaster/pgarch.c |  8 +++-
 src/backend/utils/misc/guc.c| 22 ++
 src/include/utils/guc.h |  3 +++
 3 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index f97035ca03..dcb5222bbe 100644
--- a/src/backend/postmaster/pgarch.c
+++ b/src/backend/postmaster/pgarch.c
@@ -815,11 +815,9 @@ LoadArchiveLibrary(void)
 {
 	ArchiveModuleInit archive_init;
 
-	if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- errmsg("both archive_command and archive_library set"),
- errdetail("Only one of archive_command, archive_library may be set.")));
+	(void) CheckMutuallyExclusiveStringGUCs(XLogArchiveLibrary, "archive_library",
+			XLogArchiveCommand, "archive_command",
+			ERROR);
 
 	/*
 	 * If shell archiving is enabled, use our special initialization function.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 391866145e..ac507008ce 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2659,6 +2659,28 @@ ReportGUCOption(struct config_generic *record)
 	pfree(val);
 }
 
+/*
+ * If both parameters are set, emits a log message at 'elevel' and returns
+ * false.  Otherwise, returns true.
+ */
+bool
+CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+ const char *p2val, const char *p2name,
+ int elevel)
+{
+	if (p1val[0] != '\0' && p2val[0] != '\0')
+	{
+		ereport(elevel,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("cannot set both %s and %s", p1name, p2name),
+ errdetail("Only one of %s or %s may be set.",
+		   p1name, p2name)));
+		return false;
+	}
+
+	return true;
+}
+
 /*
  * Convert a value from one of the human-friendly units ("kB", "min" etc.)
  * to the given base unit.  'value' and 'unit' are the input value and unit
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 3712aba09b..3088ada610 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -376,6 +376,9 @@ extern void RestrictSearchPath(void);
 extern void AtEOXact_GUC(bool isCommit, int nestLevel);
 extern void BeginReportingGUCOptions(void);
 extern void ReportChangedGUCOptions(void);
+extern bool CheckMutuallyExclusiveStringGUCs(const char *p1val, const char *p1name,
+			 const char *p2val, const char *p2name,
+			 int elevel);
 extern void ParseLongOption(const char *string, char **name, char **value);
 extern const char *get_config_unit_name(int flags);
 extern bool parse_int(const char *value, int *result, int flags,
-- 
2.25.1

>From a8e50ef80c91927d388b23b6e9eb034e046c02af Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 15 Feb 2023 10:36:00 -0800
Subject: [PATCH v19 2/5] refactor code for restoring via shell

---
 src/backend/Makefile  |   2 +-
 src/backend/access/transam/timeline.c |  12 +-
 src/backend/access/transam/xlog.c |  50 -
 src/backend/access/transam/xlogarchive.c  | 167 ---
 src/backend/access/transam/xlogrecovery.c |   3 +-
 src/backend/meson.build   |   1 +
 src/backend/postmaster/startup.c  |  16 +-
 src/backend/restore/Makefile  |  18 ++
 src/backend/restore/meson.build   |   5 +
 src/backend/restore/shell_restore.c   | 245 ++
 src/include/Makefile  |   2 +-
 src/include/access/xlogarchive.h  |   9 +-
 src/include/meson.build   |   1 +
 src/include/postmaster/startup.h  |   1 +
 src/include/restore/shell_restore.h   |  26 +++
 src/tools/pgindent/typedefs.list  |   1 +
 16 files changed, 407 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/restore/Makefile
 create mode 100644 src/backend/restore/meson.build
 create mode 100644 src/backend/restore/shell_restore.c
 create mode 100644 src/include/restore/shell_restore.h

diff --git a/src/backend/Makefile b/src/backend/Makefile
index 3d7be09529..88e0e53f82 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 SUBDIRS = access archive backup bootstrap catalog parser commands executor \
 	foreign lib libpq \
 	main nodes optimizer partitioning port postmaster \
-	regex replication rewrite \
+	regex replication restore rewrite \
 	statistics storage tcop tsearch utils $(top_builddir)/src/timezone \
 	jit
 
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 146751ae37..3269547de7 

Re: Popcount optimization using AVX512

2024-03-15 Thread Nathan Bossart
On Thu, Mar 14, 2024 at 07:50:46PM +, Amonson, Paul D wrote:
> As for new performance numbers: I just ran a full suite like I did
> earlier in the process. My latest results an equivalent to a pgbench
> scale factor 10 DB with the target column having varying column widths
> and appropriate random data are 1.2% improvement with a 2.2% Margin of
> Error at a 98% confidence level. Still seeing improvement and no
> regressions.

Which test suite did you run?  Those numbers seem potentially
indistinguishable from noise, which probably isn't great for such a large
patch set.

I ran John Naylor's test_popcount module [0] with the following command on
an i7-1195G7:

time psql postgres -c 'select drive_popcount(1000, 1024)'

Without your patches, this seems to take somewhere around 8.8 seconds.
With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the
tests a couple of times because I had a difficult time believing the amount
of improvement.)

[0] 
https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com
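
(For readers unfamiliar with the technique: the core of the AVX-512
approach is the VPOPCNTDQ instruction. A generic sketch, assuming
compiler and CPU support, and not the patch's actual code:

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    static uint64_t
    popcount_avx512(const uint64_t *buf, size_t n)
    {
        __m512i     accum = _mm512_setzero_si512();
        uint64_t    total;
        size_t      i;

        /* count 512 bits (eight 64-bit words) per iteration */
        for (i = 0; i + 8 <= n; i += 8)
        {
            __m512i val = _mm512_loadu_si512((const void *) (buf + i));

            accum = _mm512_add_epi64(accum, _mm512_popcnt_epi64(val));
        }
        total = _mm512_reduce_add_epi64(accum);

        /* handle the leftover words with the scalar builtin */
        for (; i < n; i++)
            total += __builtin_popcountll(buf[i]);
        return total;
    }
)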

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-03-15 Thread Bertrand Drouvot
Hi,

On Thu, Mar 14, 2024 at 02:22:44AM +, Zhijie Hou (Fujitsu) wrote:
> Hi,
> 
> Since the standby_slot_names patch has been committed, I am attaching the last
> doc patch for review.
> 

Thanks!

1 ===

+   continue subscribing to publications now on the new primary server without
+   any data loss.

I think "without any data loss" should be re-worded in this context. Data loss
in the sense "data committed on the primary and not visible on the subscriber in
case of failover" can still occur (in case synchronous replication is not 
used).

2 ===

+   If the result (failover_ready) of both above steps is
+   true, existing subscriptions will be able to continue without data loss.
+  

I don't think that's true if synchronous replication is not used. Say,

- synchronous replication is not used
- primary is not able to reach the standby anymore and standby_slot_names is set
- new data is inserted into the primary
- then not replicated to subscriber (due to standby_slot_names)

Then I think both of the above steps will return true, but data would be
lost in case of failover.
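
(For concreteness, the risky configuration described above corresponds
to something like this on the primary, with a hypothetical slot name:

    standby_slot_names = 'physical_standby_slot'  # logical walsenders wait for this standby
    synchronous_standby_names = ''                # i.e. asynchronous replication
)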

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-15 Thread Tom Lane
Dean Rasheed  writes:
> One thing that concerns me about making even greater use of "$n" is
> the potential for confusion with generic plan parameters.

True.

> Another possibility is to put the SubPlan and InitPlan names inline,
> rather than outputting "FROM SubPlan ...". I had a go at hacking that
> up and this was the result:

>Output: (($3 = (InitPlan 1).$0) AND ($4 = (InitPlan 1).$1) AND
> ((($1 = (SubPlan 2).$3) AND ($2 = (SubPlan 2).$4

Hmm.  I guess what bothers me about that is that it could be read to
suggest that the initplan or subplan is evaluated again for each
output parameter.  Perhaps it'll be sufficiently clear as long as
we keep the labeling

>InitPlan 1 (returns $0,$1)
>SubPlan 2 (returns $3,$4)

but I'm not sure.  Anybody else have an opinion?

(I didn't read your changes to the code yet --- I think at this
point we can just debate proposed output without worrying about
how to implement it.)

regards, tom lane




Re: POC: Extension for adding distributed tracing - pg_tracing

2024-03-15 Thread Aleksander Alekseev
Hi Anthonin,

> [...]
>
> The usual approach is to have pre-allocated memory. This must actually be 
> written (zeroed usually) or it might be lazily allocated only on page fault. 
> And it can't be copy on write memory allocated once in postmaster startup 
> then inherited by fork.
>
> That imposes an overhead for every single postgres backend. So maybe there's 
> a better solution.
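
(The lazy-allocation point in a nutshell; a generic sketch:

    #include <stdlib.h>
    #include <string.h>

    #define RESERVED_BYTES (1024 * 1024)

    char *buf = malloc(RESERVED_BYTES);  /* may only reserve address space */

    memset(buf, 0, RESERVED_BYTES);      /* touching every page forces physical
                                          * allocation, and un-shares any
                                          * copy-on-write pages inherited from
                                          * a fork()ed parent */
)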

It looks like the patch rotted a bit, see cfbot. Could you please
submit an updated version?

Best regards,
Aleksander Alekseev (wearing co-CFM hat)




RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
> -----Original Message-----
> From: Nathan Bossart
> Sent: Friday, March 15, 2024 8:06 AM
> To: Amonson, Paul D
> Cc: Andres Freund; Alvaro Herrera; Shankaran, Akash; Noah Misch;
> Tom Lane; Matthias van de Meent; pgsql-hack...@lists.postgresql.org
> Subject: Re: Popcount optimization using AVX512
> 
> Which test suite did you run?  Those numbers seem potentially
> indistinguishable from noise, which probably isn't great for such a large 
> patch
> set.

I ran...
psql -c "select bitcount(column) from table;"
...in a loop with "column" widths of 84, 4096, 8192, and 16384 containing 
random data. The DB has 1 million rows.  In the loop before calling the 
select I have code to clear all system caches. If I omit the code to clear 
system caches the margin of error remains the same but the improvement percent 
changes from 1.2% to 14.6% (much less I/O when cached data is available).
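
(Paul's exact cache-clearing procedure isn't shown; a common Linux
equivalent, run as root between iterations, is:

    sync
    echo 3 > /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes
)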

> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
> 
>   time psql postgres -c 'select drive_popcount(1000, 1024)'
> 
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the tests 
> a
> couple of times because I had a difficult time believing the amount of
> improvement.)

When I tested the code outside postgres in a micro benchmark I got 200-300% 
improvements. Your results are interesting, as it implies more than 300% 
improvement. Let me do some research on the benchmark you referenced. However, 
in all cases it seems that there is no regression so should we move forward on 
merging while I run some more local tests?

Thanks,
Paul





Re: Reports on obsolete Postgres versions

2024-03-15 Thread Michael Banck
On Fri, Mar 15, 2024 at 11:17:53AM +0100, Daniel Gustafsson wrote:
> > On 14 Mar 2024, at 16:48, Peter Eisentraut  wrote:
> > One could instead, for example, describe those as "maintenance releases":
> 
> That might indeed be a better name for what we provide.

+1




Re: Statistics Import and Export

2024-03-15 Thread Corey Huinker
>
> ANALYZE takes out one lock StatisticRelationId per relation, not per
> attribute like we do now. If we didn't release the lock after every
> attribute, and we only called the function outside of a larger transaction
> (as we plan to do with pg_restore) then that is the closest we're going to
> get to being consistent with ANALYZE.
>

v9 attached. This adds pg_dump support. It works in tests against existing
databases such as dvdrental, though I was surprised at how few indexes have
attribute stats there.

Statistics are preserved by default, but this can be disabled with the
option --no-statistics. This follows the prevailing option pattern in
pg_dump, etc.

There are currently several failing TAP tests around
pg_dump/pg_restore/pg_upgrade. I'm looking at those, but in the
meantime I'm seeking feedback on the progress so far.
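
For reference, a minimal usage sketch based on the signatures in the
pg_proc.dat entries below; the table name and values are hypothetical:

    SELECT pg_set_relation_stats('public.mytable'::regclass,
                                 reltuples => 1000000::float4,
                                 relpages => 12345,
                                 relallvisible => 12000);

And the dump-side switch: pg_dump --no-statistics mydb.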
From bf291e323fc01215264d41b75d579c11bd22d2ec Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v9 1/2] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics from the source database to the
target. This would be done instead of vacuumdb --analyze-in-stages.

Both functions take an oid to identify the target relation that will
receive the statistics. There is nothing requiring that relation to be
the same one as the one exported, though the statistics would have to
make sense in the context of the new relation. Typecasts for stavaluesN
parameters may fail if the destination column is not of the same type as
the source column.

The parameters of pg_set_attribute_stats identify the attribute by name
rather than by attnum. This is intentional because the column order may
be different in situations other than binary upgrades. For example,
dropped columns do not carry over in a restore.

The statistics imported by pg_set_attribute_stats are imported
transactionally like any other operation. However, pg_set_relation_stats
does it's update in-place, which is to say non-transactionally. This is
in line with what ANALYZE does to avoid table bloat in pg_class.

These functions also allow for tweaking of table statistics in-place,
allowing the user to inflate rowcounts, skew histograms, etc, to see
what those changes will evoke from the query planner.
---
 src/include/catalog/pg_proc.dat   |  15 +
 src/include/statistics/statistics.h   |   2 +
 src/backend/statistics/Makefile   |   3 +-
 src/backend/statistics/meson.build|   1 +
 src/backend/statistics/statistics.c   | 410 ++
 .../regress/expected/stats_export_import.out  | 211 +
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/stats_export_import.sql  | 198 +
 doc/src/sgml/func.sgml|  95 
 9 files changed, 935 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/statistics/statistics.c
 create mode 100644 src/test/regress/expected/stats_export_import.out
 create mode 100644 src/test/regress/sql/stats_export_import.sql

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf7b..a726451a6f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12170,4 +12170,19 @@
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },
 
+# Statistics Import
+{ oid => '8048',
+  descr => 'set statistics on relation',
+  proname => 'pg_set_relation_stats', provolatile => 'v', proisstrict => 't',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid float4 int4 int4',
+  proargnames => '{relation,reltuples,relpages,relallvisible}',
+  prosrc => 'pg_set_relation_stats' },
+{ oid => '8049',
+  descr => 'set statistics on attribute',
+  proname => 'pg_set_attribute_stats', provolatile => 'v', proisstrict => 'f',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid name bool float4 int4 float4 int2 int2 int2 int2 int2 _float4 _float4 _float4 _float4 _float4 text text text text text',
+  proargnames => '{relation,attname,stainherit,stanullfrac,stawidth,stadistinct,stakind1,stakind2,stakind3,stakind4,stakind5,stanumbers1,stanumbers2,stanumbers3,stanumbers4,stanumbers5,stavalues1,stavalues2,stavalues3,stavalues4,stavalues5}',
+  prosrc => 'pg_set_attribute_stats' },
 ]
diff --git a/src/include/statistics/statistics.h b/src/include/statistics/statistics.h
index 7f2bf18716..1dddf96576 100644
--- a/src/include/statistics/statistics.h
+++ b/src/include/statistics/statistics.h
@@ -127,4 +127,6 @@ extern StatisticExtInfo *choose_best_statistics(List *stats, char requiredkind,
 int nclauses);
 extern HeapTuple statext_expressions_load(Oid stxoid, bool inh, int idx);
 
+extern Datum pg_set_relation_stats(PG_FUNCTION_ARGS);
+extern Datum pg_set_attribute_stats(PG_FUNCTION_ARGS);
 #endif			/* STATISTICS_H */
diff --git 

Re: Make COPY format extendable: Extract COPY TO format implementations

2024-03-15 Thread Sutou Kouhei
Hi,

In 
  "Re: Make COPY format extendable: Extract COPY TO format implementations" on 
Wed, 13 Mar 2024 16:00:46 +0800,
  jian he  wrote:

>> More use cases will help us decide which callbacks are needed. We
>> will be able to collect more use cases by providing basic
>> callbacks.

> Let's say the customized format is "csv1", but it is just analogous to
> the csv format.
> People should be able to create an extension, with several C functions,
> then they can do `copy (select 1 ) to stdout (format 'csv1');`
> but the output will be exact same as `copy (select 1 ) to stdout
> (format 'csv');`

Thanks for sharing one use case but I think that we need
real-world use cases to consider our APIs.

For example, JSON support that is currently discussing in
another thread is a real-world use case. My Apache Arrow
support is also another real-world use case.


Thanks,
-- 
kou




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Alvaro Herrera
On 2024-Mar-14, Tom Lane wrote:

> Michael Paquier  writes:
> > Hmm.  I am not sure how much protection this would offer, TBH.  One
> > thing that I find annoying with common/stringinfo.c as it is currently
> > is that we have two exit() calls in the enlarge path, and it does not
> > seem wise to me to spread that even more.

> I hope nobody is expecting that such code will get accepted.  We have
> a policy (and an enforcement mechanism) that libpq.so must not call
> exit().  OAuth code in libpq will need to cope with using pqexpbuffer.

FWIW that's exactly what Jacob's OAUTH patch does -- it teaches the
relevant JSON parser code to use pqexpbuffer when in frontend
environment, and continues to use StringInfo in backend.  I find that a
bit cumbersome, but if the idea of changing the StringInfo behavior
(avoiding exit()) is too radical, then perhaps it's better that we go
with Jacob's proposal in the other thread:

+/*
+ * In backend, we will use palloc/pfree along with StringInfo.  In frontend, use
+ * malloc and PQExpBuffer, and return JSON_OUT_OF_MEMORY on out-of-memory.
+ */
+#ifdef FRONTEND
+
+#define STRDUP(s) strdup(s)
+#define ALLOC(size) malloc(size)
+
+#define appendStrVal   appendPQExpBuffer
+#define appendBinaryStrVal  appendBinaryPQExpBuffer
+#define appendStrValChar   appendPQExpBufferChar
+#define createStrVal   createPQExpBuffer
+#define resetStrValresetPQExpBuffer
+#define destroyStrVal  destroyPQExpBuffer
+
+#else  /* !FRONTEND */
+
+#define STRDUP(s) pstrdup(s)
+#define ALLOC(size) palloc(size)
+
+#define appendStrVal   appendStringInfo
+#define appendBinaryStrVal  appendBinaryStringInfo
+#define appendStrValChar   appendStringInfoChar
+#define createStrVal   makeStringInfo
+#define resetStrValresetStringInfo
+#define destroyStrVal  destroyStringInfo
+
+#endif
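
To make this concrete (illustrative only, not from Jacob's patch): with
these macros, shared parser code can be written once and compile in both
environments.  "StrValType" below stands in for a companion typedef
(PQExpBuffer in frontend, StringInfo in backend) that the quoted hunk
does not show.

/*
 * Illustrative sketch only.  StrValType is assumed to map to
 * PQExpBufferData in FRONTEND builds and StringInfoData in backend
 * builds; it is not part of the quoted hunk.
 */
static JsonParseErrorType
record_error_detail(StrValType **detail, const char *token)
{
	*detail = createStrVal();
	if (*detail == NULL)		/* only possible in FRONTEND builds */
		return JSON_OUT_OF_MEMORY;
	appendStrVal(*detail, "malformed token \"%s\"", token);
	return JSON_SUCCESS;
}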


-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Ni aún el genio muy grande llegaría muy lejos
si tuviera que sacarlo todo de su propio interior" (Goethe)




Re: Possibility to disable `ALTER SYSTEM`

2024-03-15 Thread Jelte Fennema-Nio
On Thu, 14 Mar 2024 at 22:15, Maciek Sakrejda  wrote:
> In this case, the end user with access to Postgres superuser
> privileges presumably also has access to the outside configuration
> mechanism. The goal is not to prevent them from changing settings, but
> to offer guard rails that prevent them from changing settings in a way
> that will be unstable (revertible by a future update) or confusing
> (not showing up in a management UI).

Great explanation! Attached is a much-changed patch that updates the
docs and code to reflect this. I particularly liked your use of the
phrase "guard rail", as it reflects the intent of the feature very well
IMO. So I included that wording in both the GUC group and the error
code.
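
For the archives, the guard itself can be as small as the following
sketch in AlterSystemSetConfigFile(); it assumes a boolean GUC named
enable_alter_system (inferred from the patch file name), and the actual
wording and error code in the patch may differ:

	/* Sketch only: reject ALTER SYSTEM when the guard rail is up. */
	if (!enable_alter_system)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("ALTER SYSTEM is currently disabled"),
				 errhint("The server's configuration is managed externally; "
						 "see enable_alter_system.")));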


v3-0001-Add-enable_alter_system-GUC.patch
Description: Binary data


Re: Reports on obsolete Postgres versions

2024-03-15 Thread Daniel Gustafsson
> On 14 Mar 2024, at 16:48, Peter Eisentraut  wrote:
> On 13.03.24 18:12, Bruce Momjian wrote:

>> I think "minor" is a better term since it contrasts with "major".  We
>> don't actually supply patches to upgrade minor versions.
> 
> There are potentially different adjectives that could apply to "version" and 
> "release".
> 
> The version numbers can be called major and minor, because that just 
> describes their ordering and significance.
> 
> But I do agree that "minor release" isn't quite as clear, because one could 
> also interpret that as "a release, but a bit smaller this time". (Also might 
> not translate well, since "minor" and "small" could translate to the same 
> thing.)

Some of the user confusion likely stems from us using the same nomenclature as
SemVer, but for different things.  SemVer has become very widely adopted, to
the point where it's almost assumed by many, so maybe we need to explicitly
state that we *don't* use SemVer (we don't mention that anywhere in the docs or
on the website).

> One could instead, for example, describe those as "maintenance releases":

That might indeed be a better name for what we provide.

--
Daniel Gustafsson





Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread shveta malik
On Thu, Mar 14, 2024 at 7:58 PM Bharath Rupireddy
 wrote:
>
> While we wait to hear from others on this, I'm attaching the v9 patch
> set implementing the above idea (check 0001 patch). Please have a
> look. I'll come back to the other review comments soon.
>

patch002:

1)
I would like to understand the purpose of 'inactive_count'. Is it only
there for users' monitoring purposes? We are not using it anywhere
internally.

I shut down the instance 5 times and found that 'inactive_count' became
5 for all the slots created on that instance. Is this intentional? I
mean, we cannot really use them if the instance is down.  I felt it
should increment inactive_count only if, during the uptime of the
instance, they were actually inactive, i.e. no streaming or replication
happening through them.


2)
slot.c:
+ case RS_INVAL_XID_AGE:
+ {
+ if (TransactionIdIsNormal(s->data.xmin))
+ {
+  ..
+ }
+ if (TransactionIdIsNormal(s->data.catalog_xmin))
+ {
+  ..
+ }
+ }

Can we optimize this code? It has duplicate code for processing
s->data.catalog_xmin and s->data.xmin. Can we create a sub-function
for this purpose and call it twice here?
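
Something like this rough sketch, perhaps (the helper name and the
cutoff handling here are mine, not taken from the patch):

/*
 * Sketch of the suggested helper; called once for xmin and once for
 * catalog_xmin.
 */
static bool
SlotXidOlderThanCutoff(TransactionId xid, TransactionId cutoff)
{
	return TransactionIdIsNormal(xid) &&
		TransactionIdPrecedes(xid, cutoff);
}

...
case RS_INVAL_XID_AGE:
	if (SlotXidOlderThanCutoff(s->data.xmin, cutoff) ||
		SlotXidOlderThanCutoff(s->data.catalog_xmin, cutoff))
		invalidation_cause = cause;		/* assumed variable names */
	break;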

thanks
Shveta




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-15 Thread John Naylor
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 6:55 PM John Naylor  wrote:
> >
> > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor  
> > > wrote:
> > > > Okay, here's an another idea: Change test_lookup_tids() to be more
> > > > general and put the validation down into C as well. First we save the
> > > > blocks from do_set_block_offsets() into a table, then with all those
> > > > blocks lookup a sufficiently-large range of possible offsets and save
> > > > found values in another array. So the static items structure would
> > > > have 3 arrays: inserts, successful lookups, and iteration (currently
> > > > the iteration output is private to check_set_block_offsets(). Then
> > > > sort as needed and check they are all the same.
> > >
> > > That's a promising idea. We can use the same mechanism for randomized
> > > tests too. If you're going to work on this, I'll do other tests on my
> > > environment in the meantime.
> >
> > Some progress on this in v72 -- I tried first without using SQL to
> > save the blocks, just using the unique blocks from the verification
> > array. It seems to work fine.
>
> Thanks!

Seems I forgot the attachment last time...there's more stuff now
anyway, based on discussion.

> > - Since there are now three arrays we should reduce max bytes to
> > something smaller.
>
> Agreed.

I went further than this, see below.

> > - Further on that, I'm not sure if the "is full" test is telling us
> > much. It seems we could make max bytes a static variable and set it to
> > the size of the empty store. I'm guessing it wouldn't take much to add
> > enough tids so that the contexts need to allocate some blocks, and
> > then it would appear full and we can test that. I've made it so all
> > arrays repalloc when needed, just in case.
>
> How about using work_mem as max_bytes instead of having it as a static
> variable? In test_tidstore.sql we set work_mem before creating the
> tidstore. It would make the tidstore more controllable by SQL queries.

My complaint is that the "is full" test is trivial, and also strange
in that max_bytes is used for two unrelated things:

- the initial size of the verification arrays, which was always larger
than necessary, and now there are three of them
- the hint to TidStoreCreate to calculate its max block size / the
threshold for being "full"

To make the "is_full" test slightly less trivial, my idea is to save
the empty store size and later add enough tids so that it has to
allocate new blocks/DSA segments, which is not that many, and then it
will appear full. I've done this and also separated the purpose of
various sizes in v72-0009/10.

Making this work with actual work_mem seems a bit more difficult.

> > - I'm not sure it's useful to keep test_lookup_tids() around. Since we
> > now have a separate lookup test, the only thing it can tell us is that
> > lookups fail on an empty store. I arranged it so that
> > check_set_block_offsets() works on an empty store. Although that's
> > even more trivial, it's just reusing what we already need.
>
> Agreed.

Removed in v72-0007

On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada  wrote:
>
> I have two questions on tidstore.c:
>
> +/*
> + * Set the given TIDs on the blkno to TidStore.
> + *
> + * NB: the offset numbers in offsets must be sorted in ascending order.
> + */
>
> Do we need some assertions to check if the given offset numbers are
> sorted expectedly?

Done in v72-0008
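
For reference, the added check can be as cheap as this sketch (what
v72-0008 actually does may differ in detail):

	/* Assert offsets[] is in strictly ascending order, per the API
	 * contract stated in the comment above. */
#ifdef USE_ASSERT_CHECKING
	for (int i = 1; i < num_offsets; i++)
		Assert(offsets[i] > offsets[i - 1]);
#endif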

> ---
> +   if (TidStoreIsShared(ts))
> +   found = shared_rt_set(ts->tree.shared, blkno, page);
> +   else
> +   found = local_rt_set(ts->tree.local, blkno, page);
> +
> +   Assert(!found);
>
> Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> overwrite) the value, I think we should not expect that found is
> always false.

I find that a puzzling statement, since 1) it was designed for
insert-only workloads, not actual overwrite IIRC and 2) the tests will
now fail if the same block is set twice, since we just switched the
tests to use a remnant of vacuum's old array. Having said that, I
don't object to removing artificial barriers to using it for purposes
not yet imagined, as long as test_tidstore.sql warns against that.

Given the above two things, I think this function's comment needs
stronger language about its limitations. Perhaps even mention that
it's intended for, and optimized for, vacuum. You and I have long
known that tidstore would need a separate, more complex, function to
add or remove individual tids from existing entries, but it might be
good to have that documented.

Other things:

v72-0011: Test that zero offset raises an error.

v72-0013: I had wanted to microbenchmark this, but since we are
running short of time I decided to skip that, so I want to revert some
code to make it again more similar to the equivalent in tidbitmap.c.
In the absence of evidence, it seems better to do it this way.



Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Daniel Gustafsson
> On 15 Mar 2024, at 01:10, Michael Paquier  wrote:
> 
> On Thu, Mar 14, 2024 at 10:56:46AM +0100, Daniel Gustafsson wrote:
>> +/* don't allow destroys of read-only StringInfos */
>> +Assert(str->maxlen != 0);
>> Considering that StringInfo.c don't own the memory here I think it's 
>> warranted
>> to turn this assert into an elog() to avoid the risk of use-after-free bugs.
> 
> Hmm.  I am not sure how much protection this would offer, TBH.

I can't see how refusing to free memory owned and controlled by someone else,
and throwing an error if attempted, wouldn't be a sound defensive programming
measure.

--
Daniel Gustafsson





Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 13:09, Heikki Linnakangas wrote:

On 15/03/2024 01:13, Tom Lane wrote:

Michael Paquier  writes:

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
less times.


No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.


I committed a patch to do that, to put out the fire.


That's turning the buildfarm quite red. Many, but not all animals are 
failing like this:



--- 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/expected/injection_points.out
 2024-03-15 12:41:16.363286975 +0100
+++ 
/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/src/test/modules/injection_points/results/injection_points.out
  2024-03-15 12:53:11.528159615 +0100
@@ -1,118 +1,111 @@
 CREATE EXTENSION injection_points;
+ERROR:  extension "injection_points" is not available
+DETAIL:  Could not open extension control file 
"/home/buildfarm/hippopotamus/buildroot/HEAD/pgsql.build/tmp_install/home/buildfarm/hippopotamus/buildroot/HEAD/inst/share/postgresql/extension/injection_points.control":
 No such file or directory.
+HINT:  The extension must first be installed on the system where PostgreSQL is 
running.
... 


Looks like adding NO_INSTALLCHECK somehow affected how the modules are 
installed in tmp_install. I'll investigate...


--
Heikki Linnakangas
Neon (https://neon.tech)




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Daniel Gustafsson
> On 15 Mar 2024, at 09:38, Alvaro Herrera  wrote:
> 
> On 2024-Mar-14, Tom Lane wrote:
> 
>> Michael Paquier  writes:
>>> Hmm.  I am not sure how much protection this would offer, TBH.  One
>>> thing that I find annoying with common/stringinfo.c as it is currently
>>> is that we have two exit() calls in the enlarge path, and it does not
>>> seem wise to me to spread that even more.
> 
>> I hope nobody is expecting that such code will get accepted.  We have
>> a policy (and an enforcement mechanism) that libpq.so must not call
>> exit().  OAuth code in libpq will need to cope with using pqexpbuffer.
> 
> FWIW that's exactly what Jacob's OAUTH patch does -- it teaches the
> relevant JSON parser code to use pqexpbuffer when in frontend
> environment, and continues to use StringInfo in backend.  I find that a
> bit cumbersome, but if the idea of changing the StringInfo behavior
> (avoiding exit()) is too radical, then perhaps it's better that we go
> with Jacob's proposal in the other thread:

Correct, the OAuth work does not make any claims to use StringInfo in libpq.
My understanding of this thread was to make it use StringInfo for now, to get
this available for frontend binaries and reduce scope here, and later
change it up if/when the OAuth patch lands.

--
Daniel Gustafsson





Re: Weird test mixup

2024-03-15 Thread Heikki Linnakangas

On 15/03/2024 01:13, Tom Lane wrote:

Michael Paquier  writes:

Or we could just disable runningcheck because of the concurrency
requirement in this test.  The test would still be able to run, just
less times.


No, actually we *must* mark all these tests NO_INSTALLCHECK if we
stick with the current definition of injection points.  The point
of installcheck mode is that the tests are supposed to be safe to
run in a live installation.  Side-effects occurring in other
databases are completely not OK.


I committed a patch to do that, to put out the fire.

--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Possibility to disable `ALTER SYSTEM`

2024-03-15 Thread Jelte Fennema-Nio
On Fri, 15 Mar 2024 at 11:08, Daniel Gustafsson  wrote:
> Another quirk for the documentation of this: if I disable ALTER SYSTEM I would
> assume that postgresql.auto.conf is no longer consumed, but it still is (and
> still needs to be), so maybe "enable/disable" is the wrong choice of words?

Updated the docs to reflect this quirk. But I kept the same name for
the GUC for now, because I couldn't come up with a better name myself.
If someone suggests a better name, I'm happy to change it though.


v4-0001-Add-enable_alter_system-GUC.patch
Description: Binary data


Re: Add bump memory context type and use it for tuplesorts

2024-03-15 Thread Tomas Vondra
On 3/15/24 03:21, David Rowley wrote:
> On Tue, 12 Mar 2024 at 23:57, Tomas Vondra
>  wrote:
>> Attached is an updated version of the mempool patch, modifying all the
>> memory contexts (not just AllocSet), including the bump context. And
>> then also PDF with results from the two machines, comparing results
>> without and with the mempool. There's very little impact on small reset
>> values (128kB, 1MB), but pretty massive improvements on the 8MB test
>> (where it's a 2x improvement).
> 
> I think it would be good to have something like this.  I've done some
> experiments before with something like this [1]. However, mine was
> much more trivial.
> 

Interesting. My thing is a bit more complex because it was not meant to
be a cache initially, but more a way to limit the memory allocated by a
backend (discussed in [1]), or perhaps even a smaller part of a plan.

I only added the caching after I ran into some bottlenecks [2], and
malloc turned out to be a scalability issue.

> One thing my version did was get rid of the *context* freelist stuff
> in aset.  I wondered if we'd need that anymore as, if I understand
> correctly, it's just there to stop malloc/free thrashing, which is
> what the patch aims to do anyway. Aside from that, it's now a little
> weird that aset.c has that but generation.c and slab.c do not.
> 

True. I think the "memory pool" shared by all memory contexts would be a
more principled way to do this - not only does it work for all memory
context types, but it's also part of the "regular" cache eviction like
everything else (which the context freelist is not).

> One thing I found was that in btbeginscan(), we have "so =
> (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));", which on this
> machine is 27344 bytes and results in a call to AllocSetAllocLarge()
> and therefore a malloc().  Ideally, there'd be no malloc() calls in a
> standard pgbench run, at least once the rel and cat caches have been
> warmed up.
> 

Right. That's exactly what I found in [2], where it's a massive problem
with many partitions and many concurrent connections.

> I think there are a few things in your patch that could be improved,
> here's a quick review.
> 
> 1. MemoryPoolEntryIndex() could follow the lead of
> AllocSetFreeIndex(), which is quite well-tuned and has no looping. I
> think you can get rid of MemoryPoolEntrySize() and just have
> MemoryPoolEntryIndex() round up to the next power of 2.
> 
> 2. The following could use  "result = Min(MEMPOOL_MIN_BLOCK,
> pg_nextpower2_size_t(size));"
> 
> + * should be very low, though (less than MEMPOOL_SIZES, i.e. 14).
> + */
> + result = MEMPOOL_MIN_BLOCK;
> + while (size > result)
> + result *= 2;
> 
> 3. "MemoryPoolFree": I wonder if this is a good name for such a
> function.  Really you want to return it to the pool. "Free" sounds
> like you're going to free() it.  I went for "Fetch" and "Release"
> which I thought was less confusing.
> 
> 4. MemoryPoolRealloc(), could this just do nothing if the old and new
> indexes are the same?
> 
> 5. It might be good to put a likely() around this:
> 
> + /* only do this once every MEMPOOL_REBALANCE_DISTANCE allocations */
> + if (pool->num_requests < MEMPOOL_REBALANCE_DISTANCE)
> + return;
> 
> Otherwise, if that function is inlined then you'll bloat the functions
> that inline it for not much good reason.  Another approach would be to
> have a static inline function which checks and calls a noinline
> function that does the work so that the rebalance stuff is never
> inlined.
> 

Yes, I agree with all of that. I was a bit lazy when doing the PoC, so I
ignored these things.
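
On point 2 specifically, note that the loop being replaced rounds the
size *up*, so Max rather than Min seems to be the intent; roughly this
sketch, assuming MEMPOOL_MIN_BLOCK is a power of two:

	/* round up to the next power of 2, but never below the minimum */
	result = Max(MEMPOOL_MIN_BLOCK, pg_nextpower2_size_t(size));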

> Overall, I wonder if the rebalance stuff might make performance
> testing quite tricky.  I see:
> 
> +/*
> + * How often to rebalance the memory pool buckets (number of allocations).
> + * This is a tradeoff between the pool being adaptive and more overhead.
> + */
> +#define MEMPOOL_REBALANCE_DISTANCE 25000
> 
> Will TPS take a sudden jump after 25k transactions doing the same
> thing?  I'm not saying this shouldn't happen, but... benchmarking is
> pretty hard already. I wonder if there's something more fine-grained
> that can be done which makes the pool adapt faster but not all at
> once.  (I've not studied your algorithm for the rebalance.)
> 

I don't think so, or at least I haven't observed anything like that. My
intent was to make the rebalancing fairly frequent but incremental, with
each increment doing only a tiny amount of work.

It does not do any more malloc/free calls than without the cache - it
may just delay them a bit, and the assumption is the rest of the
rebalancing (walking the size slots and adjusting counters based on
activity since the last run) is super cheap.

So it shouldn't be the case that the rebalancing is so expensive to
cause a measurable drop in throughput, or something like that. I can
imagine spreading it even more (doing it in smaller steps), but on the
other hand the interval must not be too short - we need to do 

Re: Disabling Heap-Only Tuples

2024-03-15 Thread Matthias van de Meent
On Wed, 13 Mar 2024 at 14:27, Laurenz Albe  wrote:
>
> On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> > I think a minimal working approach could be to have the configuration be 
> > based
> > on the relation size vs space known to the FSM. If the target block of an
> > update is higher than ((relation_size - fsm_free_space) *
> > new_reloption_or_guc), try finding the target block via the FSM, even if
> > there's space on the page.
>
> That sounds like a good way forward.
>
> The patch is in state "needs review", but it got review.  I'll change it to
> "waiting for author".

Then I'll withdraw this patch as I don't currently have (nor expect to
have anytime soon) the bandwidth or expertise to rewrite this patch to
include a system that calculates the free space available in a
relation.

I've added a TODO item in the UPDATE section with a backlink to this
thread so the discussion isn't lost.

-Matthias




Re: speed up a logical replica setup

2024-03-15 Thread Amit Kapila
On Fri, Mar 15, 2024 at 9:23 AM Euler Taveira  wrote:
>

Did you consider adding options for publication/subscription/slot
names as mentioned in my previous email? As discussed in a few emails
above, it would be quite confusing for users to identify the logical
replication objects once the standby is converted to subscriber.

*
+static void
+cleanup_objects_atexit(void)
{
...
conn = connect_database(dbinfo[i].pubconninfo, false);
+ if (conn != NULL)
+ {
+ if (dbinfo[i].made_publication)
+ drop_publication(conn, &dbinfo[i]);
+ if (dbinfo[i].made_replslot)
+ drop_replication_slot(conn, &dbinfo[i], dbinfo[i].subname);
+ disconnect_database(conn, false);
+ }
+ else
+ {
+ /*
+ * If a connection could not be established, inform the user
+ * that some objects were left on primary and should be
+ * removed before trying again.
+ */
+ if (dbinfo[i].made_publication)
+ {
+ pg_log_warning("There might be a publication \"%s\" in database \"%s\" on primary",
+dbinfo[i].pubname, dbinfo[i].dbname);
+ pg_log_warning_hint("Consider dropping this publication before trying again.");
+ }

It seems we care only about publications created on the primary. Isn't
it possible that some of the publications have been replicated to the
standby by that time, for example, in case a failure happens after
creating a few publications? IIUC, we don't care about standby cleanup
after failure because it can't be used for streaming replication
anymore. So, the only choice the user has is to recreate the standby
and start the pg_createsubscriber again. This sounds questionable to
me as to whether users would like this behavior. Does anyone else have
an opinion on this point?

I see the below note in the patch:
+If pg_createsubscriber fails while processing,
+then the data directory is likely not in a state that can be recovered. It
+is true if the target server was promoted. In such a case, creating a new
+standby server is recommended.

From reading this, it is not completely clear whether the standby is
unrecoverable after any error or only after an error that occurs once
the target server has been promoted. If others agree with this behavior
then we should write the detailed reason for this somewhere in the
comments as well, unless it is already explained.

-- 
With Regards,
Amit Kapila.




Re: meson: Specify -Wformat as a common warning flag for extensions

2024-03-15 Thread Sutou Kouhei
Hi,

In <49e97fd0-c17e-4cbc-aeee-80ac51400...@eisentraut.org>
  "Re: meson: Specify -Wformat as a common warning flag for extensions" on Wed, 
13 Mar 2024 08:38:28 +0100,
  Peter Eisentraut  wrote:

> I think the fix then is to put -Wall into CFLAGS in
> Makefile.global. Looking at a diff of Makefile.global between an
> autoconf and a meson build, I also see that under meson, CFLAGS
> doesn't get -O2 -g (or similar, depending on settings).  This
> presumably has the same underlying issue that meson handles those
> flags internally.
> 
> For someone who wants to write a fix for this, the relevant variable
> is var_cflags in our meson scripts.  And var_cxxflags as well.

How about the attached v4 patch?


Thanks,
-- 
kou
From a515720338dc49e764f598021200316c6d01ddf9 Mon Sep 17 00:00:00 2001
From: Sutou Kouhei 
Date: Fri, 15 Mar 2024 18:27:30 +0900
Subject: [PATCH v4] meson: Restore implicit warning/debug/optimize flags for
 extensions

Meson specifies warning/debug/optimize flags such as "-Wall", "-g" and
"-O2" automatically based on "--warnlevel" and "--buildtype"
options. And we use "--warning_level=1" and
"--buildtype=debugoptimized" by default.

We don't specify warning/debug/optimize flags explicitly when building
PostgreSQL with Meson, because Meson adds them automatically as
described above. But Meson doesn't put those flags into Makefile.global
and pg_config, so we need to take care of them manually.

This change does that. It detects warning/debug/optimize flags based on
the warning_level/debug/optimization option values, because Meson
doesn't tell us which flags it guessed.
---
 meson.build | 40 
 src/include/meson.build |  4 ++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/meson.build b/meson.build
index c8fdfeb0ec..3c5c449a0a 100644
--- a/meson.build
+++ b/meson.build
@@ -1824,6 +1824,46 @@ endif
 vectorize_cflags = cc.get_supported_arguments(['-ftree-vectorize'])
 unroll_loops_cflags = cc.get_supported_arguments(['-funroll-loops'])
 
+# They aren't used for building PostgreSQL itself because Meson does
+# everything internally. They are used by extensions via pg_config or
+# Makefile.global.
+common_builtin_flags = []
+
+warning_level = get_option('warning_level')
+# See https://mesonbuild.com/Builtin-options.html#details-for-warning_level for
+# warning_level values.
+if warning_level == '1'
+  common_builtin_flags += ['-Wall', '/W2']
+elif warning_level == '2'
+  common_builtin_flags += ['-Wall', '-Wextra', '/W3']
+elif warning_level == '3'
+  common_builtin_flags += ['-Wall', '-Wextra', '-Wpedantic', '/W4']
+elif warning_level == 'everything'
+  common_builtin_flags += ['-Weverything', '/Wall']
+endif
+
+if get_option('debug')
+  common_builtin_flags += ['-g']
+endif
+
+optimization = get_option('optimization')
+if optimization == '0'
+  common_builtin_flags += ['-O0']
+elif optimization == '1'
+  common_builtin_flags += ['-O1']
+elif optimization == '2'
+  common_builtin_flags += ['-O2']
+elif optimization == '3'
+  common_builtin_flags += ['-O3']
+elif optimization == 's'
+  common_builtin_flags += ['-Os']
+endif
+
+cflags_builtin = cc.get_supported_arguments(common_builtin_flags)
+if llvm.found()
+  cxxflags_builtin = cpp.get_supported_arguments(common_builtin_flags)
+endif
+
 common_warning_flags = [
   '-Wmissing-prototypes',
   '-Wpointer-arith',
diff --git a/src/include/meson.build b/src/include/meson.build
index a28f115d86..58b7a9c1e7 100644
--- a/src/include/meson.build
+++ b/src/include/meson.build
@@ -44,9 +44,9 @@ config_paths_data.set_quoted('MANDIR', dir_prefix / dir_man)
 
 var_cc = ' '.join(cc.cmd_array())
 var_cpp = ' '.join(cc.cmd_array() + ['-E'])
-var_cflags = ' '.join(cflags + cflags_warn + get_option('c_args'))
+var_cflags = ' '.join(cflags + cflags_builtin + cflags_warn + get_option('c_args'))
 if llvm.found()
-  var_cxxflags = ' '.join(cxxflags + cxxflags_warn + get_option('cpp_args'))
+  var_cxxflags = ' '.join(cxxflags + cxxflags_builtin + cxxflags_warn + get_option('cpp_args'))
 else
   var_cxxflags = ''
 endif
-- 
2.43.0



Re: UUID v7

2024-03-15 Thread Aleksander Alekseev
Hi,

> > So maybe we don't need the _extract_variant function?
>
> I think it's the best possible solution. The variant has no value besides 
> detecting if a version can be extracted.

+1 to the idea. I doubt that anyone will miss it.

-- 
Best regards,
Aleksander Alekseev




Re: Possibility to disable `ALTER SYSTEM`

2024-03-15 Thread Daniel Gustafsson
> On 15 Mar 2024, at 03:58, Bruce Momjian  wrote:
> 
> On Thu, Mar 14, 2024 at 07:43:15PM -0400, Robert Haas wrote:
>> On Thu, Mar 14, 2024 at 5:15 PM Maciek Sakrejda  wrote:
>>> It's not a security feature: it's a usability feature.
>>> 
>>> It's a usability feature because, when Postgres configuration is
>>> managed by an outside mechanism (e.g., as in a Kubernetes
>>> environment), ALTER SYSTEM currently allows a superuser to make
>>> changes that appear to work, but may be discarded at some point in the
>>> future when that outside mechanism updates the config. They may also
>>> be represented incorrectly in a management dashboard if that dashboard
>>> is based on the values in the outside configuration mechanism, rather
>>> than values directly from Postgres.
>>> 
>>> In this case, the end user with access to Postgres superuser
>>> privileges presumably also has access to the outside configuration
>>> mechanism. The goal is not to prevent them from changing settings, but
>>> to offer guard rails that prevent them from changing settings in a way
>>> that will be unstable (revertible by a future update) or confusing
>>> (not showing up in a management UI).
>>> 
>>> There are challenges here in making sure this is _not_ seen as a
>>> security feature. But I do think the feature itself is sensible and
>>> worthwhile.
>> 
>> This is what I would have said if I'd tried to offer an explanation,
>> except you said it better than I would have done.
> 
> I do think the docs need to clearly say this is not a security feature.

A usability feature whose purpose is to guard against a superuser willingly
acting against how the system is managed, or not even knowing how it is
managed, does have a certain security-feature smell.  We've already had a few
CVEs filed against usability features, so I don't think Tom's fears are at all
ungrounded.

Another quirk for the documentation of this: if I disable ALTER SYSTEM I would
assume that postgresql.auto.conf is no longer consumed, but it still is (and
still needs to be), so maybe "enable/disable" is the wrong choice of words?

--
Daniel Gustafsson





Re: Inconsistent printf placeholders

2024-03-15 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:20:27 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> I checked for that kind of msgids in a bit more intensive way. The
> numbers are the line numbers in ja.po of backend. I didn't check the
> same for other modules.
> 
> > ###: invalid timeline %lld
> >@ backup/walsummaryfuncs.c:95
> >  invalid timeline %u
> >@ repl_gram.y:318 repl_gram.y:359

"The numbers are the line numbers in ja.po" is wrong. Correctly, it
should be written as:

The information consists of two lines: the first of them is the msgid,
and the second indicates the locations where they appear.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Be strict when request to flush past end of WAL in WaitXLogInsertionsToFinish

2024-03-15 Thread Bharath Rupireddy
Hi,

While working on [1], it was identified that
WaitXLogInsertionsToFinish emits a LOG message and adjusts the upto
pointer to proceed further when the caller requests a flush past the
end of generated WAL. There's a comment explaining that no caller
should ever do that intentionally, except in cases with bogus LSNs. For
a similar situation, XLogWrite emits a PANIC: "xlog write request %X/%X
is past end of log %X/%X". While there's no problem with
WaitXLogInsertionsToFinish emitting a LOG, why can't it be a bit
harsher and emit a PANIC, something like the attached, to detect the
corner case?

Thoughts?

[1] 
https://www.postgresql.org/message-id/b43615437ac7d7fdef86a36e5d5bf3fc049bc11b.camel%40j-davis.com

On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis  wrote:
>
> WaitXLogInsertionsToFinish() uses a LOG level message
> for the same situation. They should probably be the same log level, and
> I would think it would be either PANIC or WARNING. I have no idea why
> LOG was chosen.

[2]
/*
 * No-one should request to flush a piece of WAL that hasn't even been
 * reserved yet. However, it can happen if there is a block with a bogus
 * LSN on disk, for example. XLogFlush checks for that situation and
 * complains, but only after the flush. Here we just assume that to mean
 * that all WAL that has been reserved needs to be finished. In this
 * corner-case, the return value can be smaller than 'upto' argument.
 */
if (upto > reservedUpto)
{
ereport(LOG,
        (errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
                LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
upto = reservedUpto;
}
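
The attached patch isn't inlined above, but the proposed change
essentially raises the severity of that report, roughly like this
sketch:

	/* proposed: treat this as a can't-happen condition */
	if (upto > reservedUpto)
		ereport(PANIC,
				(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
						LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));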

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v1-0001-Be-strict-when-request-to-flush-past-end-of-WAL-in.patch
Description: Binary data


Re: Shared detoast Datum proposal

2024-03-15 Thread Nikita Malakhov
Hi!

Here's a slightly improved version of patch Tomas provided above (v2),
with cache invalidations and slices caching added, still as PoC.

The main issue I've encountered during tests is that when the same query
retrieves both slices (like substring(t,...)) and the full value, the
order of the values is not predictable: with text fields, substrings are
retrieved before the full value, so we cannot benefit from caching the
full value first and serving slices from the cached value.

Yet the cache code is still very compact and affects only sources related
to detoasting.

Tomas, about HASH_ENTER - according to the comments it could throw
an OOM error, so I've changed it to HASH_ENTER_NULL to avoid
new errors. In this case we would just leave the value uncached,
without raising an error.
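
In dynahash terms the change looks roughly like this (the identifier
names here are made up for illustration):

	/* with HASH_ENTER_NULL, an OOM simply means "don't cache" */
	entry = hash_search(toast_cache, &key, HASH_ENTER_NULL, &found);
	if (entry == NULL)
		return;			/* out of memory: skip caching this value */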

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


0001-toast-cache-v3.patch
Description: Binary data


Re: Inconsistent printf placeholders

2024-03-15 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:01:28 +1300, David Rowley  wrote 
in 
> On Fri, 15 Mar 2024 at 15:27, Kyotaro Horiguchi  
> wrote:
> > I have considered only the two messages. Actually, buffile.c and md.c
> > are already like that. The attached aligns the messages in
> > pg_combinebackup.c and reconstruct.c with the precedents.
> 
> This looks like a worthy cause to make translator work easier.
> 
> I don't want to widen the goalposts or anything, but just wondering if
> you'd searched for any others that could get similar treatment?
> 
> I only just had a quick look at the following.
> 
> $ cat src/backend/po/fr.po | grep -E "^msgid\s" | sed -E
> 's/%[a-zA-Z]+/\%/g' | sort | uniq -d -c
>  31 msgid ""
>   2 msgid "could not accept SSL connection: %"
>   2 msgid "could not initialize LDAP: %"
>   2 msgid "could not look up local user ID %: %"
...
> I've not looked at how hard it would be with any of the above to
> determine how hard it would be to make the formats consistent.  The
> 3rd last one seems similar enough that it might be worth doing
> together with this?


I checked for that kind of msgids in a bit more intensive way. The
numbers are the line numbers in ja.po of backend. I didn't check the
same for other modules.

> ###: invalid timeline %lld
>@ backup/walsummaryfuncs.c:95
>  invalid timeline %u
>@ repl_gram.y:318 repl_gram.y:359

In the first case, the %lld can be more than 2^31.

In the second case, %u is uint32. However, the bigger issue is
checking if the uint32 value is negative:

repl_gram.c: 147
>   if ((yyvsp[0].uintval) <= 0)
>   ereport(ERROR,
>   (errcode(ERRCODE_SYNTAX_ERROR),
>errmsg("invalid timeline %u",
> (yyvsp[0].uintval))));

which cannot happen. I think we can simply remove the useless error
check.

> ###: could not read file \"%s\": read %d of %zu
>@ ../common/controldata_utils.c:116 ../common/controldata_utils.c:119 
> access/transam/xlog.c:3417 access/transam/xlog.c:4278 
> replication/logical/origin.c:750 replication/logical/origin.c:789 
> replication/logical/snapbuild.c:2040 replication/slot.c:2218 
> replication/slot.c:2259 replication/walsender.c:660 
> utils/cache/relmapper.c:833
>  could not read file \"%s\": read %d of %lld
>@ access/transam/twophase.c:1372
>  could not read file \"%s\": read %zd of %zu
>@ backup/basebackup.c:2102
> ###: oversize GSSAPI packet sent by the client (%zu > %zu)
>@ libpq/be-secure-gssapi.c:351
>  oversize GSSAPI packet sent by the client (%zu > %d)
>@ libpq/be-secure-gssapi.c:575
> ###: compressed segment file \"%s\" has incorrect uncompressed size %d, 
> skipping
>@ pg_receivewal.c:362
>  compressed segment file \"%s\" has incorrect uncompressed size %zu, 
> skipping
>@ pg_receivewal.c:448
> ###: could not read file \"%s\": read only %zd of %lld bytes
>@ pg_combinebackup.c:1304
>  could not read file \"%s\": read only %d of %u bytes
>@ reconstruct.c:514


We can "fix" them the same way.  I debated whether to use ssize_t for
read() and replace all instances of size_t with Size. However, in the
end, I decided to only keep it consistent with the surrounding code.

> ###: index %d out of valid range, 0..%d
>@ utils/adt/varlena.c:3220 utils/adt/varlena.c:3287
>  index %lld out of valid range, 0..%lld
>@ utils/adt/varlena.c:3251 utils/adt/varlena.c:3323
> ###: string is too long for tsvector (%d bytes, max %d bytes)
>@ tsearch/to_tsany.c:194 utils/adt/tsvector.c:277 
> utils/adt/tsvector_op.c:1126
>  string is too long for tsvector (%ld bytes, max %ld bytes)
>@ utils/adt/tsvector.c:223

We can unify them and did in the attached, but I'm not sure that's
sensible..

> ###: could not look up local user ID %d: %s
>@ ../port/user.c:43 ../port/user.c:79
>  could not look up local user ID %ld: %s
>@ libpq/auth.c:1886

Both of the above use uid_t (defined as int) as user ID and it is
explicitly cast to "int" in the first case and to long in the second,
which seems mysterious. Although I'm not sure if there's a possibility
of uid_t being widened in the future, I unified them to the latter way
for robustness.
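
So both call sites end up in roughly this shape (illustrative only; the
surrounding code and variable names differ between the two files):

	/* cast uid_t to long to match the %ld placeholder */
	ereport(LOG,
			(errmsg("could not look up local user ID %ld: %s",
					(long) uid, strerror(save_errno))));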

> ###: error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %m
>@ file.c:44
>  error while cloning relation \"%s.%s\" (\"%s\" to \"%s\"): %s
>@ file.c:65
> 
> The latter seems to be changed to %m by reassigning save_errno to errno, as 
> done in other places.
> 
> ###: could not get data directory using %s: %m
>@ option.c:448
>  could not get data directory using %s: %s
>@ option.c:452
> ###: could not get control data using %s: %m
>@ controldata.c:129 controldata.c:199
>  could not get control data using %s: %s
>@ controldata.c:175 controldata.c:507
> ###: %s: %m
>@ copy.c:401 psqlscanslash.l:805 psqlscanslash.l:817 
> 

Re: Infinite loop in XLogPageRead() on standby

2024-03-15 Thread Alexander Kukushkin
Hi Kyotaro,

On Wed, 13 Mar 2024 at 03:56, Kyotaro Horiguchi 
wrote:

> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>

Hmm, I don't think it could really be prevented.
There are always chances that a standby that is not ahead of other
standbys could be promoted, due to reasons like:
1. The HA configuration doesn't let certain nodes be promoted.
2. It is an async standby (its name isn't listed in
synchronous_standby_names) and it was ahead of the promoted sync standby. No
data loss from the client point of view.


> Of course, regardless of the changes above, if recovery on the second
> standby had reached the end of the page-spanning record before
> redirection to the first standby, it would need pg_rewind to connect
> to the first standby.
>

Correct, IMO pg_rewind is a right way of solving it.

Regards,
--
Alexander Kukushkin


Re: Weird test mixup

2024-03-15 Thread Michael Paquier
On Thu, Mar 14, 2024 at 07:13:53PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
>> Or we could just disable runningcheck because of the concurrency
>> requirement in this test.  The test would still be able to run, just
>> less times.
> 
> No, actually we *must* mark all these tests NO_INSTALLCHECK if we
> stick with the current definition of injection points.  The point
> of installcheck mode is that the tests are supposed to be safe to
> run in a live installation.  Side-effects occurring in other
> databases are completely not OK.

I really don't want to plug any runtime conditions into the backend
APIs, because there can be so much more that can be done there than
only restricting a callback to a database.  I can imagine process type
restrictions, process PID restrictions, etc.  So this knowledge should
stick into the test module itself, and be expanded there.  That's
easier ABI-wise, as well.

> I can see that some tests would want to be able to inject code
> cluster-wide, but I bet that's going to be a small minority.
> I suggest that we invent a notion of "global" vs "local"
> injection points, where a "local" one only fires in the DB it
> was defined in.  Then only tests that require a global injection
> point need to be NO_INSTALLCHECK.

Attached is a POC of what could be done.  I have extended the module
injection_points so that it is possible to register what I'm calling a
"condition" in the module, defined with a new SQL function.

The condition is stored in shared memory with the point name, then at
runtime the conditions are cross-checked in the callbacks.  With the
interface of this patch, the condition should be registered *before* a
point is attached, but this stuff could also be written so that
injection_points_attach() takes an optional argument with a database
name.  Or this could use a different, new SQL function, say an
injection_points_attach_local() that registers a condition with
MyDatabaseId on top of attaching the point, making the whole thing
happen while holding the spinlock of the shmem state for the module
just once.

By the way, modules/gin/ was missing a detach, so the test was
not repeatable with successive installchecks.  Adding a pg_sleep of a
few seconds after 'gin-leave-leaf-split-incomplete' is registered
enlarges the window, and the patch avoids failures when running
installcheck in parallel for modules/gin/ and something else using
gin, like contrib/btree_gin/:
while make USE_MODULE_DB=1 installcheck; do :; done

0001 is the condition facility for the module, 0002 is a fix for the
GIN test.  Thoughts are welcome.
--
Michael
From 5f00092a0bd193f70d9daff77caacfc2ba83ca52 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 15 Mar 2024 16:03:05 +0900
Subject: [PATCH 1/2] Introduce runtime conditions in module injection_points

This adds a new SQL function injection_points_condition() that can be
used to register a runtime condition to an injection point, declared or
not.  For now, it is possible to register a database name to make an
injection point run only on a defined database.
---
 .../expected/injection_points.out |  54 
 .../injection_points--1.0.sql |  13 ++
 .../injection_points/injection_points.c   | 131 ++
 .../injection_points/sql/injection_points.sql |  14 ++
 4 files changed, 212 insertions(+)

diff --git a/src/test/modules/injection_points/expected/injection_points.out b/src/test/modules/injection_points/expected/injection_points.out
index 827456ccc5..ccb83b291f 100644
--- a/src/test/modules/injection_points/expected/injection_points.out
+++ b/src/test/modules/injection_points/expected/injection_points.out
@@ -115,4 +115,58 @@ NOTICE:  notice triggered for injection point TestInjectionLog2
  
 (1 row)
 
+SELECT injection_points_detach('TestInjectionLog2');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Conditions
+SELECT current_database() AS datname \gset
+-- Register injection point to run on database 'postgres'.
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_condition('TestConditionError', 'postgres');
+ injection_points_condition 
+
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_detach('TestConditionError');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Register injection point to run on current database
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_condition('TestConditionError', :'datname');
+ injection_points_condition 
+
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- error
+ERROR:  error triggered for injection point 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread Bharath Rupireddy
On Fri, Mar 15, 2024 at 10:15 AM shveta malik  wrote:
>
> > > > wal_level_insufficient it's the reason for conflict with recovery".
>
> +1 on maintaining both conflicting and invalidation_reason

Thanks.

> Thanks for the patch. JFYI, patch09 does not apply to HEAD, some
> recent commit caused the conflict.

Yep, the conflict is in src/test/recovery/meson.build and is because
of e6927270cd18d535b77cbe79c55c6584351524be.

> Some trivial comments on patch001 (yet to review other patches)

Thanks for looking into this.

> 1)
> info.c:
>
> - "%s as caught_up, conflict_reason IS NOT NULL as invalid "
> + "%s as caught_up, invalidation_reason IS NOT NULL as invalid "
>
> Can we revert back to 'conflicting as invalid' since it is a query for
> logical slots only.

I guess not. The intention there is to check for invalid logical slots,
not just the conflicting ones. Logical slots can get
invalidated due to other reasons as well.

> 2)
> 040_standby_failover_slots_sync.pl:
>
> - q{SELECT conflict_reason IS NULL AND synced AND NOT temporary FROM
> pg_replication_slots WHERE slot_name = 'lsub1_slot';}
> + q{SELECT invalidation_reason IS NULL AND synced AND NOT temporary
> FROM pg_replication_slots WHERE slot_name = 'lsub1_slot';}
>
> Here too, can we have 'NOT conflicting' instead of '
> invalidation_reason IS NULL' as it is a logical slot test.

I guess no. The tests are ensuring the slot on the standby isn't invalidated.

In general, one needs to use the 'conflicting' column from
pg_replication_slots when the intention is to look for reasons for
conflicts, otherwise use the 'invalidation_reason' column for
invalidations.

Please see the attached v10 patch set after resolving the merge
conflict and fixing an indentation warning in the TAP test file.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 41290be4eb1562cf10313e3eda19fcbbf392088f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 15 Mar 2024 05:47:44 +
Subject: [PATCH v10 1/4] Track invalidation_reason in pg_replication_slots

Up until now, reason for replication slot invalidation is not
tracked in pg_replication_slots. A recent commit 007693f2a added
conflict_reason to show the reasons for slot invalidation, but
only for logical slots.

This commit adds a new column to show invalidation reasons for
both physical and logical slots. And, this commit also turns
conflict_reason text column to conflicting boolean column
(effectively reverting commit 007693f2a). One now can look at the
new invalidation_reason column for logical slots conflict with
recovery.
---
 doc/src/sgml/ref/pgupgrade.sgml   |  4 +-
 doc/src/sgml/system-views.sgml| 63 +++
 src/backend/catalog/system_views.sql  |  5 +-
 src/backend/replication/logical/slotsync.c|  2 +-
 src/backend/replication/slot.c|  8 +--
 src/backend/replication/slotfuncs.c   | 25 +---
 src/bin/pg_upgrade/info.c |  4 +-
 src/include/catalog/pg_proc.dat   |  6 +-
 src/include/replication/slot.h|  2 +-
 .../t/035_standby_logical_decoding.pl | 35 ++-
 .../t/040_standby_failover_slots_sync.pl  |  4 +-
 src/test/regress/expected/rules.out   |  7 ++-
 12 files changed, 95 insertions(+), 70 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 58c6c2df8b..8de52bf752 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -453,8 +453,8 @@ make prefix=/usr/local/pgsql.new install
   
All slots on the old cluster must be usable, i.e., there are no slots
whose
-   pg_replication_slots.conflict_reason
-   is not NULL.
+   pg_replication_slots.conflicting
+   is not true.
   
  
  
diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index be90edd0e2..f3fb5ba1b0 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -2525,34 +2525,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 
  
   
-   conflict_reason text
+   conflicting bool
   
   
-   The reason for the logical slot's conflict with recovery. It is always
-   NULL for physical slots, as well as for logical slots which are not
-   invalidated. The non-NULL values indicate that the slot is marked
-   as invalidated. Possible values are:
-   
-
- 
-  wal_removed means that the required WAL has been
-  removed.
- 
-
-
- 
-  rows_removed means that the required rows have
-  been removed.
- 
-
-
- 
-  wal_level_insufficient means that the
-  primary doesn't have a  sufficient to
-  perform logical decoding.
- 
-
-   
+   True if this logical 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Fri, Mar 15, 2024 at 10:32 PM Kartyshov Ivan
 wrote:
>
> On 2024-03-15 22:59, Kartyshov Ivan wrote:
> > On 2024-03-11 13:44, Alexander Korotkov wrote:
> >> I picked the second option and left only the AFTER clause for the
> >> BEGIN statement.  I think this should be enough for the beginning.
> >
> > Thank you for your rework on your patch, here I made some fixes:
> > 0) autocomplete
> > 1) less jumps
> > 2) more description and add cases in doc

Thank you!

> > I think, it will be useful to have stand-alone statement.
> > Why you would like to see only AFTER clause for the BEGIN statement?

Yes, stand-alone statements might also be useful.  But I think that
the best way for this feature to get into core would be to commit
the minimal version first.  The BEGIN clause has minimal invasiveness
for the syntax and I believe covers most typical use-cases.  Once we
figure out it's OK and have positive feedback from users, we can add
more enhancements incrementally.

> Rebase and update patch.

Cool, I was just about to ask you to do this.

 --
Regards,
Alexander Korotkov




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-15 Thread Andres Freund
Hi,

On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
> I will soon send out a summary of what we investigated off-list about
> 0010 (though we didn't end up concluding anything). My "fix" (leaving
> BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block())
> eliminates the regression in 0010 on the one example that I repro'd
> upthread, but it would be good to know if it eliminates the
> regressions across some other tests.

I spent a good amount of time looking into this with Melanie. After a bunch of
wrong paths I think I found the issue: We end up prefetching blocks we have
already read. Notably this happens even as-is on master - just not as
frequently as after moving BitmapAdjustPrefetchIterator().

From what I can tell the prefetching in parallel bitmap heap scans is
thoroughly broken.  I added some tracking of the last block read, the last
block prefetched to ParallelBitmapHeapState and found that with a small
effective_io_concurrency we end up with ~18% of prefetches being of blocks we
already read! After moving the BitmapAdjustPrefetchIterator() it rises to 86%,
no wonder it's slower...

The race here seems fairly substantial - we're moving the two iterators
independently from each other, in multiple processes, without useful locking.

I'm inclined to think this is a bug we ought to fix in the backbranches.

Greetings,

Andres Freund




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Michael Paquier
On Fri, Mar 15, 2024 at 11:23:07PM +0100, Daniel Gustafsson wrote:
> On 15 Mar 2024, at 21:56, Andrew Dunstan  wrote:
>> On Fri, Mar 15, 2024 at 10:15 AM Tom Lane wrote:
>> Yeah, I agree an Assert seems safest here.

Cool.

>> I'd like to get this done ASAP so I can rebase my incremental parse
>> patchset. Daniel, do you want to commit it? If not, I can.
> 
> Sure, I can commit these patches.  It won't be until tomorrow though since I
> prefer to have ample time to monitor the buildfarm, if you are in a bigger 
> rush
> than that then feel free to go ahead.

+1.  I've not looked much at 0002, but feel free to do so if you think
both are good for shipping.
--
Michael


signature.asc
Description: PGP signature


Re: add AVX2 support to simd.h

2024-03-15 Thread Nathan Bossart
On Fri, Mar 15, 2024 at 12:41:49PM -0500, Nathan Bossart wrote:
> I've also attached the results of running this benchmark on my machine at
> HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
> appears to work pretty well.  When there is a small "tail," it regresses a
> small amount, but overall, it seems to improve more cases than it harms.
> 0002 does regress searches on smaller arrays quite a bit, since it
> postpones the SIMD optimizations until the arrays are longer.  It might be
> possible to mitigate by using 2 registers when the "tail" is long enough,
> but I have yet to try that.

The attached 0003 is a sketch of what such mitigation might look like.  It
appears to help with the regressions nicely.  I omitted the benchmarking
patch in v3 to appease cfbot.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 3817435d200af5da954d505ae66245662dea064c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 15 Mar 2024 12:26:26 -0500
Subject: [PATCH v3 1/3] pg_lfind32: process "tail" with SIMD instructions

---
 src/include/port/pg_lfind.h | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..9d21284724 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -103,7 +103,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
 
 	/* round down to multiple of elements per iteration */
-	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+	uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -117,9 +117,11 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			break;
 		}
 	}
+	i = 0;
 #endif
 
-	for (i = 0; i < tail_idx; i += nelem_per_iteration)
+retry:
+	for (; i < tail_idx; i += nelem_per_iteration)
 	{
 		Vector32	vals1,
 	vals2,
@@ -157,6 +159,16 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			return true;
 		}
 	}
+
+	if (i == nelem)
+		return false;
+	else if (tail_idx > 0)
+	{
+		tail_idx = nelem;
+		i = nelem - nelem_per_iteration;
+		goto retry;
+	}
+
 #endif			/* ! USE_NO_SIMD */
 
 	/* Process the remaining elements one at a time. */
-- 
2.25.1

From a867e342db08aae501374c75c0d8f17473a6cbc9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 15 Mar 2024 12:26:52 -0500
Subject: [PATCH v3 2/3] add avx2 support in simd.h

---
 src/include/port/simd.h | 58 -
 1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 597496f2fb..767127b85c 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,7 +18,15 @@
 #ifndef SIMD_H
 #define SIMD_H
 
-#if (defined(__x86_64__) || defined(_M_AMD64))
+#if defined(__AVX2__)
+
+#include 
+#define USE_AVX2
+typedef __m256i Vector8;
+typedef __m256i Vector32;
+
+#elif (defined(__x86_64__) || defined(_M_AMD64))
+
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
  * that compilers targeting this architecture understand SSE2 intrinsics.
@@ -107,7 +115,9 @@ static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 static inline void
 vector8_load(Vector8 *v, const uint8 *s)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u8(s);
@@ -120,7 +130,9 @@ vector8_load(Vector8 *v, const uint8 *s)
 static inline void
 vector32_load(Vector32 *v, const uint32 *s)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u32(s);
@@ -134,7 +146,9 @@ vector32_load(Vector32 *v, const uint32 *s)
 static inline Vector8
 vector8_broadcast(const uint8 c)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	return _mm256_set1_epi8(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi8(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u8(c);
@@ -147,7 +161,9 @@ vector8_broadcast(const uint8 c)
 static inline Vector32
 vector32_broadcast(const uint32 c)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_set1_epi32(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi32(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u32(c);
@@ -270,7 +286,9 @@ vector8_has_le(const Vector8 v, const uint8 c)
 static inline bool
 vector8_is_highbit_set(const Vector8 v)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_movemask_epi8(v) != 0;
+#elif defined(USE_SSE2)
 	return _mm_movemask_epi8(v) != 0;
 #elif defined(USE_NEON)
 	return vmaxvq_u8(v) > 0x7F;
@@ -308,7 +326,9 @@ vector32_is_highbit_set(const Vector32 v)
 static inline uint32
 vector8_highbit_mask(const Vector8 v)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-15 Thread Melanie Plageman
On Fri, Mar 15, 2024 at 5:14 PM Andres Freund  wrote:
>
> Hi,
>
> On 2024-03-14 17:39:30 -0400, Melanie Plageman wrote:
> > I will soon send out a summary of what we investigated off-list about
> > 0010 (though we didn't end up concluding anything). My "fix" (leaving
> > BitmapAdjustPrefetchIterator() above table_scan_bitmap_next_block())
> > eliminates the regression in 0010 on the one example that I repro'd
> > upthread, but it would be good to know if it eliminates the
> > regressions across some other tests.
>
> I spent a good amount of time looking into this with Melanie. After a bunch of
> wrong paths I think I found the issue: We end up prefetching blocks we have
> already read. Notably this happens even as-is on master - just not as
> frequently as after moving BitmapAdjustPrefetchIterator().
>
> From what I can tell the prefetching in parallel bitmap heap scans is
> thoroughly broken.  I added some tracking of the last block read, the last
> block prefetched to ParallelBitmapHeapState and found that with a small
> effective_io_concurrency we end up with ~18% of prefetches being of blocks we
> already read! After moving the BitmapAdjustPrefetchIterator() it rises to 86%,
> no wonder it's slower...
>
> The race here seems fairly substantial - we're moving the two iterators
> independently from each other, in multiple processes, without useful locking.
>
> I'm inclined to think this is a bug we ought to fix in the backbranches.

Thinking about how to fix this, perhaps we could keep the current max
block number in the ParallelBitmapHeapState and then when prefetching,
workers could loop calling tbm_shared_iterate() until they've found a
block at least prefetch_pages ahead of the current block. They
wouldn't need to read the current max value from the parallel state on
each iteration. Even checking it once and storing that value in a
local variable prevented prefetching blocks after reading them in my
example repro of the issue.
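
A rough sketch of that scheme as a standalone model; the struct layout
and helper names here are stand-ins for illustration, not the actual
ParallelBitmapHeapState or tbm_shared_iterate() code:

#include <stdatomic.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct SharedScanState
{
    _Atomic BlockNumber max_block_read; /* highest block any worker has read */
} SharedScanState;

/* stand-ins for tbm_shared_iterate() and the prefetch call */
extern BlockNumber next_shared_block(void *iterator);
extern void issue_prefetch(BlockNumber blkno);

static void
prefetch_ahead(SharedScanState *shared, void *iterator,
               BlockNumber prefetch_pages)
{
    /* Read the shared position once; no need to reread it per iteration. */
    BlockNumber current_max = atomic_load(&shared->max_block_read);
    BlockNumber blkno;

    while ((blkno = next_shared_block(iterator)) != InvalidBlockNumber)
    {
        if (blkno > current_max)
            issue_prefetch(blkno);  /* skip blocks somebody already read */
        if (blkno >= current_max + prefetch_pages)
            break;                  /* far enough ahead of the readers */
    }
}

Reading the shared value once per prefetch round keeps this cheap; the
exact interlocking against the read side is the open question.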

- Melanie




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-15 Thread Tom Lane
This patch seems to have stalled out again.  In hopes of getting it
over the finish line, I've done a bit more work to address the two
loose ends I felt were probably essential to deal with:

* Duplicative blob ACLs are now merged into a single TOC entry
(per metadata group) with the GRANT/REVOKE commands stored only
once.  This is to address the possibly-common case where a database
has a ton of blobs that have identical-but-not-default ACLs.

I have not done anything about improving efficiency for blob comments
or security labels.  I think it's reasonable to assume that blobs with
comments are pets not cattle, and there won't be many of them.
I suppose it could be argued that seclabels might be used like ACLs
with a lot of duplication, but I doubt that there's anyone out there
at all putting seclabels on blobs in practice.  So I don't care to
expend effort on that.

* Parallel pg_upgrade cuts the --transaction-size given to concurrent
pg_restore jobs by the -j factor.  This is to ensure we keep the
shared locks table within bounds even in parallel mode.

Now we could go further than that and provide some direct user
control over these hard-wired settings, but I think that could
be left for later, getting some field experience before we design
an API.  In short, I think this patchset is more or less committable.

0001-0004 are rebased up to HEAD, but differ only in line numbers
from the v10 patchset.  0005 handles ACL merging, and 0006 does
the other thing.

regards, tom lane

From 1df5cc3ffdee85db8f9815dc0839769192b57158 Mon Sep 17 00:00:00 2001
From: Tom Lane 
Date: Fri, 26 Jan 2024 11:10:00 -0500
Subject: [PATCH v11 1/6] Some small preliminaries for pg_dump changes.

Centralize management of the lo_buf used to hold data while restoring
blobs.  The code previously had each format handler create lo_buf,
which seems rather pointless given that the format handlers all make
it the same way.  Moreover, the format handlers never use lo_buf
directly, making this setup a failure from a separation-of-concerns
standpoint.  Let's move the responsibility into pg_backup_archiver.c,
which is the only module concerned with lo_buf.  The main reason to do
this now is that it allows a centralized fix for the soon-to-be-false
assumption that we never restore blobs in parallel.

Also, get rid of dead code in DropLOIfExists: it's been a long time
since we had any need to be able to restore to a pre-9.0 server.
---
 src/bin/pg_dump/pg_backup_archiver.c  |  9 +
 src/bin/pg_dump/pg_backup_custom.c|  7 ---
 src/bin/pg_dump/pg_backup_db.c| 27 +--
 src/bin/pg_dump/pg_backup_directory.c |  6 --
 src/bin/pg_dump/pg_backup_null.c  |  4 
 src/bin/pg_dump/pg_backup_tar.c   |  4 
 6 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index d97ebaff5b..f5935b08bb 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -1343,6 +1343,12 @@ StartRestoreLO(ArchiveHandle *AH, Oid oid, bool drop)
 	AH->loCount++;
 
 	/* Initialize the LO Buffer */
+	if (AH->lo_buf == NULL)
+	{
+		/* First time through (in this process) so allocate the buffer */
+		AH->lo_buf_size = LOBBUFSIZE;
+		AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
+	}
 	AH->lo_buf_used = 0;
 
 	pg_log_info("restoring large object with OID %u", oid);
@@ -4749,6 +4755,9 @@ CloneArchive(ArchiveHandle *AH)
 	/* clone has its own error count, too */
 	clone->public.n_errors = 0;
 
+	/* clones should not share lo_buf */
+	clone->lo_buf = NULL;
+
 	/*
 	 * Connect our new clone object to the database, using the same connection
 	 * parameters used for the original connection.
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index b576b29924..7c6ac89dd4 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -140,10 +140,6 @@ InitArchiveFmt_Custom(ArchiveHandle *AH)
 	ctx = (lclContext *) pg_malloc0(sizeof(lclContext));
 	AH->formatData = (void *) ctx;
 
-	/* Initialize LO buffering */
-	AH->lo_buf_size = LOBBUFSIZE;
-	AH->lo_buf = (void *) pg_malloc(LOBBUFSIZE);
-
 	/*
 	 * Now open the file
 	 */
@@ -902,9 +898,6 @@ _Clone(ArchiveHandle *AH)
 	 * share knowledge about where the data blocks are across threads.
 	 * _PrintTocData has to be careful about the order of operations on that
 	 * state, though.
-	 *
-	 * Note: we do not make a local lo_buf because we expect at most one BLOBS
-	 * entry per archive, so no parallelism is possible.
 	 */
 }
 
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index f766b65059..b297ca049d 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -544,26 +544,9 @@ CommitTransaction(Archive *AHX)
 void
 DropLOIfExists(ArchiveHandle *AH, Oid oid)
 {
-	/*
-	 * If we are not restoring to a direct database connection, we 

Re: Statistics Import and Export

2024-03-15 Thread Jeff Davis
On Fri, 2024-03-15 at 15:30 -0700, Jeff Davis wrote:
> Still looking, but one quick comment is that the third argument of
> dumpRelationStats() should be const, which eliminates a warning.

A few other comments:

* pg_set_relation_stats() needs to do an ACL check so you can't set the
stats on someone else's table. I suggest honoring the new MAINTAIN
privilege as well.

* If possible, reading from pg_stats (instead of pg_statistic) would be
ideal because pg_stats already does the right checks at read time, so a
non-superuser can export stats, too.

* If reading from pg_stats, should you change the signature of
pg_set_relation_stats() to have argument names matching the columns of
pg_stats (e.g. most_common_vals instead of stakind/stavalues)?

In other words, make this a slightly higher level: conceptually
exporting/importing pg_stats rather than pg_statistic. This may also
make the SQL export queries simpler.
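
For illustration, reading through pg_stats could look like the sketch
below; pg_stats is the existing view, while feeding these columns into
pg_set_relation_stats() is the proposal above and its exact signature is
still open:

-- pg_stats applies the usual read-time permission checks, so a
-- non-superuser can run the export as well.
SELECT schemaname, tablename, attname, inherited,
       null_frac, avg_width, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'mytable';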

Also, I'm wondering about error handling. Is some kind of error thrown
by pg_set_relation_stats() going to abort an entire restore? That might
be easy to prevent with pg_restore, because it can just omit the stats,
but harder if it's in a SQL file.

Regards,
Jeff Davis





Re: Add basic tests for the low-level backup method.

2024-03-15 Thread Michael Paquier
On Fri, Mar 15, 2024 at 06:37:35PM +1300, David Steele wrote:
> This seems like a reasonable explanation to me.

FYI, drongo has just passed the test.  fairywren uses TAP, but does not
run the recovery tests.
--
Michael




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Andrew Dunstan
On Fri, Mar 15, 2024 at 10:15 AM Tom Lane  wrote:

> Daniel Gustafsson  writes:
> > I can't see how refusing to free memory owned and controlled by someone
> else,
> > and throwing an error if attempted, wouldn't be a sound defensive
> programming
> > measure.
>
> I think the argument is about what "refusal" looks like.
> An Assert seems OK to me, but anything based on elog is
> likely to be problematic, because it'll involve exit()
> somewhere.
>
>
>

Yeah, I agree an Assert seems safest here.
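
To make that concrete, a minimal sketch of the defensive check being
discussed; the context type and ownership flag are invented for
illustration, not the jsonapi code:

#include <assert.h>
#include <stdlib.h>

#define Assert(condition) assert(condition) /* stand-in for PostgreSQL's Assert */
#define CTX_OWNS_BUFFER 0x01

typedef struct ParseContext
{
    int   flags;
    char *buf;
} ParseContext;

/*
 * Refuse to free caller-owned memory.  An Assert catches misuse in
 * assert-enabled builds, while an elog(ERROR) here would be unsafe in
 * FRONTEND code because it would eventually call exit().
 */
static void
destroy_context(ParseContext *ctx)
{
    Assert(ctx->flags & CTX_OWNS_BUFFER);
    if (ctx->flags & CTX_OWNS_BUFFER)
        free(ctx->buf);
}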

I'd like to get this done ASAP so I can rebase my incremental parse
patchset. Daniel, do you want to commit it? If not, I can.

cheers

andrew


Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Daniel Gustafsson
> On 15 Mar 2024, at 21:56, Andrew Dunstan  wrote:
> On Fri, Mar 15, 2024 at 10:15 AM Tom Lane wrote:
> Daniel Gustafsson <dan...@yesql.se> writes:
> > I can't see how refusing to free memory owned and controlled by someone 
> > else,
> > and throwing an error if attempted, wouldn't be a sound defensive 
> > programming
> > measure.
> 
> I think the argument is about what "refusal" looks like.
> An Assert seems OK to me, but anything based on elog is
> likely to be problematic, because it'll involve exit()
> somewhere.
> 
> Yeah, I agree an Assert seems safest here. 
> 
> I'd like to get this done ASAP so I can rebase my incremental parse patchset. 
> Daniel, do you want to commit it? If not, I can.

Sure, I can commit these patches.  It won't be until tomorrow though, since I
prefer to have ample time to monitor the buildfarm; if you are in a bigger rush
than that, feel free to go ahead.

--
Daniel Gustafsson





Re: Statistics Import and Export

2024-03-15 Thread Jeff Davis
On Fri, 2024-03-15 at 03:55 -0400, Corey Huinker wrote:
> 
> Statistics are preserved by default, but this can be disabled with
> the option --no-statistics. This follows the prevailing option
> pattern in pg_dump, etc.

I'm not sure if saving statistics should be the default in 17. I'm
inclined to make it opt-in.

> There are currently several failing TAP tests around
> pg_dump/pg_restore/pg_upgrade.

It is a permissions problem. When the user running pg_dump is not a
superuser, they don't have permission to access pg_statistic. That
causes an error in exportRelationStatsStmt(), which returns NULL, and
then the caller segfaults.

> I'm looking at those, but in the mean time I'm seeking feedback on
> the progress so far.

Still looking, but one quick comment is that the third argument of
dumpRelationStats() should be const, which eliminates a warning.

Regards,
Jeff Davis





Re: Vectored I/O in bulk_write.c

2024-03-15 Thread Thomas Munro
I canvassed Andres off-list since smgrzeroextend() is his invention,
and he wondered if it was a good idea to blur the distinction between
the different zero-extension strategies like that.  Good question.  My
take is that it's fine:

mdzeroextend() already uses fallocate() only for nblocks > 8, but
otherwise writes out zeroes, because that was seen to interact better
with file system heuristics on common systems.  That preserves current
behaviour for callers of plain-old smgrextend(), which becomes a
wrapper for smgrwrite(..., 1, _ZERO | _EXTEND).  If some hypothetical
future caller wants to be able to call smgrwritev(..., NULL, 9 blocks,
_ZERO | _EXTEND) directly without triggering the fallocate() strategy
for some well researched reason, we could add a new flag to say so, ie
adjust that gating.

In other words, we have already blurred the semantics.  To me, it
seems nicer to have a single high level interface for the same logical
operation, and flags to select strategies more explicitly if that is
eventually necessary.
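
Under the behaviour described above (fallocate() only beyond eight
blocks), the gating could be sketched as below; this is an illustration
under those assumptions, not the actual md.c code, and error handling is
omitted:

#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

static void
zero_extend(int fd, off_t offset, int nblocks)
{
    if (nblocks > 8)
    {
        /* Large extension: let the filesystem allocate zeroed space. */
        (void) posix_fallocate(fd, offset, (off_t) nblocks * BLCKSZ);
    }
    else
    {
        /*
         * Small extension: writing zeroes interacts better with
         * filesystem heuristics on common systems.
         */
        static const char zeroes[BLCKSZ];

        for (int i = 0; i < nblocks; i++)
            (void) pwrite(fd, zeroes, BLCKSZ, offset + (off_t) i * BLCKSZ);
    }
}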




Re: Weird test mixup

2024-03-15 Thread Michael Paquier
On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
> For the gin test, a single "SELECT injection_points_attach_local()" at the
> top of the test file would be most convenient.
> 
> If I have to do "SELECT
> injection_points_condition('gin-finish-incomplete-split', :'datname');" for
> every injection point in the test, I will surely forget it sometimes.

So will I, most likely.  The odds never play in favor of hackers.  I
have a few more tests in mind that can be linked to a specific
backend with SQL queries, but I've not been able to get back to it
yet.

> Wrt. the spinlock and shared memory handling, I think this would be simpler
> if you could pass some payload in the InjectionPointAttach() call, which
> would be passed back to the callback function:
> 
> In this case, the payload would be the "slot index" in shared memory.
>
> Or perhaps always allocate, say, 1024 bytes of working area for every
> attached injection point that the test module can use any way it wants. Like
> for storing extra conditions, or for the wakeup counter stuff in
> injection_wait(). A fixed size working area is a little crude, but would be
> very handy in practice.

Perhaps.  I am not sure that we need more than the current signature,
all that can just be handled in some module-specific shmem area.  The
key is to be able to link a point name to some state related to it.
Using a hash table would be more efficient, but performance-wise an
array is not going to matter, as there will most likely never be more
than 8 points.  4 is already a lot; I'm just doubling that on safety
grounds.

> It would be nice to automatically detach all the injection points on process
> exit. You wouldn't always want that, but I think most tests hold a session
> open throughout the test, and for those it would be handy.

Linking all the points to a PID with an injection_points_attach_local()
that switches a static flag while registering a before_shmem_exit() to
do an automated cleanup sounds like the simplest approach to me based
on what I'm reading on this thread.
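
A sketch of that approach; before_shmem_exit() is the real hook, while
the local-mode flag and detach helper are hypothetical module code:

#include "postgres.h"
#include "fmgr.h"
#include "storage/ipc.h"

static bool injection_local_mode = false;

extern void detach_all_local_points(void);  /* hypothetical helper */

static void
injection_points_cleanup(int code, Datum arg)
{
    /* Detach everything this backend attached in local mode. */
    if (injection_local_mode)
        detach_all_local_points();
}

PG_FUNCTION_INFO_V1(injection_points_attach_local);

Datum
injection_points_attach_local(PG_FUNCTION_ARGS)
{
    if (!injection_local_mode)
    {
        injection_local_mode = true;
        /* registered once per backend, runs before shmem goes away */
        before_shmem_exit(injection_points_cleanup, (Datum) 0);
    }
    PG_RETURN_VOID();
}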

(Just saw the buildfarm storm, wow.)
--
Michael




Re: Streaming I/O, vectored I/O (WIP)

2024-03-15 Thread Thomas Munro
I am planning to push the bufmgr.c patch soon.  At that point the new
API won't have any direct callers yet, but the traditional
ReadBuffer() family of functions will internally reach
StartReadBuffers(nblocks=1) followed by WaitReadBuffers(),
ZeroBuffer() or nothing as appropriate.  Any more thoughts or
objections?  Naming, semantics, correctness of buffer protocol,
sufficiency of comments, something else?
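
For readers following along, the wrapping could look roughly like the
sketch below; the StartReadBuffers()/WaitReadBuffers() argument lists are
guesses for illustration (assuming the usual bufmgr includes), not the
committed API:

/* Hypothetical shape of ReadBuffer() on top of the vectored API. */
Buffer
ReadBuffer_sketch(Relation rel, BlockNumber blocknum)
{
    ReadBuffersOperation operation;
    Buffer      buffer;
    int         nblocks = 1;

    if (StartReadBuffers(&operation, rel, blocknum, &buffer, &nblocks))
        WaitReadBuffers(&operation);    /* I/O was started; wait for it */
    /* otherwise the buffer was already valid; nothing to wait for */

    return buffer;
}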




Re: Support json_errdetail in FRONTEND builds

2024-03-15 Thread Andrew Dunstan



> On Mar 16, 2024, at 8:53 AM, Daniel Gustafsson  wrote:
> 
> 
>> 
>> On 15 Mar 2024, at 21:56, Andrew Dunstan  wrote:
>> On Fri, Mar 15, 2024 at 10:15 AM Tom Lane wrote:
>> Daniel Gustafsson <dan...@yesql.se> writes:
>>> I can't see how refusing to free memory owned and controlled by someone 
>>> else,
>>> and throwing an error if attempted, wouldn't be a sound defensive 
>>> programming
>>> measure.
>> 
>> I think the argument is about what "refusal" looks like.
>> An Assert seems OK to me, but anything based on elog is
>> likely to be problematic, because it'll involve exit()
>> somewhere.
>> 
>> Yeah, I agree an Assert seems safest here.
>> 
>> I'd like to get this done ASAP so I can rebase my incremental parse 
>> patchset. Daniel, do you want to commit it? If not, I can.
> 
> Sure, I can commit these patches.  It won't be until tomorrow though since I
> prefer to have ample time to monitor the buildfarm, if you are in a bigger 
> rush
> than that then feel free to go ahead.
> 

Tomorrow will be fine, thanks.

Cheers

Andrew 




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Kartyshov Ivan

On 2024-03-15 22:59, Kartyshov Ivan wrote:

On 2024-03-11 13:44, Alexander Korotkov wrote:

I picked the second option and left only the AFTER clause for the
BEGIN statement.  I think this should be enough for the beginning.


Thank you for your rework on the patch; here I made some fixes:
0) autocomplete
1) fewer jumps
2) more description and added cases in the docs

I think it would be useful to have a stand-alone statement.
Why would you like to see only the AFTER clause for the BEGIN statement?


Rebased and updated the patch.

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..759e46ec24 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
@@ -78,6 +81,50 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
 

+
+   
+AFTER lsn_value
+
+ 
+   AFTER clause is used on standby in
+   physical streaming replication
+   and specifies waiting for the specific WAL location (LSN)
+   to be replayed before starting the transaction.
+ 
+ 
+   This option is useful when the user makes some data changes on primary
+   and needs a guarantee to see these changes on standby.  The LSN to wait
+   could be obtained on the primary using
+   pg_current_wal_insert_lsn
+   function after committing the relevant changes.
+ 
+
+   
+
+   
+WITHIN number_of_milliseconds
+
+ 
+   Provides the timeout for the AFTER clause.
+   Especially helpful to prevent freezing on streaming replication
+   connection failures.
+ 
+ 
+   In practice we have three variants of waiting for the current LSN:
+  waiting forever - (finish on PM DEATH or SIGINT)
+ BEGIN AFTER lsn;
+wait for changes to be replayed on standby and then start transaction
+  waiting on timeout -
+ BEGIN AFTER lsn WITHIN 60000;
+wait for changes for 60 seconds and then cancel the wait and start the
+transaction, to prevent freezing on streaming replication connection failures.
+  no wait, just check -
+ BEGIN AFTER lsn WITHIN 1;
+sometimes it is useful just to check whether changes were replayed
+   where number_of_milliseconds can:
+ 
+
+   
   
 
   
@@ -123,6 +170,33 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
 BEGIN;
 
+
+  
+   To begin a transaction block after replaying the given LSN.
+   The command will be canceled if the given LSN is not
+   reached within the timeout of one second.
+
+BEGIN AFTER '0/3F05F791' WITHIN 1000;
+BEGIN
+
+
+  
+   This way we just check (without waiting) whether the LSN was replayed,
+   and if it was, then start the transaction.
+
+BEGIN AFTER '0/3F0FF791' WITHIN 1;
+ERROR:  canceling waiting for LSN due to timeout
+
+
+  
+   To begin a transaction block after replaying the given LSN.
+   The command will be canceled only on user request or postmaster death.
+
+BEGIN AFTER '0/3F0FF791';
+^CCancel request sent
+ERROR:  canceling statement due to user request
+
+
  
 
  
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..89b997ad47 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+	(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);

Re: add AVX2 support to simd.h

2024-03-15 Thread Nathan Bossart
On Wed, Jan 10, 2024 at 09:06:08AM +0700, John Naylor wrote:
> If we have say 25 elements, I mean (for SSE2) check the first 16, then
> the last 16. Some will be checked twice, but that's okay.

I finally got around to trying this.  0001 adds this overlapping logic.
0002 is a rebased version of the AVX2 patch (it needed some updates after
commit 9f225e9).  And 0003 is a benchmark for test_lfind32().  It runs
pg_lfind32() on an array of the given size 100M times.

I've also attached the results of running this benchmark on my machine at
HEAD, after applying 0001, and after applying both 0001 and 0002.  0001
appears to work pretty well.  When there is a small "tail," it regresses a
small amount, but overall, it seems to improve more cases than it harms.
0002 does regress searches on smaller arrays quite a bit, since it
postpones the SIMD optimizations until the arrays are longer.  It might be
possible to mitigate by using 2 registers when the "tail" is long enough,
but I have yet to try that.
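
In scalar form, the overlapping-tail idea of 0001 amounts to the model
below (chunk stands in for one SIMD iteration's worth of elements; this
only restates the patch that follows, it is not additional logic):

#include <stdbool.h>
#include <stdint.h>

static bool
lfind32_model(uint32_t key, const uint32_t *base, uint32_t nelem)
{
    const uint32_t chunk = 16;
    uint32_t tail_idx = nelem & ~(chunk - 1);

    for (uint32_t i = 0; i < tail_idx; i += chunk)
        for (uint32_t j = 0; j < chunk; j++)
            if (base[i + j] == key)
                return true;

    if (tail_idx == nelem)
        return false;           /* no tail at all */

    if (nelem >= chunk)
    {
        /*
         * Overlapping window over the last chunk; a few elements are
         * rechecked, which is harmless.
         */
        for (uint32_t j = nelem - chunk; j < nelem; j++)
            if (base[j] == key)
                return true;
    }
    else
    {
        /* array too small for even one window: plain scalar loop */
        for (uint32_t j = 0; j < nelem; j++)
            if (base[j] == key)
                return true;
    }
    return false;
}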

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 3a4d74eeab18d9e8f510e11185109ed910e40268 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 15 Mar 2024 12:26:26 -0500
Subject: [PATCH v2 1/3] pg_lfind32: process "tail" with SIMD instructions

---
 src/include/port/pg_lfind.h | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..9d21284724 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -103,7 +103,7 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 	const uint32 nelem_per_iteration = 4 * nelem_per_vector;
 
 	/* round down to multiple of elements per iteration */
-	const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
+	uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
 
 #if defined(USE_ASSERT_CHECKING)
 	bool		assert_result = false;
@@ -117,9 +117,11 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			break;
 		}
 	}
+	i = 0;
 #endif
 
-	for (i = 0; i < tail_idx; i += nelem_per_iteration)
+retry:
+	for (; i < tail_idx; i += nelem_per_iteration)
 	{
 		Vector32	vals1,
 	vals2,
@@ -157,6 +159,16 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 			return true;
 		}
 	}
+
+	if (i == nelem)
+		return false;
+	else if (tail_idx > 0)
+	{
+		tail_idx = nelem;
+		i = nelem - nelem_per_iteration;
+		goto retry;
+	}
+
 #endif			/* ! USE_NO_SIMD */
 
 	/* Process the remaining elements one at a time. */
-- 
2.25.1

From 0ac61e17b6ed07116086ded2a6a5142da9afa28f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 15 Mar 2024 12:26:52 -0500
Subject: [PATCH v2 2/3] add avx2 support in simd.h

---
 src/include/port/simd.h | 58 -
 1 file changed, 45 insertions(+), 13 deletions(-)

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 597496f2fb..767127b85c 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,7 +18,15 @@
 #ifndef SIMD_H
 #define SIMD_H
 
-#if (defined(__x86_64__) || defined(_M_AMD64))
+#if defined(__AVX2__)
+
+#include 
+#define USE_AVX2
+typedef __m256i Vector8;
+typedef __m256i Vector32;
+
+#elif (defined(__x86_64__) || defined(_M_AMD64))
+
 /*
  * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
  * that compilers targeting this architecture understand SSE2 intrinsics.
@@ -107,7 +115,9 @@ static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
 static inline void
 vector8_load(Vector8 *v, const uint8 *s)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u8(s);
@@ -120,7 +130,9 @@ vector8_load(Vector8 *v, const uint8 *s)
 static inline void
 vector32_load(Vector32 *v, const uint32 *s)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	*v = _mm256_loadu_si256((const __m256i *) s);
+#elif defined(USE_SSE2)
 	*v = _mm_loadu_si128((const __m128i *) s);
 #elif defined(USE_NEON)
 	*v = vld1q_u32(s);
@@ -134,7 +146,9 @@ vector32_load(Vector32 *v, const uint32 *s)
 static inline Vector8
 vector8_broadcast(const uint8 c)
 {
-#if defined(USE_SSE2)
+#if defined(USE_AVX2)
+	return _mm256_set1_epi8(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi8(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u8(c);
@@ -147,7 +161,9 @@ vector8_broadcast(const uint8 c)
 static inline Vector32
 vector32_broadcast(const uint32 c)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_set1_epi32(c);
+#elif defined(USE_SSE2)
 	return _mm_set1_epi32(c);
 #elif defined(USE_NEON)
 	return vdupq_n_u32(c);
@@ -270,7 +286,9 @@ vector8_has_le(const Vector8 v, const uint8 c)
 static inline bool
 vector8_is_highbit_set(const Vector8 v)
 {
-#ifdef USE_SSE2
+#if defined(USE_AVX2)
+	return _mm256_movemask_epi8(v) != 0;
+#elif defined(USE_SSE2)
 	return _mm_movemask_epi8(v) != 0;
 #elif 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread Bertrand Drouvot
Hi,

On Thu, Mar 14, 2024 at 12:24:00PM +0530, Amit Kapila wrote:
> On Wed, Mar 13, 2024 at 9:24 PM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Mar 13, 2024 at 9:21 AM Amit Kapila  wrote:
> > >
> > > > So, how about we turn conflict_reason to only report the reasons that
> > > > actually cause conflict with recovery for logical slots, something
> > > > like below, and then have invalidation_cause as a generic column for
> > > > all sorts of invalidation reasons for both logical and physical slots?
> > >
> > > > If our above understanding is correct then conflict_reason will be a
> > > subset of invalidation_reason. If so, whatever way we arrange this
> > > information, there will be some sort of duplicity unless we just have
> > > one column 'invalidation_reason' and update the docs to interpret it
> > > correctly for conflicts.
> >
> > Yes, there will be some sort of duplicity if we emit conflict_reason
> > as a text field. However, I still think the better way is to turn
> > conflict_reason text to conflict boolean and set it to true only on
> > rows_removed and wal_level_insufficient invalidations. When conflict
> > boolean is true, one (including all the tests that we've added
> > recently) can look for invalidation_reason text field for the reason.
> > This sounds reasonable to me as opposed to we just mentioning in the
> > docs that "if invalidation_reason is rows_removed or
> > wal_level_insufficient it's the reason for conflict with recovery".
> >
> 
> Fair point. I think we can go either way. Bertrand, Nathan, and
> others, do you have an opinion on this matter?

Sounds like a good approach to me, and one will be able to quickly identify
if a conflict occurred.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: MERGE ... RETURNING

2024-03-15 Thread Jeff Davis
On Fri, 2024-03-15 at 11:20 +, Dean Rasheed wrote:
> To recap, this adds support for a single RETURNING list at the end of
> a MERGE command, and a special MERGE_ACTION() function that may be
> used in the RETURNING list to return the action command string
> ('INSERT', 'UPDATE', or 'DELETE') that was executed.

...

> So barring any further objections, I'd like to go ahead and get this
> patch committed.

All of my concerns have been extensively discussed and it seems like
they are just the cost of having a good feature. Thank you for going
through so many alternative approaches, I think the one you've arrived
at is consistent with what Vik endorsed[1].

The MERGE_ACTION keyword is added to the 'col_name_keyword' and the
'bare_label_keyword' lists. That has some annoying effects, like:

   CREATE FUNCTION merge_action() RETURNS TEXT
  LANGUAGE SQL AS $$ SELECT 'asdf'; $$;
   ERROR:  syntax error at or near "("
   LINE 1: CREATE FUNCTION merge_action() RETURNS TEXT

I didn't see any affirmative endorsement of exactly how the keyword is
implemented, but that patch has been around for a while, and I didn't
see any objection, either.

I like this feature from a user perspective. So +1 from me.
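
For reference, the feature as recapped above reads like this in use
(table and column names invented for the example):

MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED AND s.is_deleted THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET val = s.val
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (s.id, s.val)
RETURNING MERGE_ACTION() AS action, t.id;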

Regards,
Jeff Davis

[1]
https://www.postgresql.org/message-id/7db39b45-821f-4894-ada9-c19570b11...@postgresfriends.org




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Alexander Korotkov
On Fri, Mar 15, 2024 at 4:20 PM Alexander Korotkov 
wrote:

> On Mon, Mar 11, 2024 at 12:44 PM Alexander Korotkov
>  wrote:
> > I've decided to put my hands on this patch.
> >
> > On Thu, Mar 7, 2024 at 2:25 PM Amit Kapila 
> wrote:
> > > +1 for the second one not only because it avoids new words in grammar
> > > but also sounds to convey the meaning. I think you can explain in docs
> > > how this feature can be used basically how will one get the correct
> > > LSN value to specify.
> >
> > I picked the second option and left only the AFTER clause for the
> > BEGIN statement.  I think this should be enough for the beginning.
> >
> > > As suggested previously also pick one of the approaches (I would
> > > advocate the second one) and keep an option for the second one by
> > > mentioning it in the commit message. I hope to see more
> > > reviews/discussions or usage like how will users get the LSN value to
> > > be specified on the core logic of the feature at this stage. IF
> > > possible, state, how real-world applications could leverage this
> > > feature.
> >
> > I've added a paragraph to the docs about the usage.  After you made
> > some changes on primary, you run pg_current_wal_insert_lsn().  Then
> > connect to replica and run 'BEGIN AFTER lsn' with the just obtained
> > LSN.  Now you're guaranteed to see the changes made to the primary.
> >
> > Also, I've significantly reworked other aspects of the patch.  The
> > most significant changes are:
> > 1) Waiters are now stored in the array sorted by LSN.  This saves us
> > from scanning the whole per-backend array.
> > 2) Waiters are removed from the array immediately once their LSNs are
> > replayed.  Otherwise, the WAL replayer will keep scanning the shared
> > memory array till waiters wake up.
> > 3) To clean up after errors, we now call WaitLSNCleanup() on backend
> > shmem exit.  I think this is preferable over the previous approach to
> > remove from the queue before ProcessInterrupts().
> > 4) There is now condition to recheck if LSN is replayed after adding
> > to the shared memory array.  This should save from the race
> > conditions.
> > 5) I've renamed too generic names for functions and files.
>
> I went through this patch another time, and made some minor
> adjustments.  Now it looks good, I'm going to push it if no
> objections.
>
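
Item 4 in the list above is the usual publish-then-recheck pattern; a
sketch with hypothetical waiter-array helpers around the real latch API:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "miscadmin.h"
#include "storage/latch.h"

extern void add_self_to_waiters(XLogRecPtr target_lsn);     /* hypothetical */
extern void remove_self_from_waiters(void);                 /* hypothetical */
extern XLogRecPtr get_last_replayed_lsn(void);              /* hypothetical */

static void
wait_for_lsn_replay(XLogRecPtr target_lsn)
{
    /* Publish ourselves before the final check, never after. */
    add_self_to_waiters(target_lsn);

    /*
     * Recheck after insertion: if the LSN was replayed between our earlier
     * check and the insertion above, the replayer may have scanned the
     * array before we appeared, so we must not go to sleep.
     */
    if (get_last_replayed_lsn() >= target_lsn)
    {
        remove_self_from_waiters();
        return;
    }

    while (get_last_replayed_lsn() < target_lsn)
    {
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH, -1, 0);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
    remove_self_from_waiters();
}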

The revised patch version with cosmetic fixes proposed by Alexander Lakhin.

--
Regards,
Alexander Korotkov


v10-0001-Implement-AFTER-clause-for-BEGIN-command.patch
Description: Binary data


Re: Fix expecteddir argument in pg_regress

2024-03-15 Thread Daniel Gustafsson
> On 14 Mar 2024, at 11:45, Daniel Gustafsson  wrote:
> 
>> On 11 Mar 2024, at 09:23, Anthonin Bonnefoy 
>>  wrote:
> 
>> pg_regress accepts the expecteddir argument. However, it is never used
>> and outputdir is used instead to get the expected files paths.
> 
> Nice catch, c855872074b5bf44ecea033674d22fac831cfc31 added --expecteddir
> support to pg_regress but only implemented it for the ECPG tests.  Will have
> another look at this before applying with a backpatch to v16 where
> --expecteddir was added.

Pushed and backpatched, thanks for the submission!

--
Daniel Gustafsson





RE: Popcount optimization using AVX512

2024-03-15 Thread Amonson, Paul D
> -Original Message-
> From: Amonson, Paul D 
> Sent: Friday, March 15, 2024 8:31 AM
> To: Nathan Bossart 
...
> When I tested the code outside postgres in a micro benchmark I got 200-
> 300% improvements. Your results are interesting, as it implies more than
> 300% improvement. Let me do some research on the benchmark you
> referenced. However, in all cases it seems that there is no regression so 
> should
> we move forward on merging while I run some more local tests?

When running a quick test with small buffers (1 to 32K) I see up to about a 740%
improvement. This was using my stand-alone micro benchmark outside of PG. My
original 200-300% numbers were averaged including sizes up to 512MB, which seems
to not run as well on large buffers.  I will try the referenced micro benchmark
on Monday. None of my benchmark testing used the command line "time" command.
For Postgres I set "\timing" before the run, and for the stand-alone benchmark
I took timestamps in the code. In all cases I used -O2 for optimization.

Thanks,
Paul





Re: Weird test mixup

2024-03-15 Thread Thomas Munro
On Sat, Mar 16, 2024 at 7:27 AM Tom Lane  wrote:
> Are there limits on the runtime of CI or cfbot jobs?  Maybe
> somebody should go check those systems.

Those get killed at a higher level after 60 minutes (configurable but
we didn't change it AFAIK):

https://cirrus-ci.org/faq/#instance-timed-out

It's a fresh virtual machine for each run, and after that it's gone
(well the ccache directory survives but only by being
uploaded/downloaded in explicit steps to transmit it between runs).




Re: Reports on obsolete Postgres versions

2024-03-15 Thread Jeremy Schneider
On 3/15/24 3:17 AM, Daniel Gustafsson wrote:
>> On 14 Mar 2024, at 16:48, Peter Eisentraut  wrote:
>> On 13.03.24 18:12, Bruce Momjian wrote:
> 
>>> I think "minor" is a better term since it contrasts with "major".  We
>>> don't actually supply patches to upgrade minor versions.
>>
>> There are potentially different adjectives that could apply to "version" and 
>> "release".
>>
>> The version numbers can be called major and minor, because that just 
>> describes their ordering and significance.
>>
>> But I do agree that "minor release" isn't quite as clear, because one could 
>> also interpret that as "a release, but a bit smaller this time". (Also might 
>> not translate well, since "minor" and "small" could translate to the same 
>> thing.)
> 
> Some of the user confusion likely stems from us using the same nomenclature as
> SemVer, but for different things.  SemVer has become very widely adopted, to
> the point where it's almost assumed by many, so maybe we need to explicitly
> state that we *don't* use SemVer (we don't mention that anywhere in the docs 
> or
> on the website).

Semantic Versioning was definitely part of what led to my confusion
up-thread here. I was also mistaken in what I said up-thread about
MySQL, which also calls "5.7" the "major" version.


>> One could instead, for example, describe those as "maintenance releases":
> 
> That might indeed be a better name for what we provide.

The latest PostgreSQL news item uses the word "update" and seems pretty
well written in this area already (at least to me).

Also I just confirmed, the bug reporting form also seems well written:

"Make sure you are running the latest available minor release for your
major version before reporting a bug. The current list of supported
versions is 16.2, 15.6, 14.11, 13.14, 12.18."

This all looks good, but I do still agree that a gradual shift toward
saying "maintenance update" instead of "minor" might promote more
clarity in the long run?

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: Weird test mixup

2024-03-15 Thread Tom Lane
Heikki Linnakangas  writes:
> On 15/03/2024 16:00, Tom Lane wrote:
>> It may be even worse than it appears from the buildfarm status page.
>> My animals were stuck in infinite loops that required a manual "kill"
>> to get out of, and it's reasonable to assume there are others that
>> will require owner intervention.  Why would this test have done that,
>> if the module failed to load?

> The gin_incomplete_split test inserts rows until it hits the injection 
> point, at page split. There is a backstop, it should give up after 10000
> iterations, but that was broken. Fixed that, thanks for the report!

Duh ...

> Hmm, don't we have any timeout that would kill tests if they get stuck?

AFAIK, the only constraint on a buildfarm animal's runtime is the
wait_timeout setting, which is infinite by default, and was on my
machines.  (Not anymore ;-).)  We do have timeouts in (most?) TAP
tests, but this wasn't a TAP test.

If this is a continuous-insertion loop, presumably it will run the
machine out of disk space eventually, which could be unpleasant
if there are other services running besides the buildfarm.
I think I'll go notify the buildfarm owners list to check for
trouble.

Are there limits on the runtime of CI or cfbot jobs?  Maybe
somebody should go check those systems.

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-15 Thread Kartyshov Ivan

On 2024-03-11 13:44, Alexander Korotkov wrote:

I picked the second option and left only the AFTER clause for the
BEGIN statement.  I think this should be enough for the beginning.


Thank you for your rework on the patch; here I made some fixes:
0) autocomplete
1) fewer jumps
2) more description and added cases in the docs

I think it would be useful to have a stand-alone statement.
Why would you like to see only the AFTER clause for the BEGIN statement?

--
Ivan Kartyshov
Postgres Professional: www.postgrespro.com

diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index 016b021487..759e46ec24 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
@@ -78,6 +81,50 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
 

+
+   
+AFTER lsn_value
+
+ 
+   AFTER clause is used on standby in
+   physical streaming replication
+   and specifies waiting for the specific WAL location (LSN)
+   to be replayed before starting the transaction.
+ 
+ 
+   This option is useful when the user makes some data changes on primary
+   and needs a guarantee to see these changes on standby.  The LSN to wait
+   could be obtained on the primary using
+   pg_current_wal_insert_lsn
+   function after committing the relevant changes.
+ 
+
+   
+
+   
+WITHIN number_of_milliseconds
+
+ 
+   Provides the timeout for the AFTER clause.
+   Especially helpful to prevent freezing on streaming replication
+   connection failures.
+ 
+ 
+   In practice we have three variants of waiting for the current LSN:
+  waiting forever - (finish on PM DEATH or SIGINT)
+ BEGIN AFTER lsn;
+wait for changes to be replayed on standby and then start transaction
+  waiting on timeout -
+ BEGIN AFTER lsn WITHIN 60000;
+wait for changes for 60 seconds and then cancel the wait and start the
+transaction, to prevent freezing on streaming replication connection failures.
+  no wait, just check -
+ BEGIN AFTER lsn WITHIN 1;
+sometimes it is useful just to check whether changes were replayed
+   where number_of_milliseconds can:
+ 
+
+   
   
 
   
@@ -123,6 +170,33 @@ BEGIN [ WORK | TRANSACTION ] [ transaction_mode
 BEGIN;
 
+
+  
+   To begin a transaction block after replaying the given LSN.
+   The command will be canceled if the given LSN is not
+   reached within the timeout of one second.
+
+BEGIN AFTER '0/3F05F791' WITHIN 1000;
+BEGIN
+
+
+  
+   This way we just check (without waiting) whether the LSN was replayed,
+   and if it was, then start the transaction.
+
+BEGIN AFTER '0/3F0FF791' WITHIN 1;
+ERROR:  canceling waiting for LSN due to timeout
+
+
+  
+   To begin a transaction block after replaying the given LSN.
+   The command will be canceled only on user request or postmaster death.
+
+BEGIN AFTER '0/3F0FF791';
+^CCancel request sent
+ERROR:  canceling statement due to user request
+
+
  
 
  
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index 74ccd7e345..46a3bcf1a8 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,16 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+where wait_event is:
+AFTER lsn_value [ WITHIN number_of_milliseconds ]
 
  
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..89b997ad47 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -43,6 +43,7 @@
 #include "backup/basebackup.h"
 #include "catalog/pg_control.h"
 #include "commands/tablespace.h"
+#include "commands/waitlsn.h"
 #include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1828,6 +1829,14 @@ PerformWalRecovery(void)
 break;
 			}
 
+			/*
+			 * If we replayed an LSN that someone was waiting for, set latches
+			 * in shared memory array to notify the waiter.
+			 */
+			if (waitLSN &&
+	(XLogRecoveryCtl->lastReplayedEndRecPtr >= pg_atomic_read_u64(&waitLSN->minLSN)))
+WaitLSNSetLatches(XLogRecoveryCtl->lastReplayedEndRecPtr);
+
 			/* Else, try to fetch the next WAL record 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-15 Thread Bharath Rupireddy
On Fri, Mar 15, 2024 at 12:49 PM shveta malik  wrote:
>
> patch002:
>
> 1)
> I would like to understand the purpose of 'inactive_count'? Is it only
> for users for monitoring purposes? We are not using it anywhere
> internally.

The inactive_count metric helps detect unstable replication slot
connections that have a lot of disconnections. It's not used for the
inactive_timeout based slot invalidation mechanism.

> I shutdown the instance 5 times and found that 'inactive_count' became
> 5 for all the slots created on that instance. Is this intentional?

Yes, it's incremented on shutdown (and for that matter upon every slot
release) for all the slots that are tied to walsenders.

> I mean we can not really use them if the instance is down.  I felt it
> should increment the inactive_count only if during the span of
> instance, they were actually inactive i.e. no streaming or replication
> happening through them.

inactive_count is persisted to disk upon clean shutdown, so once the
slots become active again, one gets to see the metric and deduce some
info about disconnections.
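
Once the columns from this patch set exist, a monitoring query along
these lines could flag unstable slots (last_inactive_at and
inactive_count are this patch set's additions, not released columns):

SELECT slot_name, active, last_inactive_at, inactive_count
FROM pg_replication_slots
WHERE NOT active
ORDER BY inactive_count DESC;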

Having said that, I'm okay to hear from others on the inactive_count
metric being added.

> 2)
> slot.c:
> + case RS_INVAL_XID_AGE:
>
> Can we optimize this code? It has duplicate code for processing
> s->data.catalog_xmin and s->data.xmin. Can we create a sub-function
> for this purpose and call it twice here?

Good idea. Done that way.

> 2)
> The msg for patch 3 says:
> --
> a) when replication slots is lying inactive for a day or so using
> last_inactive_at metric,
> b) when a replication slot is becoming inactive too frequently using
> last_inactive_at metric.
> --
>  I think in b, you want to refer to inactive_count instead of 
> last_inactive_at?

Right. Changed.

> 3)
> I do not see invalidation_reason updated for 2 new reasons in 
> system-views.sgml

Nice catch. Added them now.

I've also responded to Bertrand's comments here.

On Wed, Mar 6, 2024 at 3:56 PM Bertrand Drouvot
 wrote:
>
> A few comments:
>
> 1 ===
>
> +   The reason for the slot's invalidation. NULL if the
> +   slot is currently actively being used.
>
> s/currently actively being used/not invalidated/ ? (I mean it could be valid
> and not being used).

Changed.

> 3 ===
>
> res = executeQueryOrDie(conn, "SELECT slot_name, plugin, two_phase, 
> failover, "
> -   "%s as caught_up, 
> conflict_reason IS NOT NULL as invalid "
> +   "%s as caught_up, 
> invalidation_reason IS NOT NULL as invalid "
> "FROM 
> pg_catalog.pg_replication_slots "
> -   "(CASE WHEN 
> conflict_reason IS NOT NULL THEN FALSE "
> +   "(CASE WHEN 
> invalidation_reason IS NOT NULL THEN FALSE "
>
> Yeah that's fine because there is logical slot filtering here.

Right. And, we really are looking for invalid slots there, so use of
invalidation_reason is much more correct than conflicting.

> 4 ===
>
> -GetSlotInvalidationCause(const char *conflict_reason)
> +GetSlotInvalidationCause(const char *invalidation_reason)
>
> Should we change the comment "Maps a conflict reason" above this function?

Changed.

> 5 ===
>
> -# Check conflict_reason is NULL for physical slot
> +# Check invalidation_reason is NULL for physical slot
>  $res = $node_primary->safe_psql(
> 'postgres', qq[
> -SELECT conflict_reason is null FROM pg_replication_slots 
> where slot_name = '$primary_slotname';]
> +SELECT invalidation_reason is null FROM pg_replication_slots 
> where slot_name = '$primary_slotname';]
>  );
>
>
> I don't think this test is needed anymore: it does not make that much sense 
> since
> it's done after the primary database initialization and startup.

It is now turned into a test verifying 'conflicting boolean' is null
for the physical slot. Isn't that okay?

> 6 ===
>
> 'Logical slots are reported as non conflicting');
>
> What about?
>
> "
> # Verify slots are reported as valid in pg_replication_slots
> 'Logical slots are reported as valid');
> "

Changed.

Please see the attached v11 patch set with all the above review
comments addressed.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v11-0001-Track-invalidation_reason-in-pg_replication_slot.patch
Description: Binary data


v11-0002-Add-XID-age-based-replication-slot-invalidation.patch
Description: Binary data


v11-0003-Track-inactive-replication-slot-information.patch
Description: Binary data


v11-0004-Add-inactive_timeout-based-replication-slot-inva.patch
Description: Binary data


Add TAP tests for backtrace functionality (was Re: Add test module for verifying backtrace functionality)

2024-03-15 Thread Bharath Rupireddy
On Tue, Feb 20, 2024 at 11:30 AM Bharath Rupireddy
 wrote:
>
> On Tue, Feb 13, 2024 at 2:11 AM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > Postgres has a good amount of code for dealing with backtraces - two
> > GUCs backtrace_functions and backtrace_on_internal_error,
> > errbacktrace; all of which use core function set_backtrace from
> > elog.c. I've not seen this code being tested at all, see code coverage
> > report - 
> > https://coverage.postgresql.org/src/backend/utils/error/elog.c.gcov.html.
> >
> > I think adding a simple test module (containing no .c files) with only
> > TAP tests will help cover this code. I ended up having it as a
> > separate module under src/test/modules/test_backtrace as I was not
> > able to find an existing TAP file in src/test to add these tests.  I'm
> > able to verify the backtrace related code with the attached patch
> > consistently. The TAP tests rely on the fact that the server emits
> > text "BACKTRACE: " to server logs before logging the backtrace, and
> > the backtrace contains the function name in which the error occurs.
> > I've turned off query statement logging (set log_statement = none,
> > log_min_error_statement = fatal) so that the tests get to see the
> > functions only in the backtrace. Although the CF bot is happy with the
> > attached patch 
> > https://github.com/BRupireddy2/postgres/tree/add_test_module_for_bcktrace_functionality_v1,
> > there might be some more flakiness to it.
> >
> > Thoughts?
>
> Ran pgperltidy on the new TAP test file added. Please see the attached v2 
> patch.

I've now moved the new TAP test file to src/test/modules/test_misc/t
as opposed to a new test module to keep it simple. I was not sure why
I hadn't done that in the first place.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v3-0001-Add-TAP-tests-for-backtrace-functionality.patch
Description: Binary data