Re: [PATCH] Exponential backoff for auth_delay

2024-01-03 Thread Michael Banck
Hi,

On Wed, Dec 27, 2023 at 05:19:54PM +0100, Michael Banck wrote:
> This patch adds exponential backoff so that one can choose a small
> initial value which gets doubled for each failed authentication attempt
> until a maximum wait time (which is 10s by default, but can be disabled
> if so desired).
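For illustration, a possible configuration with this patch applied (the
values are only an example; auth_delay.exp_backoff and auth_delay.max_seconds
are the new GUCs):

    # postgresql.conf
    shared_preload_libraries = 'auth_delay'
    auth_delay.milliseconds = '125'    # initial delay after a failed attempt
    auth_delay.exp_backoff = on        # double the delay for each further failure
    auth_delay.max_seconds = 10        # cap: 125ms, 250ms, 500ms, ... up to 10s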

Here is a new version, hopefully fixing warnings in the documentation
build, per cfbot.


Michael
>From 579f6ce8f968464af06a4695b7a3b66ee94716c8 Mon Sep 17 00:00:00 2001
From: Michael Banck 
Date: Wed, 27 Dec 2023 15:55:39 +0100
Subject: [PATCH v2] Add optional exponential backoff to auth_delay contrib
 module.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This adds two new GUCs for auth_delay, exp_backoff and max_seconds. The former
controls whether exponential backoff should be used or not; the latter sets a
maximum delay (default is 10s) in case exponential backoff is active.

The exponential backoff is tracked per remote host and doubled for every failed
login attempt (i.e., wrong password, not just missing pg_hba line or database)
and reset to auth_delay.milliseconds after a successful authentication from
that host.

This patch is partly based on a larger (but ultimately rejected) patch by
成之焕.

Authors: Michael Banck, 成之焕
Discussion: https://postgr.es/m/ahwaxacqiwivoehs5yejpqog.1.1668569845751.hmail.zhch...@ceresdata.com
---
 contrib/auth_delay/auth_delay.c  | 202 ++-
 doc/src/sgml/auth-delay.sgml |  41 +++
 src/tools/pgindent/typedefs.list |   1 +
 3 files changed, 243 insertions(+), 1 deletion(-)

diff --git a/contrib/auth_delay/auth_delay.c b/contrib/auth_delay/auth_delay.c
index 8d6e4d2778..95e56db6ec 100644
--- a/contrib/auth_delay/auth_delay.c
+++ b/contrib/auth_delay/auth_delay.c
@@ -14,24 +14,50 @@
 #include 
 
 #include "libpq/auth.h"
+#include "miscadmin.h"
 #include "port.h"
+#include "storage/ipc.h"
+#include "storage/shmem.h"
 #include "utils/guc.h"
 #include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
+#define MAX_CONN_RECORDS 50
+
 /* GUC Variables */
 static int	auth_delay_milliseconds = 0;
+static bool auth_delay_exp_backoff = false;
+static int	auth_delay_max_seconds = 0;
 
 /* Original Hook */
 static ClientAuthentication_hook_type original_client_auth_hook = NULL;
 
+typedef struct AuthConnRecord
+{
+	char		remote_host[NI_MAXHOST];
+	bool		used;
+	double		sleep_time;		/* in milliseconds */
+} AuthConnRecord;
+
+static shmem_startup_hook_type shmem_startup_next = NULL;
+static shmem_request_hook_type shmem_request_next = NULL;
+static AuthConnRecord *acr_array = NULL;
+
+static AuthConnRecord *find_conn_record(char *remote_host, int *free_index);
+static double record_failed_conn_auth(Port *port);
+static double find_conn_max_delay(void);
+static void record_conn_failure(AuthConnRecord *acr);
+static void cleanup_conn_record(Port *port);
+
 /*
  * Check authentication
  */
 static void
 auth_delay_checks(Port *port, int status)
 {
+	double		delay;
+
 	/*
 	 * Any other plugins which use ClientAuthentication_hook.
 	 */
@@ -43,8 +69,150 @@ auth_delay_checks(Port *port, int status)
 	 */
 	if (status != STATUS_OK)
 	{
-		pg_usleep(1000L * auth_delay_milliseconds);
+		if (auth_delay_exp_backoff)
+		{
+			/*
+			 * Exponential backoff per remote host.
+			 */
+			delay = record_failed_conn_auth(port);
+			if (auth_delay_max_seconds > 0)
+				delay = Min(delay, 1000L * auth_delay_max_seconds);
+		}
+		else
+			delay = auth_delay_milliseconds;
+		if (delay > 0)
+		{
+			elog(DEBUG1, "Authentication delayed for %g seconds", delay / 1000.0);
+			pg_usleep(1000L * (long) delay);
+		}
+	}
+	else
+	{
+		cleanup_conn_record(port);
+	}
+}
+
+static double
+record_failed_conn_auth(Port *port)
+{
+	AuthConnRecord *acr = NULL;
+	int			j = -1;
+
+	acr = find_conn_record(port->remote_host, &j);
+
+	if (!acr)
+	{
+		if (j == -1)
+
+			/*
+			 * No free space, MAX_CONN_RECORDS reached. Wait as long as the
+			 * largest delay for any remote host.
+			 */
+			return find_conn_max_delay();
+		acr = &acr_array[j];
+		strcpy(acr->remote_host, port->remote_host);
+		acr->used = true;
+		elog(DEBUG1, "new connection: %s, index: %d", acr->remote_host, j);
+	}
+
+	record_conn_failure(acr);
+	return acr->sleep_time;
+}
+
+static AuthConnRecord *
+find_conn_record(char *remote_host, int *free_index)
+{
+	int			i;
+
+	*free_index = -1;
+	for (i = 0; i < MAX_CONN_RECORDS; i++)
+	{
+		if (!acr_array[i].used)
+		{
+			if (*free_index == -1)
+				/* record unused element */
+				*free_index = i;
+			continue;
+		}
+		if (strcmp(acr_array[i].remote_host, remote_host) == 0)
+			return &acr_array[i];
+	}
+
+	return NULL;
+}
+
+static double
+find_conn_max_delay(void)
+{
+	int			i;
+	double		max_delay = 0.0;
+
+
+	for (i = 0; i < MAX_CONN_RECORDS; i++)
+	{
+		if (acr_array[i].used && acr_array[i].sleep_time > max_delay)
+			max_delay = acr_array[i].sleep_time;
 	}
+
+	return max_delay;
+}
+
+static void
+record_conn_failure(AuthConnRecord *acr

the s_lock_stuck on perform_spin_delay

2024-01-03 Thread Andy Fan

Hi,

from src/backend/storage/lmgr/README:

"""
Spinlocks.  These are intended for *very* short-term locks.  If a lock
is to be held more than a few dozen instructions, or across any sort of
kernel call (or even a call to a nontrivial subroutine), don't use a
spinlock. Spinlocks are primarily used as infrastructure for lightweight
locks.
"""

I totally agree with this, and IIUC a spinlock is usually used with the
following functions.

#define init_local_spin_delay(status) ..
void perform_spin_delay(SpinDelayStatus *status);
void finish_spin_delay(SpinDelayStatus *status);

In perform_spin_delay, we have the following code:

void
perform_spin_delay(SpinDelayStatus *status)

	/* Block the process every spins_per_delay tries */
	if (++(status->spins) >= spins_per_delay)
	{
		if (++(status->delays) > NUM_DELAYS)
			s_lock_stuck(status->file, status->line, status->func);

The s_lock_stuck will PANIC the entire system.

My question is: if someone doesn't obey the rule by mistake (everyone
can make mistakes), shall we PANIC in a production environment? IMO it
could be a WARNING in a production environment, and only call s_lock_stuck
in builds with USE_ASSERT_CHECKING.

People may think a spinlock may consume too much CPU, but that is not true
in the discussed scenario since perform_spin_delay has pg_usleep in it,
and MAX_DELAY_USEC is 1 second while MIN_DELAY_USEC is 0.001s.

I actually noticed this issue because of the patch "Cache relation
sizes?" from Thomas Munro [1], where the latest patch [2] still has the
following code.
+   sr = smgr_alloc_sr();  <-- HERE a spinlock is held
+
+   /* Upgrade to exclusive lock so we can create a mapping. */
+   LWLockAcquire(mapping_lock, LW_EXCLUSIVE); <-- HERE a complex
                                                   operation is needed; it may take a long time.

Our internal testing system found more misuses on our own PG version.

If an experienced engineer like Thomas can make this mistake, and the bug
is still there even though the patch was reviewed by 3 people, it is not
easy to just say "don't do it".

The attached code shows the prototype I have in mind. Any feedback is welcome.

[1]
https://www.postgresql.org/message-id/CA%2BhUKGJg%2BgqCs0dgo94L%3D1J9pDp5hKkotji9A05k2nhYQhF4%2Bw%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/attachment/123659/v5-0001-WIP-Track-relation-sizes-in-shared-memory.patch
  

-- 
Best Regards
Andy Fan

>From 7d7fd0f0e9b13a290bfffaec0ad40773191155f2 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Thu, 4 Jan 2024 14:33:37 +0800
Subject: [PATCH v1 1/1] Don't call s_lock_stuck in production environment

In the current implementation, if a spin lock is misused, the
s_lock_stuck in perform_spin_delay can cause the entire system to
PANIC. In order to balance fault tolerance and the ability to detect
incorrect usage, we can use WARNING to replace s_lock_stuck in the
production environment, but still use s_lock_stuck in builds with
USE_ASSERT_CHECKING.
---
 src/backend/storage/lmgr/s_lock.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 327ac64f7c..9446605122 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -67,6 +67,7 @@ slock_t		dummy_spinlock;
 static int	spins_per_delay = DEFAULT_SPINS_PER_DELAY;
 
 
+#ifdef USE_ASSERT_CHECKING
 /*
  * s_lock_stuck() - complain about a stuck spinlock
  */
@@ -85,6 +86,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 		 func, file, line);
 #endif
 }
+#endif
 
 /*
  * s_lock(lock) - platform-independent portion of waiting for a spinlock.
@@ -132,7 +134,14 @@ perform_spin_delay(SpinDelayStatus *status)
 	if (++(status->spins) >= spins_per_delay)
 	{
 		if (++(status->delays) > NUM_DELAYS)
+		{
+#ifdef USE_ASSERT_CHECKING
 			s_lock_stuck(status->file, status->line, status->func);
+#else
+			if (status->delays % NUM_DELAYS == 0)
+				elog(WARNING, "perform spin lock on %s %d times", status->func, status->delays);
+#endif
+		}
 
 		if (status->cur_delay == 0) /* first time to delay? */
 			status->cur_delay = MIN_DELAY_USEC;
-- 
2.34.1



Re: add function argument names to regex* functions.

2024-01-03 Thread jian he
On Thu, Jan 4, 2024 at 7:26 AM Jim Nasby  wrote:
>
> On 1/3/24 5:05 PM, Dian Fay wrote:
>
> Another possibility is `index`, which is relatively short and not a
> reserved keyword ^1. `position` is not as precise but would avoid the
> conceptual overloading of ordinary indices.
>
> I'm not a fan of "index" since that leaves the question of
> whether it's 0 or 1 based. "Position" is a bit better, but I think
> Jian's suggestion of "occurance" is best.
>
> We do have precedent for one-based `index` in Postgres: array types are
> 1-indexed by default! "Occurrence" removes that ambiguity but it's long
> and easy to misspell (I looked it up after typing it just now and it
> _still_ feels off).
>
> How's "instance"?
>
> Presumably someone referencing arguments by name would have just looked up 
> the names via \df or whatever, so presumably misspelling wouldn't be a big 
> issue. But I think "instance" is OK as well.
>
> --
> Jim Nasby, Data Architect, Austin TX

regexp_instr: It has the syntax regexp_instr(string, pattern [, start
[, N [, endoption [, flags [, subexpr ]]]]])
oracle:
REGEXP_INSTR(source_char, pattern [, position [, occurrence
[, return_opt [, match_param [, subexpr ]]]]])

"string" and "source_char" are about equally descriptive, so maybe
there is no need to change it.
"start" is better than "position", imho.
"return_opt" is better than "endoption" (maybe we need to change it;
for now I didn't).
"flags" cannot be changed to "match_param", given that it appears pretty
much everywhere in functions-matching.html.

Similarly for the function regexp_replace: Oracle uses "replace_string",
while we use "replacement" (mentioned in the doc),
so I don't think we need to change it to "replace_string".

Based on how people google [0], I think `occurrence` is OK, even though
it's verbose.
To change from `N` to `occurrence`, we also need to change the docs,
which is why this patch is larger.


[0]: 
https://www.google.com/search?q=regex+nth+match&oq=regex+nth+match&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg8MgYIAhBFGDzSAQc2MThqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8
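As a quick illustration, with the patch applied the proposed parameter names
allow named (or mixed) notation, e.g. (a sketch assuming the patch's names):

select regexp_instr('ABCABCABC', 'B', start => 1, occurrence => 2);
-- returns 5, the position of the second 'B'
select regexp_instr(string => 'ABCABCABC', pattern => 'B',
                    start => 1, occurrence => 2);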
From cb720fd696df3a34032ec2adaea712604a7e3d42 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 4 Jan 2024 13:02:35 +0800
Subject: [PATCH v2 1/1] add function argument names to regex.* functions.

 Specifically, add function argument names to the following functions:
  regexp_replace,
  regexp_match,
  regexp_matches
  regexp_count,
  regexp_instr,
  regexp_like,
  regexp_substr,
  regexp_split_to_table
  regexp_split_to_array
 So these functions are easier to understand in psql via \df.
 Also, more importantly, these functions can now be called using different notations.
---
 doc/src/sgml/func.sgml  | 38 +-
 src/include/catalog/pg_proc.dat | 71 ++---
 2 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cec21e42..b04a427a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3299,7 +3299,7 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
 
 regexp_instr ( string text, pattern text
  [, start integer
- [, N integer
+ [, occurrence integer
  [, endoption integer
  [, flags text
  [, subexpr integer ] ] ] ] ] )
@@ -3307,7 +3307,7 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in


 Returns the position within string where
-the N'th match of the POSIX regular
+the occurrence'th match of the POSIX regular
 expression pattern occurs, or zero if there is
 no such match; see .

@@ -3478,14 +3478,14 @@ SELECT NOT(ROW(table.*) IS NOT NULL) FROM TABLE; -- detect at least one null in
 
 regexp_substr ( string text, pattern text
  [, start integer
- [, N integer
+ [, occurrence integer
  [, flags text
  [, subexpr integer ] ] ] ] )
 text


 Returns the substring within string that
-matches the N'th occurrence of the POSIX
+matches the occurrence'th occurrence of the POSIX
 regular expression pattern,
 or NULL if there is no such match; see
 .
@@ -5888,13 +5888,13 @@ regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  4
 
 
  The regexp_instr function returns the starting or
- ending position of the N'th match of a
+ ending position of the occurrence'th match of a
  POSIX regular expression pattern to a string, or zero if there is no
  such match.  It has the syntax
  regexp_instr(string,
  pattern
  , start
- , N
+ , occurrence
  , endoption
  , flags
  , subexpr
@@ -5903,8 +5903,8 @@ regexp_count('ABCABCAXYaxy', 'A.', 1, 'i')  4
  in string, normally from the beginning of
  the string, but if the start parameter is
  provided then beginning from that char

Re: speed up a logical replica setup

2024-01-03 Thread Ashutosh Bapat
On Wed, Jan 3, 2024 at 2:49 PM Amit Kapila  wrote:
>
> > c) Drop the replication slots d) Drop the
> > publications
> >
>
> I am not so sure about dropping publications because, unlike
> subscriptions which can start to pull the data, there is no harm with
> publications. Similar to publications, there could be some user-defined
> functions or other objects which may not be required once the
> standby replica is converted to subscriber. I guess we need to leave
> those to the user.
>

IIUC, the primary use of the pg_subscriber utility is to start a logical
subscription from a physical base backup (to reduce initial sync time),
as against the logical copy taken while creating a subscription. Hence I
am expecting that, apart from this difference, the resultant logical
replica should look similar to a logical replica set up using a
logical subscription sync. Hence we should not leave any replication
objects around. UDFs (views, and other objects) may have some use on a
logical replica. We may replicate changes to UDFs once DDL replication
is supported. But what good is having the same publications as the primary
on the logical replica as well?

-- 
Best Wishes,
Ashutosh Bapat




Re: speed up a logical replica setup

2024-01-03 Thread Shlok Kyal
Hi,
I was testing the patch with following test cases:

Test 1:
- Create a 'primary' node
- Setup physical replica using pg_basebackup: "./pg_basebackup -h
localhost -X stream -v -R -W -D ../standby"
- Insert data before and after pg_basebackup
- Run pg_subscriber and then insert some data to check logical
replication: "./pg_subscriber -D ../standby -S 'host=localhost
port=9000 dbname=postgres' -P 'host=localhost port=9000
dbname=postgres' -d postgres"
- Also check the pg_publication, pg_subscription and pg_replication_slots tables.

Observation:
Data is not lost. Replication is happening correctly. Pg_subscriber is
working as expected.

Test 2:
- Create a 'primary' node
- Use normal pg_basebackup but don't set up physical replication:
"./pg_basebackup -h localhost -v -W -D ../standby"
- Insert data before and after pg_basebackup
- Run pg_subscriber

Observation:
Pg_subscriber command is not completing and is stuck with following
log repeating:
LOG: waiting for WAL to become available at 0/3000168
LOG: invalid record length at 0/3000150: expected at least 24, got 0

Test 3:
- Create a 'primary' node
- Use normal pg_basebackup but don't set up physical replication:
"./pg_basebackup -h localhost -v -W -D ../standby"
- Insert data before pg_basebackup but not after pg_basebackup
- Run pg_subscriber

Observation:
Pg_subscriber command is not completing and is stuck with following
log repeating:
LOG: waiting for WAL to become available at 0/3000168
LOG: invalid record length at 0/3000150: expected at least 24, got 0

I was not clear about how to use pg_basebackup in this case; can you
let me know if any changes need to be made for tests 2 and 3?

Thanks and regards
Shlok Kyal




Re: Synchronizing slots from primary to standby

2024-01-03 Thread Dilip Kumar
On Wed, Jan 3, 2024 at 4:20 PM Amit Kapila  wrote:
>
> On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila  wrote:

> > I see your point and agree that users need to be careful. I was trying
> > to compare it with other places like the conninfo used with a
> > subscription where no separate dbname needs to be provided. Now, here
> > the situation is not the same because the same conninfo is used for
> > different purposes (walreceiver doesn't require dbname (dbname is
> > ignored even if present) whereas slotsyncworker requires dbname). I
> > was just trying to see if we can avoid having a new GUC for this
> > purpose. Does anyone else have an opinion on this matter?
> >
>
> Bertrand, Dilip, and others involved in this thread or otherwise, see
> if you can share an opinion on the above point because it would be
> good to get some more opinions before we decide to add a new GUC (for
> dbname) for slotsync worker.

IMHO, as of now we can just use primary_conninfo and let the user
modify it to add the dbname.  In the future, if this
creates some discomfort or we see some complaints about the usage, then
we can expand the behavior by providing an additional GUC for the dbname.
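For example, on the standby that would just mean something like this
(connection values are hypothetical; primary_conninfo only needs a reload):

ALTER SYSTEM SET primary_conninfo = 'host=primary port=5432 user=repluser dbname=postgres';
SELECT pg_reload_conf();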

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: speed up a logical replica setup

2024-01-03 Thread Amit Kapila
On Thu, Jan 4, 2024 at 8:52 AM Euler Taveira  wrote:
>
> On Mon, Jan 1, 2024, at 7:14 AM, vignesh C wrote:
>
>
> 5) I felt the target server should be started before completion of
> pg_subscriber:
>
>
> Why?
>

Won't it be a better user experience if, after setting up the target
server as a logical replica (subscriber), it starts to work
seamlessly without user intervention?

> The initial version had an option to stop the subscriber. I decided to
> remove the option and stop the subscriber by default mainly because (1) it is
> an extra step to start the server (another point is that the WAL retention
> doesn't happen due to additional (synchronized?) replication slots on
> subscriber -- point 2). It was a conservative choice. If point 2 isn't an
> issue, imo point 1 is no big deal.
>

By point 2, do you mean to have a check for "max_replication_slots"?
If so, one possibility is to even increase that config, if the
required max_replication_slots is low.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-01-03 Thread Amit Kapila
On Thu, Jan 4, 2024 at 8:24 AM Euler Taveira  wrote:
>
> On Thu, Dec 21, 2023, at 3:16 AM, Amit Kapila wrote:
>
> > 2. remove the physical replication slot if the standby is using one
> >(primary_slot_name).
> > 3. provide instructions to promote the logical replica into primary, I mean,
> >stop the replication between the nodes and remove the replication setup
> >(publications, subscriptions, replication slots). Or even include another
> >action to do it. We could add both too.
> >
> > Point 1 should be done. Points 2 and 3 aren't essential but will provide a 
> > nice
> > UI for users that would like to use it.
> >
>
> Isn't point 2 also essential because how would otherwise such a slot
> be advanced or removed?
>
>
> I'm worried about a scenario that you will still use the primary. (Let's say
> the logical replica will be promoted to a staging or dev server.) No 
> connection
> between primary and this new server so the primary slot is useless after the
> promotion.
>

So, you also seem to be saying that it is not required once
pg_subscriber has promoted it. So why should it be optional to remove
the physical replication slot? I think we must remove it from the primary
unless there is some other reason.

> A few other points:
> ==
> 1. Previously, I asked whether we need an additional replication slot
> patch created to get consistent LSN and I see the following comment in
> the patch:
>
> + *
> + * XXX we should probably use the last created replication slot to get a
> + * consistent LSN but it should be changed after adding pg_basebackup
> + * support.
>
> Yeah, sure, we may want to do that after backup support and we can
> keep a comment for the same but I feel as the patch stands today,
> there is no good reason to keep it.
>
>
> I'll remove the comment to avoid confusion.
>

My point is to not have an additional slot and keep a comment
indicating that we need an extra slot once we add pg_basebackup
support.

>
> 2.
> + appendPQExpBuffer(str,
> +   "CREATE SUBSCRIPTION %s CONNECTION '%s' PUBLICATION %s "
> +   "WITH (create_slot = false, copy_data = false, enabled = false)",
> +   dbinfo->subname, dbinfo->pubconninfo, dbinfo->pubname);
>
> Shouldn't we enable two_phase by default for newly created
> subscriptions? Is there a reason for not doing so?
>
>
> Why? I decided to keep the default for some settings (streaming,
> synchronous_commit, two_phase, disable_on_error). Unless there is a compelling
> reason to enable it, I think we should use the default. Either way, data will
> arrive on subscriber as soon as the prepared transaction is committed.
>

I thought we could provide a better experience for logical replicas
created by default but I see your point and probably keeping default
values for parameters you mentioned seems reasonable to me.

> 3. How about sync slots on the physical standby if present? Do we want
> to retain those as it is or do we need to remove those? We are
> actively working on the patch [1] for the same.
>
>
> I didn't read the current version of the referred patch but if the proposal is
> to synchronize logical replication slots only if you are using physical
> replication, as soon as pg_subscriber finishes the execution, there won't be
> synchronization on these logical replication slots because there isn't a
> physical replication anymore. If the goal is a promotion, the current behavior
> is correct because the logical replica will retain WAL since it was converted.
>

I don't understand what you mean by promotion in this context. If
users want to simply promote the standby then there is no need to do
additional things that this tool is doing.

> However, if you are creating a logical replica, this WAL retention is not good
> and the customer should eventually remove these logical replication slots on
> the logical replica.
>

I think asking users to manually remove such slots won't be a good
idea. We might want to either remove them by default or provide an
option to the user.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-01-03 Thread shveta malik
On Wed, Jan 3, 2024 at 6:33 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, January 2, 2024 6:32 PM shveta malik  
> wrote:
> > On Fri, Dec 29, 2023 at 10:25 AM Amit Kapila 
> >
> > The top-up patch has also changed app_name to
> > {cluster_name}_slotsyncworker so that we do not confuse the walreceiver
> > and slotsyncworker entries.
> >
> > Please note that there is no change in rest of the patches, changes are in
> > additional 0004 patch alone.
>
> Attached is the V56 patch set, which supports ALTER SUBSCRIPTION SET (failover).
> This is useful when users want to refresh the publication tables: they can now
> alter the failover option to false and then execute the refresh command.
>
> Best Regards,
> Hou zj
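For context, the workflow described above would look something like this
(the subscription name is hypothetical, and the failover option comes from
the patch set under discussion):

ALTER SUBSCRIPTION sub SET (failover = false);
ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
ALTER SUBSCRIPTION sub SET (failover = true);   -- re-enable afterwards if desired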

The patches no longer apply to HEAD due to a recent commit 007693f. I
am working on rebasing and will post the new patches soon

thanks
Shveta




Re: speed up a logical replica setup

2024-01-03 Thread Euler Taveira
On Mon, Jan 1, 2024, at 7:14 AM, vignesh C wrote:
> 1) This Assert can fail if source is shutdown:
> +static void
> +drop_replication_slot(PGconn *conn, LogicalRepInfo *dbinfo, const
> char *slot_name)
> +{
> +   PQExpBuffer str = createPQExpBuffer();
> +   PGresult   *res;
> +
> +   Assert(conn != NULL);

Oops. I'll remove it.

> 2) Should we have some checks to see if the max replication slot
> configuration is ok based on the number of slots that will be created,
> we have similar checks in upgrade replication slots in
> check_new_cluster_logical_replication_slots

That's a good idea.

> 3) Should we check if wal_level is set to logical, we have similar
> checks in upgrade replication slots in
> check_new_cluster_logical_replication_slots

That's a good idea.
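For reference, a sketch of the equivalent manual checks on the source server
(assuming pg_subscriber needs roughly one slot per database it handles):

SHOW wal_level;   -- must be 'logical'
SELECT current_setting('max_replication_slots')::int
       - (SELECT count(*) FROM pg_replication_slots) AS free_slots;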

> 4) The physical replication slot that was created will still be
> present in the primary node, I felt this should be removed.

My proposal is to remove it [1]. It'll be included in the next version.

> 5) I felt the target server should be started before completion of
> pg_subscriber:

Why? The initial version had an option to stop the subscriber. I decided to
remove the option and stop the subscriber by default mainly because (1) it is
an extra step to start the server (another point is that the WAL retention
doesn't happen due to additional (synchronized?) replication slots on
subscriber -- point 2). It was a conservative choice. If point 2 isn't an
issue, imo point 1 is no big deal.


[1] 
https://www.postgresql.org/message-id/e02a2c17-22e5-4ba6-b788-de696ab74f1e%40app.fastmail.com


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Change GUC hashtable to use simplehash?

2024-01-03 Thread jian he
On Wed, Jan 3, 2024 at 10:12 PM John Naylor  wrote:
>
> On Tue, Jan 2, 2024 at 6:56 AM jian he  wrote:
> >
> > My local computer is slow. but here is the test results:
> >
> > select * from bench_cstring_hash_aligned(10);    7318.893 ms
> > select * from bench_cstring_hash_unaligned(10); 10383.173 ms
> > select * from bench_pgstat_hash(10);             4474.989 ms
> > select * from bench_pgstat_hash_fh(10);          9192.245 ms
> > select * from bench_string_hash(10);             2048.008 ms
>
> This presents a 2x to 5x slowdown, so I'm skeptical this is typical --
>  what kind of platform is it? For starters, what CPU and compiler?

I still cannot git apply your patch cleanly, and I cannot find your patch
on http://cfbot.cputube.org/
(so it might be that I tested based on incomplete information),
but only hashfn_unstable.h influences bench_hash/bench_hash.c.

So I attached the whole patch that I git applied, that is, the
changes I applied for the following tests.
Here is how I test using pgbench:
pgbench --no-vacuum  --time=20  --file
/home/jian/tmp/bench_cstring_hash_aligned.sql -M prepared  | grep
latency

The following was tested on another machine (machine specs are listed below).
I tested 3 times; the results are very similar, as follows:
select * from bench_cstring_hash_aligned(10);    4705.686 ms
select * from bench_cstring_hash_unaligned(10);  6835.753 ms
select * from bench_pgstat_hash(10);             2678.978 ms
select * from bench_pgstat_hash_fh(10);          6199.017 ms
select * from bench_string_hash(10);              847.699 ms

src6=# select version();
                              version
---------------------------------------------------------------------
 PostgreSQL 17devel on x86_64-linux, compiled by gcc-11.4.0, 64-bit
(1 row)

jian@jian:~/Desktop/pg_src/src6/postgres$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

lscpu:
Architecture:x86_64
  CPU op-mode(s):32-bit, 64-bit
  Address sizes: 46 bits physical, 48 bits virtual
  Byte Order:Little Endian
CPU(s):  20
  On-line CPU(s) list:   0-19
Vendor ID:   GenuineIntel
  Model name:Intel(R) Core(TM) i5-14600K
CPU family:  6
Model:   183
Thread(s) per core:  2
Core(s) per socket:  14
Socket(s):   1
Stepping:1
CPU max MHz: 5300.
CPU min MHz: 800.
BogoMIPS:6988.80
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
         pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
         syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon
         pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf
         tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
         tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe
         popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
         abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp
         ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad
         fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed
         adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec
         xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln
         pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi
         umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri
         movdir64b fsrm md_clear serialize pconfig arch_lbr ibt
         flush_l1d arch_capabilities
Virtualization features:
  Virtualization:VT-x
Caches (sum of all):
  L1d:   544 KiB (14 instances)
  L1i:   704 KiB (14 instances)
  L2:20 MiB (8 instances)
  L3:24 MiB (1 instance)
NUMA:
  NUMA node(s):  1
  NUMA node0 CPU(s): 0-19
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit: Not affected
  L1tf:  Not affected
  Mds:   Not affected
  Meltdown:  Not affected
  Mmio stale data:   Not affected
  Retbleed:  Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:Mitigation; usercopy/swapgs barriers and
__user pointer sanitization
  Spectre v2:Mitigation; Enhanced IBRS, IBPB conditional,
RSB filling, PBRSB-eIBRS SW sequence
  Srbds: Not affected
  Tsx async abort:   Not affected

jian@jian:~/Desktop/pg_src/src6/postgres$ git log
commit bbbf8cd54a05ad5c92e79c96133f219e80fad77c (HEAD -> master)
Author: j

Re: speed up a logical replica setup

2024-01-03 Thread Euler Taveira
On Thu, Dec 21, 2023, at 3:16 AM, Amit Kapila wrote:
> I think this is an important part. Shall we try to write to some file
> the pending objects to be cleaned up? We do something like that during
> the upgrade.

That's a good idea.

> > 2. remove the physical replication slot if the standby is using one
> >(primary_slot_name).
> > 3. provide instructions to promote the logical replica into primary, I mean,
> >stop the replication between the nodes and remove the replication setup
> >(publications, subscriptions, replication slots). Or even include another
> >action to do it. We could add both too.
> >
> > Point 1 should be done. Points 2 and 3 aren't essential but will provide a 
> > nice
> > UI for users that would like to use it.
> >
> 
> Isn't point 2 also essential because how would otherwise such a slot
> be advanced or removed?

I'm worried about a scenario that you will still use the primary. (Let's say
the logical replica will be promoted to a staging or dev server.) No connection
between primary and this new server so the primary slot is useless after the
promotion.

> A few other points:
> ==
> 1. Previously, I asked whether we need an additional replication slot
> patch created to get consistent LSN and I see the following comment in
> the patch:
> 
> + *
> + * XXX we should probably use the last created replication slot to get a
> + * consistent LSN but it should be changed after adding pg_basebackup
> + * support.
> 
> Yeah, sure, we may want to do that after backup support and we can
> keep a comment for the same but I feel as the patch stands today,
> there is no good reason to keep it.

I'll remove the comment to avoid confusion.

> Also, is there a reason that we
> can't create the slots after backup is complete and before we write
> recovery parameters

No.

> 2.
> + appendPQExpBuffer(str,
> +   "CREATE SUBSCRIPTION %s CONNECTION '%s' PUBLICATION %s "
> +   "WITH (create_slot = false, copy_data = false, enabled = false)",
> +   dbinfo->subname, dbinfo->pubconninfo, dbinfo->pubname);
> 
> Shouldn't we enable two_phase by default for newly created
> subscriptions? Is there a reason for not doing so?

Why? I decided to keep the default for some settings (streaming,
synchronous_commit, two_phase, disable_on_error). Unless there is a compelling
reason to enable it, I think we should use the default. Either way, data will
arrive on subscriber as soon as the prepared transaction is committed.
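For reference, enabling it would just mean adding the option to the statement
the tool already builds, e.g. (names and connection string are placeholders):

CREATE SUBSCRIPTION mysub CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (create_slot = false, copy_data = false, enabled = false, two_phase = true);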

> 3. How about sync slots on the physical standby if present? Do we want
> to retain those as it is or do we need to remove those? We are
> actively working on the patch [1] for the same.

I didn't read the current version of the referred patch but if the proposal is
to synchronize logical replication slots only if you are using physical
replication, as soon as pg_subscriber finishes the execution, there won't be
synchronization on these logical replication slots because there isn't a
physical replication anymore. If the goal is a promotion, the current behavior
is correct because the logical replica will retain WAL since it was converted.
However, if you are creating a logical replica, this WAL retention is not good
and the customer should eventually remove these logical replication slots on
the logical replica.

> 4. Can we see some numbers with various sizes of databases (cluster)
> to see how it impacts the time for small to large-size databases as
> compared to the traditional method? This might help us with giving
> users advice on when to use this tool. We can do this bit later as
> well when the patch is closer to being ready for commit.

I'll share it.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Transaction timeout

2024-01-03 Thread Japin Li


On Wed, 03 Jan 2024 at 20:04, Andrey M. Borodin  wrote:
>> On 3 Jan 2024, at 16:46, Andrey M. Borodin  wrote:
>>
>>  I do not understand why, but mailing list did not pick patches that I sent. 
>> I'll retry.
>
>
> Sorry for the noise. Seems like Apple updated something in Mail.App couple of 
> days ago and it started to use strange "Apple-Mail" stuff by default.
> I see patches were attached, but were not recognized by mailing list archives 
> and CFbot.
> Now I've flipped everything to "plain text by default" everywhere. Hope that 
> helps.
>

Thanks for updating the patch. I find the test on Debian with meson failed [1].

Is the timeout too short for testing? I see the timeouts for lock_timeout
and statement_timeout are much bigger than transaction_timeout.

[1] 
https://api.cirrus-ci.com/v1/artifact/task/5490718928535552/testrun/build-32/testrun/isolation/isolation/regression.diffs




Re: Update for copyright messages to 2024 (Happy New Year!)

2024-01-03 Thread Bruce Momjian
On Mon, Jan  1, 2024 at 07:25:16PM +0900, Michael Paquier wrote:
> Hi,
> (CC-ing Bruce)
> 
> As of this new year, and in the vein of c8e1ba736b2b.  Bruce, are you
> planning an update of the copyright dates with a run of
> ./src/tools/copyright.pl on HEAD, and the smallish updates of the back
> branches?
> 
> And of course, Happy New Year to all!

Done, sorry for the delay.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Add a perl function in Cluster.pm to generate WAL

2024-01-03 Thread Michael Paquier
On Wed, Jan 03, 2024 at 06:39:29PM -0500, Tom Lane wrote:
> I am not real sure what is happening there, but I see that c161ab74f
> changed some details of how that test works, so I wonder if it's
> responsible for these failures.  The timing isn't a perfect match,
> since this commit went in two weeks ago, but I don't see any
> more-recent commits that seem like plausible explanations.

Likely the INSERT query on retain_test that has been removed from the
test is impacting the slot conflict analysis that we'd expect.
--
Michael


signature.asc
Description: PGP signature


doing also VM cache snapshot and restore with pg_prewarm, having more information of the VM inside PostgreSQL

2024-01-03 Thread Cedric Villemain

Hi,

for 15 years pgfincore has been sitting quietly, being used in large
setups to help with HA and resource management.
It can perfectly well stay as is; to be honest, I was expecting to one day
add Windows support and propose it for inclusion in PostgreSQL, but it
appears that supporting Linux and BSD is more than enough today.


So I wonder if there is interest in having virtual memory snapshot and
restore operations with, for example, pg_prewarm/autoprewarm?


Some use cases covered: snapshot/restore cache around cron jobs, around
dumps, switchover, failover, on stop/start of postgres (think kernel
upgrade with a cold restart), ...


pgfincore also provides some nice information via mincore (on FreeBSD
mincore is more interesting) or cachestat. Again, it can remain an
out-of-tree extension, but I will be happy to add it to the commitfest if
there is interest from the community.

An example of cachestat output:

postgres=# select * from vm_relation_cachestat('foo', range:=1024*32);
 block_start | block_count | nr_pages | nr_cache | nr_dirty | nr_writeback | nr_evicted | nr_recently_evicted
-------------+-------------+----------+----------+----------+--------------+------------+---------------------
           0 |       32768 |    65536 |    62294 |        0 |            0 |       3242 |                3242
       32768 |       32768 |    65536 |    39279 |        0 |            0 |      26257 |               26257
       65536 |       32768 |    65536 |    22516 |        0 |            0 |      43020 |               43020
       98304 |       32768 |    65536 |    24944 |        0 |            0 |      40592 |               40592
      131072 |        1672 |     3344 |      487 |        0 |            0 |       2857 |                2857



Comments?

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D


Re: pg_upgrade and logical replication

2024-01-03 Thread Michael Paquier
On Wed, Jan 03, 2024 at 03:18:50PM +0530, Amit Kapila wrote:
> I think it would be good to finish the pending patch to improve the
> IsBinaryUpgrade check [1] which we decided to do once this patch is
> ready. Would you like to take that up or do you want me to finish it?
>
> [1] - https://www.postgresql.org/message-id/ZU2TeVkUg5qEi7Oy%40paquier.xyz
> [2] - https://www.postgresql.org/message-id/ZVQtUTdJACnsbbpd%40paquier.xyz

Yep, that's on my TODO.  I can send a new version at the beginning of
next week.  No problem.
--
Michael


signature.asc
Description: PGP signature


Re: Adding facility for injection points (or probe points?) for more advanced tests

2024-01-03 Thread Michael Paquier
On Tue, Jan 02, 2024 at 11:14:56PM -0600, Nathan Bossart wrote:
> Should we specify HASH_FIXED_SIZE, too?  This hash table will be in the
> main shared memory segment and therefore won't be able to expand too far
> beyond the declared maximum size.

Good point.

> I'm wondering how important it is to cache the callbacks locally.
> load_external_function() won't reload an already-loaded library, so AFAICT
> this is ultimately just saving a call to dlsym().

This keeps a copy of a callback under the same address space, and I
guess that it would matter if the code where a callback is added gets
very hot, because this means fewer function pointers.  At the end I
would keep the cache, as the code to handle it is neither complex nor
long, while being isolated in its own paths.

> I think  is supposed to be  here.

Okay.

> 0003 and 0004 add tests to the test_injection_points module.  Is the idea
> that we'd add any tests that required injection points here?  I think it'd
> be better if we could move the tests closer to the logic they're testing,
> but perhaps that is difficult because you also need to define the callback
> functions somewhere.  Hm...

Yeah.  Agreed that the final result should not have these tests in the
module test_injection_points.  What I was thinking here is to move
002_invalid_checkpoint_after_promote.pl to src/test/recovery/ and pull
the module with the callbacks with an EXTRA_INSTALL.
--
Michael


signature.asc
Description: PGP signature


Re: SET ROLE x NO RESET

2024-01-03 Thread Nico Williams
On Tue, Jan 02, 2024 at 12:36:38PM -0500, Robert Haas wrote:
> IMHO, the best solution here would be a protocol message to change the
> session user. The pooler could use that repeatedly on the same
> session, but refuse to propagate such messages from client
> connections.

But this requires upgrading clients too.

IMO `SET ROLE .. NO RESET` would be terribly useful.  One could build:

 - login systems (e.g., bearer tokens, passwords) in SQL / PlPgSQL / etc

 - sudo-like things

Though maybe `NO RESET` isn't really needed to build these, since after
all one could use an unprivileged role and a SECURITY DEFINER function
that does the `SET ROLE` following some user-defined authentication
method, and so what if the client can RESET the role, since that brings
it back to the otherwise unprivileged role.

Who needs to RESET roles anyways?  Answer: connection pools, but not
every connection is used via a pool.  This brings up something: attempts
to reset a NO RESET session need to fail in such a way that a connection
pool can detect this and disconnect, or else it needs to fail by
terminating the connection altogether.
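Concretely, something like the following (a sketch only: the NO RESET syntax
is this thread's proposal and does not exist today; app_user is hypothetical):

SET ROLE app_user NO RESET;
RESET ROLE;   -- with NO RESET in effect, this would need to fail detectably
              -- (or terminate the connection), as described above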

Nico
-- 




Re: SET ROLE x NO RESET

2024-01-03 Thread Nico Williams
On Sat, Dec 30, 2023 at 10:16:59AM -0600, Eric Hanson wrote:
> What do you think of adding a NO RESET option to the SET ROLE command?

I've wanted this forever.  Consider using this to implement user
authentication mechanisms in user-defined SQL functions that use `SET
ROLE` with `NO RESET` to "login" the user.  One could implement JWT (or
whatever bearer token schemes) on the server side in PlPgSQL w/ pgcrypto
this way, with zero changes to PG itself, no protocol changes, etc.

For bearer token schemes one could acquire the token externally to the
client and then just `SELECT login(?)`, bind the token, and execute to
login.

Nico
-- 




Re: Add a perl function in Cluster.pm to generate WAL

2024-01-03 Thread Tom Lane
Michael Paquier  writes:
> I have added a comment about pg_logical_emit_message() being in
> non-transactional mode and the flush implied by pg_switch_wal() as
> that's important, edited a bit the whole, then applied the patch.

Buildfarm member skink has failed 3 times in
035_standby_logical_decoding.pl in the last couple of days:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2020%3A07%3A15

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-03%2017%3A09%3A27

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-01-01%2020%3A10%3A18

These all look like

# poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where 
datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f

although it's notable that two different tests are involved
(vacuum vs. vacuum full).

I am not real sure what is happening there, but I see that c161ab74f
changed some details of how that test works, so I wonder if it's
responsible for these failures.  The timing isn't a perfect match,
since this commit went in two weeks ago, but I don't see any
more-recent commits that seem like plausible explanations.

regards, tom lane




Re: WIP Incremental JSON Parser

2024-01-03 Thread Nico Williams
On Tue, Jan 02, 2024 at 10:14:16AM -0500, Robert Haas wrote:
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create an 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.

You could also build a streaming incremental parser.  That is, one that
outputs a path and a leaf value (where leaf values are scalar values,
`null`, `true`, `false`, numbers, and strings).  Then if the caller is
doing something JSONPath-like then the caller can probably immediately
free almost all allocations and even terminate the parse early.

Nico
-- 




Re: add function argument names to regex* functions.

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 5:05 PM, Dian Fay wrote:
>>> Another possibility is `index`, which is relatively short and not a
>>> reserved keyword ^1. `position` is not as precise but would avoid the
>>> conceptual overloading of ordinary indices.
>>
>> I'm not a fan of "index" since that leaves the question of
>> whether it's 0 or 1 based. "Position" is a bit better, but I think
>> Jian's suggestion of "occurance" is best.
>
> We do have precedent for one-based `index` in Postgres: array types are
> 1-indexed by default! "Occurrence" removes that ambiguity but it's long
> and easy to misspell (I looked it up after typing it just now and it
> _still_ feels off).
>
> How's "instance"?

Presumably someone referencing arguments by name would have just looked up
the names via \df or whatever, so presumably misspelling wouldn't be a big
issue. But I think "instance" is OK as well.

-- 
Jim Nasby, Data Architect, Austin TX





Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut  wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size signficantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.

In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth the consideration without taking
into account the as-yet unimplemented binary format.

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It splits out into the following
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me.  By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
 , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
 , sum(pg_column_size(ev_action)) as "compressed"
 , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
-|---+++-
 master  |   7545359 | 761856 | 573307 | 2998712
 0001|   7365135 | 622592 | 438224 | 1943772
 0002|   7258639 | 573440 | 401660 | 1835803
 0003|   7258639 | 565248 | 386211 | 1672539
 0004|   7176719 | 483328 | 317099 | 1316552
 0005|   7176719 | 483328 | 315556 | 1300420
 0006|   7160335 | 466944 | 302806 | 1208621
 0007|   7143951 | 450560 | 287659 | 1187237

While looking through the data, I noticed the larger views now consist
to a significant degree of range table entries, specifically the
Alias and Var nodes (which are mostly repeated and/or repetitive
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v1-0001-pg_node_tree-Don-t-serialize-fields-with-type-def.patch
Description: Binary data


v1-0002-pg_node_tree-reset-node-location-before-catalog-s.patch
Description: Binary data


v1-0005-NodeSupport-Don-t-emit-trailing-0s-in-outDatum.patch
Description: Binary data


v1-0004-NodeSupport-add-some-more-default-markers-for-var.patch
Description: Binary data


v1-0003-Nodesupport-add-support-for-custom-default-values.patch
Description: Binary data


v1-0007-NodeSupport-Apply-RLE-and-differential-encoding-o.patch
Description: Binary data


v1-0006-NodeSupport-Apply-some-more-defaults-serializatio.patch
Description: Binary data


Change prefetch and read strategies to use range in pg_prewarm ... and raise a question about posix_fadvise WILLNEED

2024-01-03 Thread Cedric Villemain

Hi,

I wonder what you think of making pg_prewarm use the recent additions
around smgrprefetch and smgrreadv?



In order to try it out, I did it anyway in the attached patches. They contain
no doc update, but I will proceed if there is interest.


In summary:

1. The first one adds a new check on parameters (checking that the last
block is indeed not before the first block).

The consequence is that an ERROR is raised instead of silently doing
nothing (see the example after this list).

2. The second one implements smgrprefetch with a range and loops by
default per segment, to still have a check for interrupts.


3. The third one provides smgrreadv instead of smgrread,  by default on 
a range of 8 buffers. I am absolutely unsure that I used readv correctly...
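
As an illustration of point 1 (the relation name is hypothetical):

```
-- with the first patch this now raises an ERROR instead of silently doing nothing
select pg_prewarm('foo', 'prefetch', 'main', 10, 5);
```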


Q: posix_fadvise may not work exactly the way you think it does, or does it?



In details, and for the question:

It's not so obvious that the "feature" is really required or wanted,
depending on what the expectations are from the user's point of view.


The kernel decides what to do with posix_fadvise calls, and how we
pass parameters does impact the decision.
With the current situation where prefetch is done step by step, block by
block, the blocks are very probably all loaded most of the time, even if
those from the beginning of the relation can be discarded by the end of
the prefetch.


However, if instead you provide a real range, or the magic len=0, to
posix_fadvise, then blocks are loaded "more" according to effective VM
pressure (which is not the case in the previous example).
As a result only a small part of the relation might be loaded, and this
is probably not what end users expect, despite probably being a good
choice (you can still free the cache beforehand to help the kernel).


An example: below I'm using vm_relation_cachestat(), which provides Linux
cachestat output, vm_relation_fadvise() to unload the cache, and
pg_prewarm for the demo:


# clear cache: (nr_cache is the number of file system pages in cache, 
not postgres blocks)


```
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-+-+--+--
  0 |   32768 |    65536 |    0
  32768 |   32768 |    65536 |    0
  65536 |   32768 |    65536 |    0
  98304 |   32768 |    65536 |    0
 131072 |    1672 | 3344 |    0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm

132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-+-+--+--
  0 |   32768 |    65536 |  320
  32768 |   32768 |    65536 |    0
  65536 |   32768 |    65536 |    0
  98304 |   32768 |    65536 |    0
 131072 |    1672 | 3344 |  320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral 
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from 
vm_relation_cachestat('foo',range:=1024*32);

block_start | block_count | nr_pages | nr_cache
-+-+--+--
  0 |   32768 |    65536 |    65536
  32768 |   32768 |    65536 |    65536
  65536 |   32768 |    65536 |    65536
  98304 |   32768 |    65536 |    65536
 131072 |    1672 | 3344 | 3344

```

The duration of the last example is also really significant: the full
relation takes 0.3ms and block by block takes 1550ms!
You might think it's because of generate_series or whatever, but I have
the exact same behavior with pgfincore.
I can compare loading and unloading durations for similar "async" work;
here each call is from block 0 with a len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8kB at a time).
So they have exactly the same number of operations doing DONTNEED or
WILLNEED, but distinct durations on the first "load":


```

postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');

vm_relation_fadvise
-

(1 row)

Time: 25.202 ms
postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');

vm_relation_fadvise
-

(1 row)

Time: 1523.636 ms (00:01.524) <- not free !
postgres=# select * from 
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');

vm_relation_fadvise
-

(1 row)

Time: 24.967 ms
```

Thank you for your time reading this longer-than-expected email.

Comments?


---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D
From eeb24acd782f1dc2afff80fccfcec442521b9622 Mon Sep 17 0

Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 10:25 AM, Cédric Villemain wrote:
> Hi Palak,
>
> I did a quick review of the patch:
>
> +CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
> +RETURNS bool
> +AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
> +LANGUAGE C PARALLEL SAFE;
>
> --> Not enforced anywhere, but you can also add a comment to
> the function, for end users...

The arguments should also have names...

> +    force = PG_GETARG_BOOL(1);
>
> I think you also need to test PG_ARGISNULL with force parameter.


Actually, that's true for the first argument as well. Or, just mark
the function as STRICT.
-- 
Jim Nasby, Data Architect, Austin TX
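
Putting those review comments together, the declaration could end up looking
something like this (the argument names are only illustrative):

CREATE FUNCTION pg_buffercache_invalidate(IN buffernum int, IN force bool DEFAULT true)
RETURNS bool
AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
LANGUAGE C STRICT PARALLEL SAFE;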
  





Re: add function argument names to regex* functions.

2024-01-03 Thread Dian Fay
> > Another possibility is `index`, which is relatively short and not a
> > reserved keyword ^1. `position` is not as precise but would avoid the
> > conceptual overloading of ordinary indices.
>
> I'm not a fan of "index" since that leaves the question of
> whether it's 0 or 1 based. "Position" is a bit better, but I think
> Jian's suggestion of "occurance" is best.

We do have precedent for one-based `index` in Postgres: array types are
1-indexed by default! "Occurrence" removes that ambiguity but it's long
and easy to misspell (I looked it up after typing it just now and it
_still_ feels off).

How's "instance"?




Re: Reducing output size of nodeToString

2024-01-03 Thread Matthias van de Meent
On Wed, 3 Jan 2024 at 03:02, David Rowley  wrote:
>
> On Thu, 14 Dec 2023 at 19:21, Matthias van de Meent
>  wrote:
> >
> > On Thu, 7 Dec 2023 at 13:09, David Rowley  wrote:
> > > We could also easily serialize plans to binary format for copying to
> > > parallel workers rather than converting them to a text-based
> > > serialized format. It would also allow us to do things like serialize
> > > PREPAREd plans into a nicely compact single allocation that we could
> > > just pfree in a single pfree call on DEALLOCATE.
> >
> > I'm not sure what benefit you're refering to. If you mean "it's more
> > compact than the current format" then sure; but the other points can
> > already be covered by either the current nodeToString format, or by
> > nodeCopy-ing the prepared plan into its own MemoryContext, which would
> > allow us to do essentially the same thing.
>
> There's significantly less memory involved in just having a plan
> serialised into a single chunk of memory vs a plan stored in its own
> MemoryContext.  With the serialised plan, you don't have any power of
> 2 rounding up wastage that aset.c does and don't need extra space for
> all the MemoryChunks that would exist for every single palloc'd chunk
> in the MemoryContext version.

I was envisioning this to use the Bump memory context you proposed
over in [0], as to the best of my knowledge prepared plans are not
modified, so nodeCopy-ing a prepared plan into bump context could be a
good use case for those contexts. This should remove the issue of
rounding and memorychunk wastage in aset.

> I think it would be nice if one day in the future if a PREPAREd plan
> could have multiple different plans cached. We could then select which
> one to use by looking at statistics for the given parameters and
> choose the plan that's most suitable for the given parameters.   Of
> course, this is a whole entirely different project. I mention it just
> because being able to serialise a plan would make the memory
> management and overhead for such a feature much more manageable.
> There'd likely need to be some eviction logic in such a feature as the
> number of possible plans for some complex query is quite likely to be
> much more than we'd care to cache.

Yeah, that'd be nice, but is also definitely future work.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: 
https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM%3DBp%3D6xjL_nLys4zdZOpfNyWBk97Xrdj2w%40mail.gmail.com




Re: Password leakage avoidance

2024-01-03 Thread Jim Nasby


  
  
On 1/3/24 7:53 AM, Robert Haas wrote:
> Also, +1 for the general idea. I don't think this is a whole answer to
> the problem of passwords appearing in log files because (1) you have
> to be using libpq in order to make use of this and (2) you have to
> actually use it instead of just doing something else and complaining
> about the problem. But neither of those things is a reason not to have
> it. There's no reason why a sophisticated user who goes through libpq
> shouldn't have an easy way to do this instead of being asked to
> reimplement it if they want the functionality.

ISTM the only way to really move the needle (short of removing all SQL
support for changing passwords) would be a GUC that allows disabling
the use of SQL for setting passwords. While that doesn't prevent
leakage, it does at least force users to use a secure method of
setting passwords.

-- 
Jim Nasby, Data Architect, Austin TX
  





Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-03 Thread Jelte Fennema-Nio
On Wed, 3 Jan 2024 at 23:13, Tom Lane  wrote:
> I like Nathan's wording.

To be clear, I don't want to block this patch on the wording of that
single comment. So, if you feel Nathan's wording was better, I'm fine
with that too. But let me respond to your arguments anyway:

> Your assertion is contradicted by cases as
> obvious as -O0

My suggestion specifically mentions optimizing compilers, -O0 is by
definition not an optimizing compiler.

> just how far back might gcc choose to do that unrolling?

gcc 5.1 and clang 3.0 (possibly earlier, but this is the oldest I was
able to test the code with on godbolt). As seen upthread:

> I did some testing on godbolt.org and both versions of the macros
> result in the same assembly when compiling with -O2 (and even -O1)
> when compiling with ancient versions of gcc (5.1) and clang (3.0):
> https://godbolt.org/z/WqfTbhe4e

> Does the size of the loop body matter?)

I copy pasted a simple printf ~800 times and the answer seems to be
no, it doesn't matter: https://godbolt.org/z/EahYPa8KM




Re: add function argument names to regex* functions.

2024-01-03 Thread Jim Nasby


  
  
On 1/1/24 12:05 PM, Dian Fay wrote:
> I agree that the parameter name `n` is not ideal. For example, in
> `regexp_replace` it's easy to misinterpret it as "make up to n
> replacements". This has not been a problem when `n` only lives in the
> documentation which explains exactly what it does, but that context is
> not readily available in code expressing `n => 3`.

Agreed; IMO it's worth diverging from what Oracle has done here.

> Another possibility is `index`, which is relatively short and not a
> reserved keyword ^1. `position` is not as precise but would avoid the
> conceptual overloading of ordinary indices.

I'm not a fan of "index" since that leaves the question of whether it's
0 or 1 based. "Position" is a bit better, but I think Jian's suggestion
of "occurance" is best.

--
Jim Nasby, Data Architect, Austin TX
  





Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-03 Thread Tom Lane
Jelte Fennema-Nio  writes:
> The "we expect" reads to me as if we're not very sure that compilers
> do this optimization. Even though we are quite sure. Maybe some small
> changes like this to clarify that.

> The outer loop only does a single iteration, so we expect that **any**
> optimizing compilers will unroll it, thereby optimizing it away. **We
> know for sure that gcc and clang do this optimization.**

I like Nathan's wording.  Your assertion is contradicted by cases as
obvious as -O0, and I'm sure a lot of other holes could be poked in it
as well (e.g., just how far back might gcc choose to do that unrolling?
Does the size of the loop body matter?)

regards, tom lane




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-03 Thread Nathan Bossart
On Wed, Jan 03, 2024 at 10:57:07PM +0100, Jelte Fennema-Nio wrote:
> Overall your light edits look good to me. The commit message is very
> descriptive and I like the shortening of the comments. The only thing
> I feel is that this sentence lost some of my original intent:
> 
> + * different types.  The outer loop only does a single iteration, so we 
> expect
> + * optimizing compilers will unroll it, thereby optimizing it away.
> 
> The "we expect" reads to me as if we're not very sure that compilers
> do this optimization. Even though we are quite sure. Maybe some small
> changes like this to clarify that.
> 
> The outer loop only does a single iteration, so we expect that **any**
> optimizing compilers will unroll it, thereby optimizing it away. **We
> know for sure that gcc and clang do this optimization.**

WFM.  Thanks for reviewing the edits.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-03 Thread Jelte Fennema-Nio
On Wed, 3 Jan 2024 at 20:55, Nathan Bossart  wrote:
>
> I spent some time preparing this for commit, which only amounted to some
> light edits.  I am posting a new version of the patch in order to get one
> more round of cfbot coverage and to make sure there is no remaining
> feedback.

Overall your light edits look good to me. The commit message is very
descriptive and I like the shortening of the comments. The only thing
I feel is that this sentence lost some of my original intent:

+ * different types.  The outer loop only does a single iteration, so we expect
+ * optimizing compilers will unroll it, thereby optimizing it away.

The "we expect" reads to me as if we're not very sure that compilers
do this optimization. Even though we are quite sure. Maybe some small
changes like this to clarify that.

The outer loop only does a single iteration, so we expect that **any**
optimizing compilers will unroll it, thereby optimizing it away. **We
know for sure that gcc and clang do this optimization.**




Re: Things I don't like about \du's "Attributes" column

2024-01-03 Thread Pavel Luzanov

On 02.01.2024 22:38, Robert Haas wrote:


> To add to that a bit, I would probably never ask a user to give me the
> output of \du to troubleshoot some issue. I would either ask them for
> pg_dumpall -g output, or I'd ask them to give me the raw contents of
> pg_authid. That's because I know that either of those things are going
> to tell me about ALL the properties of the role, or at least all of
> the properties of the role that are stored in pg_authid, without
> omitting anything that some hacker thought was uninteresting. I don't
> know that \du is going to do that, and I'm not going to want to read
> the code to figure out which cases it thinks are uninteresting,
> *especially* if it behaves differently by version.


\d commands are a convenient way to see the contents of the system
catalogs. Short commands, instead of long SQL queries. Imo, this is their
main purpose.

Interpreting values ('No connection' instead of 0 and so on)
can be useful if the actual values are easy to identify. If there is
doubt whether it will be clear, then it is better to show it as is.
The documentation contains a description of the system catalogs.
It tells you how to interpret the values correctly.



> The counterargument is that if you don't omit anything, the output
> gets very long, which is a problem, because either you go wide, and
> then you get wrapping, or you use multiple-lines, and then if there
> are 500 users the output goes on forever.


This can be mostly solved by using extended mode. Key properties for \du,
all others for \du+. Also \du+ can be used with \x.
Of course, the question arises as to which properties are key and
which are not. Here we need to reach a compromise.


> Personally,
> I'd assume that if CONNECTION LIMIT isn't mentioned, it's unlimited.
> But a lot of the other options are less clear. Probably NOSUPERUSER is
> the default and SUPERUSER is the exception, but it's very unclear
> whether LOGIN or NOLOGIN should be treated as the "normal" case,
> given that the feature encompasses users and groups and non-login
> roles that people access via SET ROLE and things that look like groups
> but are also used as login roles.
>
> And with some of the other options, it's just harder to remember
> whether there's a default and what it is exactly than for other object
> types.


psql provides a handy tool for solving such questions - the ECHO_HIDDEN
variable. But it is very important that the query text is easily
transformed into the command output.
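For example, a sketch of how that plays out (the exact banner and the column
list of the generated query vary by psql version and would change with this
patch):

```
postgres=# \set ECHO_HIDDEN on
postgres=# \du
********* QUERY **********
SELECT r.rolname, r.rolsuper, r.rolinherit, r.rolcreaterole, ...
FROM pg_catalog.pg_roles r
WHERE r.rolname !~ '^pg_'
ORDER BY 1;
**************************
...
```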


Proposed patch tries to implement this approach.

--
Pavel Luzanov
Postgres Professional: https://postgrespro.com


Re: More new SQL/JSON item methods

2024-01-03 Thread Peter Eisentraut

On 03.01.24 13:01, Peter Eisentraut wrote:
> On 07.12.23 14:24, Jeevan Chalke wrote:
>> We have the same issue with integer conversion and need a fix.
>>
>> Unfortunately, I was using int8in() for the conversion of numeric
>> values. We should be using numeric_int8() instead. However, there is
>> no opt_error version of the same.
>>
>> So, I have introduced a numeric_int8_opt_error() version just like we
>> have one for int4, i.e. numeric_int4_opt_error(), to suppress the
>> error. These changes are in the 0001 patch. (All other patch numbers
>> are now increased by 1)
>>
>> I have used this new function to fix this reported issue and used
>> numeric_int4_opt_error() for integer conversion.
>
> I have committed the 0001 and 0002 patches for now.
>
> The remaining patches look reasonable to me, but I haven't reviewed them
> in detail.


The 0002 patch had to be reverted, because we can't change the order of 
the enum values in JsonPathItemType.  I have instead committed a 
different patch that adjusts the various switch cases to observe the 
current order of the enum.  That also means that the remaining patches 
that add new item methods need to add the new enum values at the end and 
adjust the rest of their code accordingly.






Re: Add new for_each macros for iterating over a List that do not require ListCell pointer

2024-01-03 Thread Nathan Bossart
I spent some time preparing this for commit, which only amounted to some
light edits.  I am posting a new version of the patch in order to get one
more round of cfbot coverage and to make sure there is no remaining
feedback.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From f91cac27a322ee3f60c93354f03da2ccb3ef11e2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Jan 2024 12:44:45 -0600
Subject: [PATCH v8 1/1] Add macros for looping through a List without a
 ListCell.

Many foreach loops only use the ListCell pointer to retrieve the
content of the cell, like so:

ListCell   *lc;

foreach(lc, mylist)
{
int myint = lfirst_int(lc);

... et cetera ...
}

This ListCell pointer serves little purpose, except perhaps to
check whether we actually completed the loop, in which case the
pointer will be set to NULL.  However, most existing foreach loops
don't do that.

This commit adds a few convenience macros that automatically
declare the loop variable and retrieve the current cell's contents.
This allows us to rewrite the previous loop like this:

foreach_int(myint, mylist)
{
... et cetera ...
}

This commit also adjusts a few existing loops in order to add
coverage for the new/adjusted macros.  There is presently no plan
to bulk update all foreach loops in the code base, as that could
introduce significant back-patching pain.  Instead, these macros
are primarily intended for use in new code.

Author: Jelte Fennema-Nio
Reviewed-by: TODO
Discussion: https://postgr.es/m/CAGECzQSwXKnxGwW1_Q5JE%2B8Ja20kyAbhBHO04vVrQsLcDciwXA%40mail.gmail.com
---
 src/backend/executor/execExpr.c |  9 +--
 src/backend/replication/logical/relation.c  |  4 +-
 src/backend/replication/logical/tablesync.c |  6 +-
 src/backend/replication/pgoutput/pgoutput.c |  7 +--
 src/include/nodes/pg_list.h | 68 ++---
 5 files changed, 70 insertions(+), 24 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2c62b0c9c8..e06886a733 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -216,7 +216,6 @@ ExecInitQual(List *qual, PlanState *parent)
 	ExprState  *state;
 	ExprEvalStep scratch = {0};
 	List	   *adjust_jumps = NIL;
-	ListCell   *lc;
 
 	/* short-circuit (here and in ExecQual) for empty restriction list */
 	if (qual == NIL)
@@ -250,10 +249,8 @@ ExecInitQual(List *qual, PlanState *parent)
 	scratch.resvalue = &state->resvalue;
 	scratch.resnull = &state->resnull;
 
-	foreach(lc, qual)
+	foreach_ptr(Expr, node, qual)
 	{
-		Expr	   *node = (Expr *) lfirst(lc);
-
 		/* first evaluate expression */
 		ExecInitExprRec(node, state, &state->resvalue, &state->resnull);
 
@@ -265,9 +262,9 @@ ExecInitQual(List *qual, PlanState *parent)
 	}
 
 	/* adjust jump targets */
-	foreach(lc, adjust_jumps)
+	foreach_int(jump, adjust_jumps)
 	{
-		ExprEvalStep *as = &state->steps[lfirst_int(lc)];
+		ExprEvalStep *as = &state->steps[jump];
 
 		Assert(as->opcode == EEOP_QUAL);
 		Assert(as->d.qualexpr.jumpdone == -1);
diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index a581ac9a4d..41834a2554 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -746,11 +746,9 @@ static Oid
 FindUsableIndexForReplicaIdentityFull(Relation localrel, AttrMap *attrmap)
 {
 	List	   *idxlist = RelationGetIndexList(localrel);
-	ListCell   *lc;
 
-	foreach(lc, idxlist)
+	foreach_oid(idxoid, idxlist)
 	{
-		Oid			idxoid = lfirst_oid(lc);
 		bool		isUsableIdx;
 		Relation	idxRel;
 		IndexInfo  *idxInfo;
diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4d056c16c8..fdd674adeb 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -1036,11 +1036,11 @@ fetch_remote_table_info(char *nspname, char *relname,
 
 		/* Build the pubname list. */
 		initStringInfo(&pub_names);
-		foreach(lc, MySubscription->publications)
+		foreach_node(String, pubstr, MySubscription->publications)
 		{
-			char	   *pubname = strVal(lfirst(lc));
+			char	   *pubname = strVal(pubstr);
 
-			if (foreach_current_index(lc) > 0)
+			if (foreach_current_index(pubstr) > 0)
 appendStringInfoString(&pub_names, ", ");
 
 			appendStringInfoString(&pub_names, quote_literal_cstr(pubname));
diff --git a/src/backend/replication/pgoutput/pgoutput.c b/src/backend/replication/pgoutput/pgoutput.c
index 25a95076cf..37450b55f9 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -2234,7 +2234,6 @@ cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
 {
 	HASH_SEQ_STATUS hash_seq;
 	RelationSyncEntry *entry;
-	ListCell   *lc;
 
 	Assert(RelationSyncCache != NULL);
 
@@ -2247,15 +2246,15 @@ cleanup_rel_sync_cache(TransactionId xid, bool is_commit)
 		

Re: Set log_lock_waits=on by default

2024-01-03 Thread Jeremy Schneider
On 12/21/23 6:58 AM, Nikolay Samokhvalov wrote:
> On Thu, Dec 21, 2023 at 05:29 Laurenz Albe wrote:
> 
> Here is a patch to implement this.
> Being stuck behind a lock for more than a second is almost
> always a problem, so it is reasonable to turn this on by default.
> 
> 
> I think it's a very good idea. On all heavily loaded systems I have
> observed so far, we always have turned it on. 1s (default
> deadlock_timeout) is quite large value for web/mobile apps, meaning that
> default frequency of logging is quite low, so any potential suffering
> from observer effect doesn't happen -- saturation related active session
> number happens much, much earlier, even if you have very slow disk IO
> for logging.

FWIW, enabling this setting has also been a long-time "happiness hint"
that I've passed along to people.

What would be the worst case amount of logging that we're going to
generate at scale? I think the worst case would largely scale according
to connection count? So if someone had a couple thousand backends on a
busy top-end system, then I guess they might generate up to a couple
thousand log messages every second or two under load after this
parameter became enabled with a 1 second threshold?

I'm not aware of any cases where enabling this parameter with a 1 second
threshold overwhelmed the logging collector (unlike, for example,
log_statement=all) but I wanted to pose the question in the interest of
being careful.
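For context, the settings involved look like this (as things stand today, the
lock-wait logging threshold is shared with deadlock detection):

```
# postgresql.conf sketch
log_lock_waits = on      # the patch proposes making 'on' the default
deadlock_timeout = 1s    # default; also the threshold after which lock waits are logged
```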


> At the same time, I like the idea by Robert to separate the logging of lock
> waits and the deadlock_timeout logic -- the current implementation is
> quite confusing for new users. I also had cases when people wanted to
> log lock waits earlier than deadlock detection. And also, almost always
> lock wait logging lacks information about the blocking session
> (its state, and last query, first of all), but that is maybe an off-topic
> worth another effort of improvements.

I agree with this, though it's equally true that proliferation of new
GUCs is confusing for new users. I hope the project avoids too low of a
bar for adding new GUCs. But using the deadlock_timeout GUC for this
completely unrelated log threshold really doesn't make sense.

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: SET ROLE x NO RESET

2024-01-03 Thread Jelte Fennema-Nio
On Tue, 2 Jan 2024 at 23:23, Michał Kłeczek  wrote:
> > On 2 Jan 2024, at 18:36, Robert Haas  wrote:
> > IMHO, the best solution here would be a protocol message to change the
> > session user. The pooler could use that repeatedly on the same
> > session, but refuse to propagate such messages from client
> > connections.
>
> I think that is a different use case and both are needed.


FYI I implemented something just now that's pretty much what Robert
was talking about:
https://www.postgresql.org/message-id/flat/CAGECzQR%253D1t1TL-eS9HAjoGysdprPci5K7-C353PnON6W-_s9uw%2540mail.gmail.com

> In my case I have scripts that I want to execute with limited privileges
> and make sure the scripts cannot escape the sandbox via RESET ROLE.

Depending on the desired workflow I think that could work for you too.
Because it allows you to do this (and use -f script.sql instead of -c
'select ...):

❯ psql "user=postgres _pq_.protocol_managed_params=role options='-c
role=pg_read_all_data'" -c 'select current_user; set role postgres'
   current_user
──
 pg_read_all_data
(1 row)

ERROR:  42501: parameter can only be set at the protocol level "role"
LOCATION:  set_config_with_handle, guc.c:3583
Time: 0.667 ms




Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-03 We 10:12, Robert Haas wrote:

> On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan wrote:
>> Say we have a document with an array of 1m objects, each with a field
>> called "color". As it stands we'll allocate space for that field name 1m
>> times. Using a hash table we'd allocate space for it once. And
>> allocating the memory isn't free, although it might be cheaper than
>> doing hash lookups.
>>
>> I guess we can benchmark it and see what the performance impact of using
>> a hash table might be.
>>
>> Another possibility would be simply to have the callback free the field
>> name after use. For the parse_manifest code that could be a one-line
>> addition to the code at the bottom of json_object_manifest_field_start().
>
> Yeah. So I'm arguing that allocating the memory each time and then
> freeing it sounds cheaper than looking it up in the hash table every
> time, discovering it's there, and thus skipping the allocate/free.
>
> I might be wrong about that. It's just that allocating and freeing a
> small chunk of memory should boil down to popping it off of a linked
> list and then pushing it back on. And that sounds cheaper than hashing
> the string and looking for it in a hash bucket.




OK, cleaning up in the client code will be much simpler, so let's go 
with that for now and revisit it later if necessary.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Assorted typo fixes

2024-01-03 Thread Dagfinn Ilmari Mannsåker
Robert Haas  writes:

> On Mon, Jan 1, 2024 at 6:05 PM Dagfinn Ilmari Mannsåker
>  wrote:
>> Thanks. I've fixed the commit message (and elaborated it a bit more why
>> I think it's a valid and safe fix).
>
> Regarding 0001:
>
> - AIUI, check_decls.m4 is copied from an upstream project, so I don't
> think we should tinker with it.

It contains modified versions of a few macros from Autoconf's
general.m4¹, specifically _AC_UNDECLARED_WARNING (since renamed to
_AC_UNDECLARED_BUILTIN upstream) and _AC_CHECK_DECL_BODY.  That has
since been updated² to spell François' name correctly, so I think we
should follow suit (and maybe also check if our override is even still
necessary).

[1]: 
http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=history;f=lib/autoconf/general.m4;hb=HEAD
[2]: 
http://git.savannah.gnu.org/gitweb/?p=autoconf.git;a=commitdiff;h=8a228e9d58363ad3ebdb89a05bd77568d1d863b7

> - I'm not convinced by encrypter->encryptor
> - I'm not convinced by multidimension-aware->multidimensional-aware

I don't feel particularly strongy about these.

> - I'm not convinced by cachable->cacheable

If nothing else, consistency.  There are 13 occurrences of "cacheable"
and only three of "cachable" in the tree.

> - You corrected restorting to restarting, but I'm wondering if Andres
> intended restoring?

Yeah, re-reading the sentence that's clearly meant to be "restoring".

> Committed the rest of 0001.
>
> 0002-0005 look OK to me, so I committed those as well.

Thanks!

> With regard to 0006, we typically use indexes rather than indices as
> the plural of "index", although exceptions exist.

We (mostly) use indexes when referring to database indexes (as in btree,
gist, etc.), but indices when referring to offsets in arrays, which is
what this variable is about.

- ilmari




Re: Extension Enhancement: Buffer Invalidation in pg_buffercache

2024-01-03 Thread Cédric Villemain

Hi Palak,

I did a quick review of the patch:

+CREATE FUNCTION pg_buffercache_invalidate(IN int, IN bool default true)
+RETURNS bool
+AS 'MODULE_PATHNAME', 'pg_buffercache_invalidate'
+LANGUAGE C PARALLEL SAFE;

--> Not enforced anywhere, but you can also add a comment to the 
function, for end users...
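(For example, something along these lines could go in the extension script;
the wording is only a suggestion:)

```
-- sketch: wording is a suggestion only
COMMENT ON FUNCTION pg_buffercache_invalidate(int, bool) IS
    'Try to evict the given shared buffer, optionally flushing it first if it is dirty';
```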


+PG_FUNCTION_INFO_V1(pg_buffercache_invalidate);
+Datum
+pg_buffercache_invalidate(PG_FUNCTION_ARGS)
+{
+    Buffer        bufnum;

"Buffer blocknum" is not correct in this context I believe. Buffer is 
when you have to manage Local buffer too (negative number).
Here uint32 is probably the good choice at the end, as used in 
pg_buffercache in other places.


Also in this extension bufferid is used, not buffernum.

+    bufnum = PG_GETARG_INT32(0);

+    if (bufnum <= 0 || bufnum > NBuffers)

maybe have a look at pageinspect and its PG_GETARG_UINT32.


+    {
+        ereport(ERROR,
+                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                 errmsg("buffernum is not valid")));

https://www.postgresql.org/docs/16/error-style-guide.html makes me think
that a message like 'buffernum is not valid' can be enhanced: out of
range, cannot be negative, or exceeds the number of shared buffers?
Maybe add the value to the message.


+
+    }
+
+    /*
+     * Check whether to force invalidate the dirty buffer. The default 
value of force is true.

+     */
+
+    force = PG_GETARG_BOOL(1);

I think you also need to test PG_ARGISNULL with force parameter.
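A sketch of the kind of check meant here (unless the function is declared
STRICT, in which case null handling comes for free; the error wording is
illustrative):

```
/* sketch: reject a NULL force argument explicitly */
if (PG_ARGISNULL(1))
    ereport(ERROR,
            (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
             errmsg("force argument must not be null")));
force = PG_GETARG_BOOL(1);
```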

+/*
+ * Try Invalidating a buffer using bufnum.
+ * If the buffer is invalid, the function returns false.
+ * The function checks for dirty buffer and flushes the dirty buffer 
before invalidating.

+ * If the buffer is still dirty it returns false.
+ */
+bool
+TryInvalidateBuffer(Buffer bufnum, bool force)
+{
+    BufferDesc *bufHdr = GetBufferDescriptor(bufnum - 1);

this is not safe, GetBufferDescriptor() accepts uint, but can receive 
negative here. Use uint32 and bufferid.


+    uint32        buf_state;
+    ReservePrivateRefCountEntry();
+
+    buf_state = LockBufHdr(bufHdr);
+    if ((buf_state & BM_VALID) == BM_VALID)
+    {
+        /*
+         * The buffer is pinned therefore cannot invalidate.
+         */
+        if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+        {
+            UnlockBufHdr(bufHdr, buf_state);
+            return false;
+        }
+        if ((buf_state & BM_DIRTY) == BM_DIRTY)
+        {
+            /*
+             * If the buffer is dirty and the user has not asked to 
clear the dirty buffer return false.

+             * Otherwise clear the dirty buffer.
+             */
+            if(!force){
+                return false;

probably need to unlockbuffer here too.
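i.e. something like (sketch, mirroring the other early-return paths in this
function):

```
if (!force)
{
    UnlockBufHdr(bufHdr, buf_state);
    return false;
}
```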

+            }
+            /*
+             * Try once to flush the dirty buffer.
+             */
+            ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+            PinBuffer_Locked(bufHdr);
+            LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), 
LW_SHARED);

+            FlushBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
+ LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
+            UnpinBuffer(bufHdr);

I am unsure of this area (the code is correct, but I wonder why there is 
no static code for this part -from pin to unpin- in PostgreSQL), and 
maybe better to go with FlushOneBuffer() ?
Also it is probably required to account for the shared buffer eviction 
in some pg_stat* view or table.
Not sure how disk syncing is handled after this sequence nor if it's 
important ?



+            buf_state = LockBufHdr(bufHdr);
+            if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+            {
+                UnlockBufHdr(bufHdr, buf_state);
+                return false;
+            }
+
+            /*
+             * If its dirty again or not valid anymore give up.
+             */
+
+            if ((buf_state & (BM_DIRTY | BM_VALID)) != (BM_VALID))
+            {
+                UnlockBufHdr(bufHdr, buf_state);
+                return false;
+            }
+
+        }
+
+        InvalidateBuffer(bufHdr);
+        return true;
+    }
+    else
+    {
+        UnlockBufHdr(bufHdr, buf_state);
+        return false;
+    }


Maybe it's safe to remove the else {} ...
Maybe it's more tempting to start the big if with the following instead
(less nested):

+    if ((buf_state & BM_VALID) != BM_VALID)
+    {
+        UnlockBufHdr(bufHdr, buf_state);
+        return false;
+    }

Doc and test are absent.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D





Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-03 Thread Jelte Fennema-Nio
On Wed, 3 Jan 2024 at 00:43, Jacob Burroughs  wrote:
> What if we...

Great suggestions! Attached is a v3 version of the patchset that
implements all of this, including documentation.


v3-0001-libpq-Handle-NegotiateProtocolVersion-message-mor.patch
Description: Binary data


v3-0002-libpq-Include-minor-version-number-in-PQprotocolV.patch
Description: Binary data


v3-0004-Add-support-to-change-GUCs-at-the-protocol-level.patch
Description: Binary data


v3-0003-Bump-protocol-version-to-3.1.patch
Description: Binary data


Re: Password leakage avoidance

2024-01-03 Thread Jonathan S. Katz

On 1/2/24 7:23 AM, Sehrope Sarkuni wrote:
> Having worked on and just about wrapped up the JDBC driver patch for
> this, couple thoughts:
>
> 2. Password encoding should be split out and made available as its own
> functions. Not just as part of a wider "create _or_ alter a user's
> password" function attached to a connection. We went a step further and
> added an intermediate function that generates the "ALTER USER ...
> PASSWORD" SQL.

I agree with this. It's been a minute, but I had done some refactoring
on the backend-side to support the "don't need a connection" case for
SCRAM secret generation functions on the server-side[1]. But I think in
general we should split out the password generation functions, which
leads to:

> 5. Our SCRAM specific function allows for customizing the algo iteration
> and salt parameters. That topic came up on hackers previously[1]. Our
> high level "alterUserPassword(...)" function does not have those options
> but it is available as part of our PasswordUtil SCRAM API for advanced
> users who want to leverage it. The higher level functions have defaults
> for iteration counts (4096) and salt size (16-bytes).

This seems like a good approach -- the regular function just has the
defaults (which can be aligned to the PostgreSQL defaults) (or inherit
from the server configuration, which then requires the connection to be
present) and then have a more advanced API available.
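For reference, the existing libpq building block for the client-side hashing
piece is PQencryptPasswordConn(); a rough sketch of the pattern such a
higher-level function would wrap (error handling and proper quoting omitted;
set_password and the role name are made up for illustration):

```
#include <stdio.h>
#include <libpq-fe.h>

/* Sketch: hash the new password client-side so only the verifier is sent. */
static void
set_password(PGconn *conn, const char *role, const char *new_password)
{
    /* NULL algorithm means: use the server's password_encryption setting */
    char       *encrypted = PQencryptPasswordConn(conn, new_password, role, NULL);
    char        query[1024];
    PGresult   *res;

    if (encrypted == NULL)
        return;

    /* real code must escape role/encrypted (PQescapeIdentifier/Literal) */
    snprintf(query, sizeof(query), "ALTER USER %s PASSWORD '%s'",
             role, encrypted);
    res = PQexec(conn, query);
    PQclear(res);
    PQfreemem(encrypted);
}
```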


Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/3a9b7126-01a0-7e1a-1b2a-a76df6176725%40postgresql.org


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: add AVX2 support to simd.h

2024-01-03 Thread Nathan Bossart
On Wed, Jan 03, 2024 at 09:13:52PM +0700, John Naylor wrote:
> On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart  
> wrote:
>> I'm tempted to propose that we move forward with this patch as-is after
>> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.
> 
> That means that we would be on the hook to fix it if it breaks, even
> though nothing uses it yet in a normal build. I have pending patches
> that will break, or get broken by, this, so minus-many from me until
> there is an availability story.

How will this break your patches?  Is it just a matter of adding more AVX2
support, or something else?

If the requirement is that normal builds use AVX2, then I fear we will be
waiting a long time.  IIUC the current proposals (building multiple
binaries or adding a configuration option that maps to compiler flags)
would still be opt-in, and I'm not sure we can mandate AVX2 support for all
x86_64 builds anytime soon.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Robert Haas
On Wed, Jan 3, 2024 at 9:59 AM Andrew Dunstan  wrote:
> Say we have a document with an array of 1m objects, each with a field
> called "color". As it stands we'll allocate space for that field name 1m
> times. Using a hash table we'd allocate space for it once. And
> allocating the memory isn't free, although it might be cheaper than
> doing hash lookups.
>
> I guess we can benchmark it and see what the performance impact of using
> a hash table might be.
>
> Another possibility would be simply to have the callback free the field
> name after use. For the parse_manifest code that could be a one-line
> addition to the code at the bottom of json_object_manifest_field_start().

Yeah. So I'm arguing that allocating the memory each time and then
freeing it sounds cheaper than looking it up in the hash table every
time, discovering it's there, and thus skipping the allocate/free.

I might be wrong about that. It's just that allocating and freeing a
small chunk of memory should boil down to popping it off of a linked
list and then pushing it back on. And that sounds cheaper than hashing
the string and looking for it in a hash bucket.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2024-01-03 Thread Robert Haas
On Fri, Dec 22, 2023 at 12:00 AM Alexander Lakhin  wrote:
> My quick experiment shows that that TimestampDifferenceMilliseconds call
> always returns zero, due to it's arguments swapped.

Thanks. Tom already changed the unsigned -> int stuff in a separate
commit, so I just pushed the fixes to PrepareForIncrementalBackup,
both the one I had before, and swapping the arguments to
TimestampDifferenceMilliseconds.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-03 We 08:45, Robert Haas wrote:

> On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan wrote:
>> Yeah. One idea I had yesterday was to stash the field names, which in
>> large JSON docs tend to be pretty repetitive, in a hash table instead of
>> pstrduping each instance. The name would be valid until the end of the
>> parse, and would only need to be duplicated by the callback function if
>> it were needed beyond that. That's not the case currently with the
>> parse_manifest code. I'll work on using a hash table.
>
> IMHO, this is not a good direction. Anybody who is parsing JSON
> probably wants to discard the duplicated labels and convert other
> heavily duplicated strings to enum values or something. (e.g. if every
> record has {"color":"red"} or {"color":"green"}). So the hash table
> lookups will cost but won't really save anything more than just
> freeing the memory not needed, but will probably be more expensive.

I don't quite follow.

Say we have a document with an array of 1m objects, each with a field
called "color". As it stands we'll allocate space for that field name 1m
times. Using a hash table we'd allocate space for it once. And
allocating the memory isn't free, although it might be cheaper than
doing hash lookups.

I guess we can benchmark it and see what the performance impact of using
a hash table might be.

Another possibility would be simply to have the callback free the field
name after use. For the parse_manifest code that could be a one-line
addition to the code at the bottom of json_object_manifest_field_start().

>> The parse_manifest code does seem to pfree the scalar values it no
>> longer needs fairly well, so maybe we don't need to do anything there.
>
> Hmm. This makes me wonder if you've measured how much actual leakage
> there is?

No I haven't. I have simply theorized about how much memory we might
consume if nothing were done by the callers to free the memory.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: add AVX2 support to simd.h

2024-01-03 Thread John Naylor
On Tue, Jan 2, 2024 at 11:11 PM Nathan Bossart  wrote:
>
> Perhaps I was too optimistic about adding support for newer instructions...
>
> I'm tempted to propose that we move forward with this patch as-is after
> adding a buildfarm machine that compiles with -mavx2 or -march=x86-64-v3.

That means that we would be on the hook to fix it if it breaks, even
though nothing uses it yet in a normal build. I have pending patches
that will break, or get broken by, this, so minus-many from me until
there is an availability story.




Re: Change GUC hashtable to use simplehash?

2024-01-03 Thread John Naylor
On Tue, Jan 2, 2024 at 6:56 AM jian he  wrote:
>
> My local computer is slow, but here are the test results:
>
> select * from bench_cstring_hash_aligned(10);    7318.893 ms
> select * from bench_cstring_hash_unaligned(10); 10383.173 ms
> select * from bench_pgstat_hash(10);             4474.989 ms
> select * from bench_pgstat_hash_fh(10);          9192.245 ms
> select * from bench_string_hash(10);             2048.008 ms

This presents a 2x to 5x slowdown, so I'm skeptical this is typical --
what kind of platform is this? For starters, what CPU and compiler?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-03 Thread John Naylor
On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  wrote:

> I agree that we expose RT_LOCK_* functions and have tidstore use them,
> but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> calls part. I think that even if we expose them, we will still need to
> do something like "if (TidStoreIsShared(ts))
> shared_rt_lock_share(ts->tree.shared)", no?

I'll come back to this topic separately.

> I've attached a new patch set. From v47 patch, I've merged your
> changes for radix tree, and split the vacuum integration patch into 3
> patches: simply replaces VacDeadItems with TidsTore (0007 patch), and
> use a simple TID array for one-pass strategy (0008 patch), and replace
> has_lpdead_items with "num_offsets > 0" (0009 patch), while
> incorporating your review comments on the vacuum integration patch

Nice!

> (sorry for making it difficult to see the changes from v47 patch).

It's actually pretty clear. I just have a couple comments before
sharing my latest cleanups:

(diff'ing between v47 and v48):

--   /*
-* In the shared case, TidStoreControl and radix_tree are backed by the
-* same DSA area and rt_memory_usage() returns the value including both.
-* So we don't need to add the size of TidStoreControl separately.
-*/
if (TidStoreIsShared(ts))
-   return sizeof(TidStore) +
shared_rt_memory_usage(ts->tree.shared);
+   rt_mem = shared_rt_memory_usage(ts->tree.shared);
+   else
+   rt_mem = local_rt_memory_usage(ts->tree.local);

-   return sizeof(TidStore) + sizeof(TidStore) +
local_rt_memory_usage(ts->tree.local);
+   return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;

Upthread, I meant that I don't see the need to include the size of
these structs *at all*. They're tiny, and the blocks/segments will
almost certainly have some empty space counted in the total anyway.
The returned size is already overestimated, so this extra code is just
a distraction.

- if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD)
>= result->max_offset)

I believe this math is wrong. We care about "result->num_offsets +
BITS_PER_BITMAPWORD", right?
Also, it seems if the condition evaluates to equal, we still have
enough space, in which case ">" the max is the right condition.
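In code form, the corrected check would look something like this (a sketch
only; names as in the patch under review, and the body is a placeholder for
whatever the surrounding loop does on overflow):

```
/* equal still fits, so only strictly-greater means we are out of space */
if (result->num_offsets + BITS_PER_BITMAPWORD > result->max_offset)
    break;              /* placeholder */
```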

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxOffsetNumber)

This can now use OffsetNumberIsValid().

> 0013 to 0015 patches are also updates from v47 patch.

> I'm thinking that we should change the order of the patches so that
> tidstore patch requires the patch for changing DSA segment sizes. That
> way, we can remove the complex max memory calculation part that we no
> longer use from the tidstore patch.

I don't think there is any reason to have those calculations at all at
this point. Every patch in every version should at least *work
correctly*, without kludging m_w_m and without constraining max
segment size. I'm fine with the latter remaining in its own thread,
and I hope we can consider it an enhancement that respects the admin's
configured limits more effectively, and not a pre-requisite for not
breaking. I *think* we're there now, but it's hard to tell since 0015
was at the very end. As I said recently, if something still fails, I'd
like to know why. So for v49, I took the liberty of removing the DSA
max segment patches for now, and squashing v48-0015.

In addition for v49, I have quite a few cleanups:

0001 - This hasn't been touched in a very long time, but I ran
pgindent and clarified a comment
0002 - We no longer need to isolate the rightmost bit anywhere, so
removed that part and revised the commit message accordingly.

radix tree:
0003 - v48 plus squashed v48-0013
0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
and I fixed most of the rest.
0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
bit simpler.
0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
0008 - Removes a holdover from the multi-value leaves era.
0009 - It occurred to me that we need to have unique names for memory
contexts for different instantiations of the template. This is one way
to do it, by using the configured RT_PREFIX in the context name. I
also took an extra step to make the size class fanout show up
correctly on different platforms, but that's probably overkill and
undesirable, and I'll probably use only the class name next time.
0010/11 - Make the array functions less surprising and with more
informative names.
0012 - Restore a useful technique from Andres's prototype. This part
has been slow for a long time, so much that it showed up in a profile
where this path wasn't even taken much.

tid store / vacuum:
0013/14 - Same as v48 TID store, with review squashed
0015 - Rationalize comment and starting value.

Re: Password leakage avoidance

2024-01-03 Thread Dave Cramer
On Wed, 3 Jan 2024 at 08:53, Robert Haas  wrote:

> On Sun, Dec 24, 2023 at 12:06 PM Jonathan S. Katz 
> wrote:
> > We're likely to have new algorithms in the future, as there is a draft
> > RFC for updating the SCRAM hashes, and already some regulatory bodies
> > are looking to deprecate SHA256. My concern with relying on the
> > "encrypted_password" GUC (which is why PQencryptPasswordConn takes
> > "conn") makes it any easier for users to choose the algorithm, or if
> > they need to rely on the server/session setting.
>
> Yeah, I agree. It doesn't make much sense to me to propose that a GUC,
> which is a server-side setting, should control client-side behavior.
>
> Also, +1 for the general idea. I don't think this is a whole answer to
> the problem of passwords appearing in log files because (1) you have
> to be using libpq in order to make use of this


JDBC has it as of yesterday. I would imagine other clients will implement
it.
Dave Cramer

>
>


Re: Password leakage avoidance

2024-01-03 Thread Robert Haas
On Sun, Dec 24, 2023 at 12:06 PM Jonathan S. Katz  wrote:
> We're likely to have new algorithms in the future, as there is a draft
> RFC for updating the SCRAM hashes, and already some regulatory bodies
> are looking to deprecate SHA256. My concern with relying on the
> "encrypted_password" GUC (which is why PQencryptPasswordConn takes
> "conn") makes it any easier for users to choose the algorithm, or if
> they need to rely on the server/session setting.

Yeah, I agree. It doesn't make much sense to me to propose that a GUC,
which is a server-side setting, should control client-side behavior.

Also, +1 for the general idea. I don't think this is a whole answer to
the problem of passwords appearing in log files because (1) you have
to be using libpq in order to make use of this and (2) you have to
actually use it instead of just doing something else and complaining
about the problem. But neither of those things is a reason not to have
it. There's no reason why a sophisticated user who goes through libpq
shouldn't have an easy way to do this instead of being asked to
reimplement it if they want the functionality.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SLRU optimization - configurable buffer pool and partitioning the SLRU lock

2024-01-03 Thread Robert Haas
On Wed, Jan 3, 2024 at 12:08 AM Dilip Kumar  wrote:
> Yeah, this is indeed an interesting idea.  So I think if we are
> interested in working in this direction maybe this can be submitted in
> a different thread, IMHO.

Yeah, that's something quite different from the patch before us.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: WIP Incremental JSON Parser

2024-01-03 Thread Robert Haas
On Wed, Jan 3, 2024 at 6:57 AM Andrew Dunstan  wrote:
> Yeah. One idea I had yesterday was to stash the field names, which in
> large JSON docs tend to be pretty repetitive, in a hash table instead of
> pstrduping each instance. The name would be valid until the end of the
> parse, and would only need to be duplicated by the callback function if
> it were needed beyond that. That's not the case currently with the
> parse_manifest code. I'll work on using a hash table.

IMHO, this is not a good direction. Anybody who is parsing JSON
probably wants to discard the duplicated labels and convert other
heavily duplicated strings to enum values or something. (e.g. if every
record has {"color":"red"} or {"color":"green"}). So the hash table
lookups will cost but won't really save anything more than just
freeing the memory not needed, but will probably be more expensive.

> The parse_manifest code does seem to pfree the scalar values it no
> longer needs fairly well, so maybe we don't need to do anything there.

Hmm. This makes me wonder if you've measured how much actual leakage there is?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Show WAL write and fsync stats in pg_stat_io

2024-01-03 Thread Nazir Bilal Yavuz
Hi,

On Sun, 31 Dec 2023 at 03:58, Michael Paquier  wrote:
>
> On Tue, Dec 26, 2023 at 03:35:52PM +0300, Nazir Bilal Yavuz wrote:
> > On Tue, 26 Dec 2023 at 13:10, Michael Paquier  wrote:
> >> I am not sure while the whole point of the exercise is to have all the
> >> I/O related data in a single view.  Something that I've also found a
> >> bit disturbing yesterday while looking at your patch is the fact that
> >> the operation size is guessed from the context and object type when
> >> querying the view because now everything is tied to BLCKSZ.  This
> >> patch extends it with two more operation sizes, and there are even
> >> cases where it may be a variable.  Could it be a better option to
> >> extend pgstat_count_io_op_time() so as callers can themselves give the
> >> size of the operation?
> >
> > Do you mean removing the op_bytes column and tracking the number of
> > bytes in reads, writes, and extends? If so, that makes sense to me but
> > I don't want to remove the number of operations; I believe that has a
> > value too. We can extend the pgstat_count_io_op_time() so it can both
> > track the number of bytes and the number of operations.
>
> Apologies if my previous wording sounded confusing.  The idea I had in
> mind was to keep op_bytes in pg_stat_io, and extend it so as a value
> of NULL (or 0, or -1) is a synonym as "writes", "extends" and "reads"
> as a number of bytes.

Oh, I understand it now. Yes, that makes sense.
I thought removing op_bytes completely ( as you said "This patch
extends it with two more operation sizes, and there are even cases
where it may be a variable" ) from pg_stat_io view then adding
something like {read | write | extend}_bytes and {read | write |
extend}_calls could be better, so that we don't lose any information.

> > Also, it is not directly related to this patch but vectored IO [1] is
> > coming soon; so the number of operations could be wrong since vectored
> > IO could merge a couple of operations.
>
> Hmm.  I have not checked this patch series so I cannot say for sure,
> but we'd likely just want to track the number of bytes if a single
> operation has a non-equal size rather than registering in pg_stat_io N
> rows with different op_bytes, no?

Yes, that is correct.

> I am looping in Thomas Munro in CC for comments.

Thanks for doing that.

> > If we want to entirely disable it, we can add
> >
> > if (MyBackendType == B_WAL_RECEIVER && io_object == IOOBJECT_WAL)
> > return;
> >
> > to the top of the pgstat_count_io_op_time() since all IOOBJECT_WAL
> > calls are done by this function, then we can disable it at
> > pgstat_tracks_io_bktype().
>
> Yeah, a limitation like that may be acceptable for now.  Tracking the
> WAL writer and WAL sender activities can be relevant in a lot of cases
> even if we don't have the full picture for the WAL receiver yet.

I added that and disabled B_WAL_RECEIVER backend with comments
explaining why. v8 is attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From dc258a5b168f15c9771f174924261b151193 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Jan 2024 15:36:19 +0300
Subject: [PATCH v8] Show WAL stats on pg_stat_io (except streaming
 replication)

This patch aims to show WAL stats per backend in the pg_stat_io view.

With this patch, it can be seen how many WAL operations each backend
makes, along with their context, type and total timings, in pg_stat_io.

In this patch a new IOContext, IOCONTEXT_INIT, is introduced; it is for
IO operations done while creating the underlying objects. Currently, it
is used only together with IOObject IOOBJECT_WAL.

IOOBJECT_WAL means IO operations related to WAL.
IOOBJECT_WAL / IOCONTEXT_NORMAL means IO operations done on already
created WAL segments.
IOOBJECT_WAL / IOCONTEXT_INIT means IO operations done while creating
the WAL segments.

This patch currently covers:
- Documentation
- IOOBJECT_WAL / IOCONTEXT_NORMAL / read, write and fsync stats on
  pg_stat_io.
- IOOBJECT_WAL / IOCONTEXT_INIT / write and fsync stats on pg_stat_io.

doesn't cover:
- Streaming replication WAL IO.
---
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/pgstat.h  |   6 +-
 src/backend/access/transam/xlog.c |  58 +--
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/catalog/system_views.sql  |  15 ++-
 src/backend/utils/activity/pgstat_io.c| 119 --
 src/backend/utils/adt/pgstatfuncs.c   |  24 ++---
 src/test/regress/expected/rules.out   |  27 +++--
 src/test/regress/expected/stats.out   |  53 ++
 src/test/regress/sql/stats.sql|  25 +
 doc/src/sgml/monitoring.sgml  |  29 --
 11 files changed, 292 insertions(+), 80 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b67784731a..26197dbb817 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5785,9 +5785,9 @@
 { oid => '1136', descr => 'statistics: 

Re: add function argument names to regex* functions.

2024-01-03 Thread Peter Eisentraut

On 28.12.23 04:28, jian he wrote:
> I looked around the oracle implementation in [1], and the oracle regex
> related function argument names in [2].
> I used the doc [3] syntax explanation and added the related function names.
>
> Currently, the regex.* function syntax seems fine, but only the parameter
> `N` seems a little bit weird.
> If we change the function's argument names, we also need to change the
> function syntax explanation in the doc, and vice versa.

So, it looks like Oracle already has defined parameter names for these,
so we should make ours match.
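For what it's worth, the point of the exercise is to enable calls in named
notation; a sketch of what that would allow (the parameter names used here
are purely illustrative -- the actual names are what this thread is deciding):

```
-- parameter names below are illustrative only
SELECT regexp_replace('banana', 'an', 'AN', 1, 2);   -- positional, works today
SELECT regexp_replace(string => 'banana', pattern => 'an',
                      replacement => 'AN', start => 1, n => 2);
```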






Re: Transaction timeout

2024-01-03 Thread Andrey M. Borodin


> On 3 Jan 2024, at 16:46, Andrey M. Borodin  wrote:
> 
>  I do not understand why, but mailing list did not pick patches that I sent. 
> I'll retry.


Sorry for the noise. It seems Apple updated something in Mail.app a couple of
days ago and it started to use strange "Apple-Mail" stuff by default.
I see the patches were attached, but they were not recognized by the mailing
list archives and CFbot.
Now I've flipped everything to "plain text by default" everywhere. Hope that 
helps.


Best regards, Andrey Borodin.


v21-0001-Introduce-transaction_timeout.patch
Description: Binary data


v21-0004-fix-reschedule-timeout-for-each-commmand.patch
Description: Binary data


v21-0003-Try-to-enable-transaction_timeout-before-next-co.patch
Description: Binary data


v21-0002-Add-better-tests-for-transaction_timeout.patch
Description: Binary data


Re: More new SQL/JSON item methods

2024-01-03 Thread Peter Eisentraut

On 07.12.23 14:24, Jeevan Chalke wrote:
> We have the same issue with integer conversion and need a fix.
>
> Unfortunately, I was using int8in() for the conversion of numeric
> values. We should be using numeric_int8() instead. However, there is no
> opt_error version of the same.
>
> So, I have introduced a numeric_int8_opt_error() version just like we
> have one for int4, i.e. numeric_int4_opt_error(), to suppress the error.
> These changes are in the 0001 patch. (All other patch numbers are now
> increased by 1)
>
> I have used this new function to fix this reported issue and used
> numeric_int4_opt_error() for integer conversion.


I have committed the 0001 and 0002 patches for now.

The remaining patches look reasonable to me, but I haven't reviewed them 
in detail.





Re: WIP Incremental JSON Parser

2024-01-03 Thread Andrew Dunstan



On 2024-01-02 Tu 10:14, Robert Haas wrote:

> On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan wrote:
>> Quite a long time ago Robert asked me about the possibility of an
>> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
>> performance is significantly worse than that of the current Recursive
>> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
>> and I'll add it to the next CF to keep the conversation going.
>
> Thanks for doing this. I think it's useful even if it's slower than
> the current parser, although that probably necessitates keeping both,
> which isn't great, but I don't see a better alternative.
>
>> One possible use would be in parsing large manifest files for
>> incremental backup. However, it struck me a few days ago that this might
>> not work all that well. The current parser and the new parser both
>> palloc() space for each field name and scalar token in the JSON (unless
>> they aren't used, which is normally not the case), and they don't free
>> it, so that particularly if done in frontend code this amounts to a
>> possible memory leak, unless the semantic routines do the freeing
>> themselves. So while we can save some memory by not having to slurp in
>> the whole JSON in one hit, we aren't saving any of that other allocation
>> of memory, which amounts to almost as much space as the raw JSON.
>
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create a 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.
>
> As far as addressing that other issue, we could address the issue
> either by having the semantic routines free the memory if they don't
> need it, or alternatively by having the parser itself free the memory
> after invoking any callbacks to which it might be passed. The latter
> approach feels more conceptually pure, but the former might be the
> more practical approach. I think what really matters here is that we
> document who must or may do which things. When a callback gets passed
> a pointer, we can document either that (1) it's a palloc'd chunk that
> the callback can free if they want or (2) that it's a palloc'd chunk
> that the caller must not free or (3) that it's not a palloc'd chunk.
> We can further document the memory context in which the chunk will be
> allocated, if applicable, and when/if the parser will free it.

Yeah. One idea I had yesterday was to stash the field names, which in
large JSON docs tend to be pretty repetitive, in a hash table instead of
pstrduping each instance. The name would be valid until the end of the
parse, and would only need to be duplicated by the callback function if
it were needed beyond that. That's not the case currently with the
parse_manifest code. I'll work on using a hash table.

The parse_manifest code does seem to pfree the scalar values it no
longer needs fairly well, so maybe we don't need to do anything there.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Transaction timeout

2024-01-03 Thread Andrey M. Borodin
> On 3 Jan 2024, at 11:39, Andrey M. Borodin wrote:
>> On 1 Jan 2024, at 19:28, Andrey M. Borodin wrote:
>> 3. Check that timeout is not rescheduled by new queries (Nik's case)
>
> The test of Nik's case was not stable enough together with COMMIT AND CHAIN.
> So I've separated these cases into different permutations.
> Looking through CI logs it seems variation in sleeps and actual timeouts
> easily reach 30+ms. I'm not entirely sure we can reach 100% stable tests
> without too big timeouts.
>
> Best regards, Andrey Borodin.

I do not understand why, but mailing list did not pick patches that I sent.
I'll retry.

v21-0001-Introduce-transaction_timeout.patch
Description: Binary data


v21-0004-fix-reschedule-timeout-for-each-commmand.patch
Description: Binary data


v21-0003-Try-to-enable-transaction_timeout-before-next-co.patch
Description: Binary data


v21-0002-Add-better-tests-for-transaction_timeout.patch
Description: Binary data
Best regards, Andrey Borodin.

Re: Random pg_upgrade test failure on drongo

2024-01-03 Thread Amit Kapila
On Tue, Jan 2, 2024 at 10:30 AM Alexander Lakhin  wrote:
>
> 28.12.2023 06:08, Hayato Kuroda (Fujitsu) wrote:
> > Dear Alexander,
> >
> >> I agree with your analysis and would like to propose a PoC fix (see
> >> attached). With this patch applied, 20 iterations succeeded for me.
> > There are no other reviewers, so I will review it again. Let's move the PoC
> > to the concrete patch. Note that I only focused on fixes of random failure,
> > other parts are out-of-scope.
>
> Thinking about that fix more, I'm not satisfied with the approach proposed.
> First, it turns every unlink operation into two write operations
> (rename + unlink), not to mention new risks of having stale .tmp files
> (perhaps, it's ok for regular files (MoveFileEx can overwrite existing
> files), but not for directories)
> Second, it does that on any Windows OS versions, including modern ones,
> which are not affected by the issue, as we know.
>
> So I started to think about other approach: to perform unlink as it's
> implemented now, but then wait until the DELETE_PENDING state is gone.
>

There is a comment in the code which suggests we shouldn't wait
indefinitely. See "However, we won't wait indefinitely for someone
else to close the file, as the caller might be holding locks and
blocking other backends."
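
For illustration, a bounded wait of the kind being discussed could be shaped
like this (Windows-specific and untested; the retry count and sleep interval
are arbitrary placeholders):

    /* Sketch: after unlink(), poll until the directory entry is really
     * gone, but give up after a bounded number of attempts so we never
     * wait indefinitely while possibly holding locks. */
    static bool
    wait_for_delete_pending(const char *path)
    {
        for (int i = 0; i < 100; i++)       /* ~1s total, illustrative */
        {
            if (GetFileAttributes(path) == INVALID_FILE_ATTRIBUTES &&
                GetLastError() == ERROR_FILE_NOT_FOUND)
                return true;    /* DELETE_PENDING entry has disappeared */

            Sleep(10);          /* still present; back off a little */
        }
        return false;           /* caller decides whether that's an error */
    }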

> And I was very surprised to see that this state is not transient in our case.
> Additional investigation showed that the test fails not because some unrelated
> process opens a file (concretely, {template1_id/postgres_id}/2683), that is
> being deleted, but because of an internal process that opens the file and
> holds a handle to it indefinitely.
> And the internal process is ... background writer (BgBufferSync()).
>
> So, I tried just adding bgwriter_lru_maxpages = 0 to postgresql.conf and
> got 20 x 10 tests passing.
>
> Thus, if we want just to get rid of the test failure, maybe it's enough to
> add this to the test's config...
>

What about checkpoints? Can't the checkpointer do the same while writing the buffers?

-- 
With Regards,
Amit Kapila.




Re: Parent/child context relation in pg_get_backend_memory_contexts()

2024-01-03 Thread Melih Mutlu
Hi,

Thanks for reviewing. Please find the updated patch attached.

On Mon, Dec 4, 2023 at 07:43, torikoshia wrote:

> I reviewed v3 patch and here are some minor comments:
>
> > + 
> > +   > role="column_definition">
> > +   path int4
>
> Should 'int4' be 'int4[]'?
> Other system catalog columns such as pg_groups.grolist distinguish
> whether the type is an array or not.
>

Right! Done.


>
> > +   Path to reach the current context from TopMemoryContext.
> > Context ids in
> > +   this list represents all parents of the current context. This
> > can be
> > +   used to build the parent and child relation.
>
> It seems the last "." is not necessary, considering the other explanations
> for each field end without it.
>

Done.


> +const char *parent, int level, int
> *context_id,
> +List *path, Size
> *total_bytes_inc_chidlren)
>
> 'chidlren' -> 'children'
>

Done.


> +   elog(LOG, "pg_get_backend_memory_contexts called");
>
> Is this message necessary?
>

I guess I added this line for debugging and then forgot to remove it. It's
now removed.

> There was a warning when applying the patch:
>
>% git apply
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch:282:
>
> trailing whitespace.
>select count(*) > 0
>
> ../patch/pg_backend_memory_context_refine/v3-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch:283:
>
> trailing whitespace.
>from contexts
>warning: 2 lines add whitespace errors.
>

Fixed.

Thanks,
-- 
Melih Mutlu
Microsoft


v4-0001-Adding-id-parent_id-into-pg_backend_memory_contex.patch
Description: Binary data


Re: Switching XLog source from archive to streaming when primary available

2024-01-03 Thread Bharath Rupireddy
On Thu, Dec 28, 2023 at 5:26 PM Bharath Rupireddy
 wrote:
>
> I took a closer look at v14 and came up with the following changes:
>
> 1. Used advance_wal introduced by commit c161ab74f7.
> 2. Simplified the core logic and new TAP tests.
> 3. Reworded the comments and docs.
> 4. Simplified new DEBUG messages.
>
> I've attached the v15 patch for further review.

Per the recent commit c538592, I FATAL-ized perl warnings in the newly
added TAP test and attached the v16 patch.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v16-0001-Allow-standby-to-switch-WAL-source-from-archive-.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-01-03 Thread Bertrand Drouvot
Hi,

On Wed, Jan 03, 2024 at 04:20:03PM +0530, Amit Kapila wrote:
> On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila  wrote:
> >
> > On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila  
> > > wrote:
> > > >
> > > > >
> > > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC
> > > > > parameter, say slot_sync_dbname, to specify the database to connect.
> > > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also
> > > > > specifies it. If both don't have a dbname, raise an error.
> > > > >
> > > >
> > > > Would the users prefer to provide a value for a separate GUC instead
> > > > of changing primary_conninfo? It is possible that we can have some
> > > > users prefer to use one GUC and others prefer a separate GUC but we
> > > > should add a new GUC if we are sure that is what users would prefer.
> > > > Also, even if have to consider this option, I think we can easily
> > > > later add a new GUC to provide a dbname in addition to having the
> > > > provision of giving it in primary_conninfo.
> > >
> > > I think having two separate GUCs is more flexible for example when
> > > users want to change the dbname to connect. It makes sense that the
> > > slotsync worker wants to use the same connection string as the
> > > walreceiver uses. But I guess today most primary_conninfo settings
> > > that are set manually or are generated by tools such as pg_basebackup
> > > don't have dbname. If we require a dbname in primary_conninfo, many
> > > tools will need to be changed. Once the connection string is
> > > generated, it would be tricky to change the dbname in it, as Shveta
> > > mentioned. The users will have to carefully select the database to
> > > connect when taking a base backup.
> > >
> >
> > I see your point and agree that users need to be careful. I was trying
> > to compare it with other places like the conninfo used with a
> > subscription where no separate dbname needs to be provided. Now, here
> > the situation is not the same because the same conninfo is used for
> > different purposes (walreceiver doesn't require dbname (dbname is
> > ignored even if present) whereas slotsyncworker requires dbname). I
> > was just trying to see if we can avoid having a new GUC for this
> > purpose. Does anyone else have an opinion on this matter?
> >
> 
> Bertrand, Dilip, and others involved in this thread or otherwise, see
> if you can share an opinion on the above point because it would be
> good to get some more opinions before we decide to add a new GUC (for
> dbname) for slotsync worker.
> 

I think that as long as enable_syncslot is off, there is no need to add the
dbname in primary_conninfo (meaning there is no need to change an existing
primary_conninfo for the ones that don't use the sync slot feature).

So, given that primary_conninfo does not necessarily need to be changed (for
the ones that don't use the sync slot feature) and that adding a new GUC looks
more like a one-way-door change to me, I'd vote to keep the patch as it is (we
can still revisit this later on and add a new GUC if we feel the need based on
users' feedback).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Separate memory contexts for relcache and catcache

2024-01-03 Thread Melih Mutlu
Hi,

On Mon, Dec 4, 2023 at 07:59, torikoshia wrote:

> Hi,
>
> I also think this change would be helpful.
>
> I imagine you're working on Andres's comments and you already noticed
> this, but the v1 patch cannot be applied to HEAD.
> For the convenience of other reviewers, I marked it 'Waiting on Author'.
>

Thanks for letting me know. I rebased the patch. PFA new version.




On Thu, Oct 12, 2023 at 20:01, Andres Freund wrote:

> Hi,
>
> Have you checked what the source of the remaining allocations in
> CacheMemoryContext are?
>

It's mostly typecache, around 2K. Do you think typecache also needs a
separate context?

> Given the size of both CatCacheMemoryContext and RelCacheMemoryContext in a
> new backend, I think it might be worth using non-default aset parameters. A
> bit ridiculous to increase block sizes from 8k upwards in every single
> connection made to postgres ever.
>

Considering it starts from ~262K, what would be better for init size?
256K?
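
For illustration, the non-default parameters could be passed like this; the
256kB figure is just the number discussed above, and the parent/context names
only loosely follow the patch:

    /* Sketch: start the catcache context with a 256kB initial block so a
     * fresh backend doesn't have to grow it from 8kB; values illustrative. */
    CatCacheMemoryContext =
        AllocSetContextCreate(CacheMemoryContext,
                              "CatCacheMemoryContext",
                              0,                         /* minContextSize */
                              256 * 1024,                /* initBlockSize */
                              ALLOCSET_DEFAULT_MAXSIZE); /* maxBlockSize */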

> > +static void
> > +CreateCatCacheMemoryContext()
>
> We typically use (void) to differentiate from an older way of function
> declarations that didn't have argument types.
>
Done.

> > +{
> > + if (!CacheMemoryContext)
> > + CreateCacheMemoryContext();
>
> I wish we just made sure that cache memory context were created in the
> right
> place, instead of spreading this check everywhere...
>

That would be nice. Do you have a suggestion about where that right place
would be?

> I'd just delete these comments, they're just pointlessly restating the code.
>

Done.

Thanks,
-- 
Melih Mutlu
Microsoft


v2-0001-Separate-memory-contexts-for-relcache-and-catcach.patch
Description: Binary data


Re: remaining sql/json patches

2024-01-03 Thread jian he
some more minor issues:
SELECT * FROM JSON_TABLE(jsonb '{"a":[123,2]}', '$'
COLUMNS (item int[] PATH '$.a' error on error, foo text path '$'
error on error)) bar;
ERROR:  JSON path expression in JSON_VALUE should return singleton scalar item

The error message seems not so great, IMHO, since the JSON_TABLE doc
entries don't mention that JSON_TABLE is actually transformed into
json_value, json_query, and json_exists.

JSON_VALUE cannot specify KEEP | OMIT QUOTES, though. It might be a good
idea to mention in the doc that the default is to omit quotes, because
JSON_TABLE is actually transformed into json_value, json_query, and
json_exists, and JSON_TABLE can specify the quotes behavior freely.

Sorry to bother again: I kind of get what the function
transformJsonTableChildPlan does, but adding more comments would make it
easier to understand.

(json_query)
+This function must return a JSON string, so if the path expression
+returns multiple SQL/JSON items, you must wrap the result using the
+WITH WRAPPER clause. If the wrapper is
+UNCONDITIONAL, an array wrapper will always
+be applied, even if the returned value is already a single JSON object
+or an array, but if it is CONDITIONAL, it
will not be
+applied to a single array or object. UNCONDITIONAL
+is the default.  If the result is a scalar string, by default the value
+returned will have surrounding quotes making it a valid JSON value,
+which can be made explicit by specifying KEEP
QUOTES.
+Conversely, quotes can be omitted by specifying OMIT
QUOTES.
+The returned data_type has the
same semantics
+as for constructor functions like json_objectagg;
+the default returned type is jsonb.

+   
+Returns the result of applying the
+path_expression to the
+context_item using the
+PASSING values. The
+extracted value must be a single SQL/JSON scalar
+item. For results that are objects or arrays, use the
+json_query function instead.
+The returned data_type has the
same semantics
+as for constructor functions like json_objectagg.
+The default returned type is text.
+The ON ERROR and ON EMPTY
+clauses have similar semantics as mentioned in the description of
+json_query.
+   

+The returned data_type has the
same semantics
+as for constructor functions like json_objectagg.

IMHO, the above description is not so good: the function json_objectagg is
listed in functions-aggregate.html, so using Ctrl+F in the browser will not
find json_objectagg in functions-json.html.

For json_query, maybe we can rephrase it like:
the RETURNING clause, which specifies the data type returned. It must
be a type for which there is a cast from text to that type.
By default, the jsonb type is returned.

json_value:
the RETURNING clause, which specifies the data type returned. It must
be a type for which there is a cast from text to that type.
By default, the text type is returned.




Re: remaining sql/json patches

2024-01-03 Thread jian he
On Fri, Dec 22, 2023 at 9:01 PM jian he  wrote:
>
> Hi
>
> + /* FALLTHROUGH */
> + case JTC_EXISTS:
> + case JTC_FORMATTED:
> + {
> + Node   *je;
> + CaseTestExpr *param = makeNode(CaseTestExpr);
> +
> + param->collation = InvalidOid;
> + param->typeId = cxt->contextItemTypid;
> + param->typeMod = -1;
> +
> + if (rawc->wrapper != JSW_NONE &&
> + rawc->quotes != JS_QUOTES_UNSPEC)
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("cannot use WITH WRAPPER clause for formatted colunmns"
> + " without also specifying OMIT/KEEP QUOTES"),
> + parser_errposition(pstate, rawc->location)));
>
> typo, should be "formatted columns".
> I suspect people will be confused with the meaning of "formatted column".
> maybe we can replace this part:"cannot use WITH WRAPPER clause for
> formatted column"
> to
> "SQL/JSON  WITH WRAPPER behavior must not be specified when FORMAT
> clause is used"
>
> SELECT * FROM JSON_TABLE(jsonb '"world"', '$' COLUMNS (item text
> FORMAT JSON PATH '$' with wrapper KEEP QUOTES));
> ERROR:  cannot use WITH WRAPPER clause for formatted colunmns without
> also specifying OMIT/KEEP QUOTES
> LINE 1: ...T * FROM JSON_TABLE(jsonb '"world"', '$' COLUMNS (item text ...
>  ^
> this error is misleading, since now I am using WITH WRAPPER clause for
> formatted columns and specified KEEP QUOTES.
>

Hi. Still based on v33.
JSON_TABLE:
I also refactored the parse_jsontable.c error reporting, so now the error
message will be consistent with json_query.
Now you can specify the wrapper freely, as long as you don't specify the
wrapper and the quotes behavior at the same time.
Overall, json_table behavior is more consistent with json_query and json_value.
I also added some tests.

+void
+ExecEvalJsonCoercion(ExprState *state, ExprEvalStep *op,
+ ExprContext *econtext)
+{
+ JsonCoercion *coercion = op->d.jsonexpr_coercion.coercion;
+ ErrorSaveContext *escontext = op->d.jsonexpr_coercion.escontext;
+ Datum res = *op->resvalue;
+ bool resnull = *op->resnull;
+
+ if (coercion->via_populate)
+ {
+ void *cache = op->d.jsonexpr_coercion.json_populate_type_cache;
+
+ *op->resvalue = json_populate_type(res, JSONBOID,
+   coercion->targettype,
+   coercion->targettypmod,
+   &cache,
+   econtext->ecxt_per_query_memory,
+   op->resnull, (Node *) escontext);
+ }
+ else if (coercion->via_io)
+ {
+ FmgrInfo   *input_finfo = op->d.jsonexpr_coercion.input_finfo;
+ Oid typioparam = op->d.jsonexpr_coercion.typioparam;
+ char   *val_string = resnull ? NULL :
+ JsonbUnquote(DatumGetJsonbP(res));
+
+ (void) InputFunctionCallSafe(input_finfo, val_string, typioparam,
+ coercion->targettypmod,
+ (Node *) escontext,
+ op->resvalue);
+ }
via_populate and via_io should be mutually exclusive.
With your patch, in some cases both (coercion->via_io) and
(coercion->via_populate) are true (we can use elog to find out).
I refactored coerceJsonFuncExprOutput, so now they are mutually exclusive.
I also added asserts to it.

By default, json_query keeps quotes and json_value omits quotes.
However, json_table will be transformed to json_value or json_query
based on certain criteria, which means we need to explicitly set
JsonExpr->omit_quotes in the function transformJsonFuncExpr for the
JSON_QUERY_OP and JSON_VALUE_OP cases.

We need to know the QUOTES behavior in the function ExecEvalJsonCoercion,
because there the coercion datum source can be a scalar string item, and
for scalar items the result of the RETURNING clause depends on the QUOTES
behavior: with keep quotes and omit quotes the results are different.
Consider
JSON_QUERY(jsonb'{"rec": "[1,2]"}', '$.rec' returning int4range omit quotes);
and
JSON_QUERY(jsonb'{"rec": "[1,2]"}', '$.rec' returning int4range keep quotes);

To make sure ExecEvalJsonCoercion can distinguish keep and omit quotes,
I added a bool keep_quotes to struct JsonCoercion.
(Maybe there is a simpler way; so far, that's what I've come up with.)
The keep_quotes value will be set in the function transformJsonFuncExpr.
After refactoring, in ExecEvalJsonCoercion, if keep_quotes is true we call
JsonbToCString, otherwise JsonbUnquote.

example:
SELECT JSON_QUERY(jsonb'{"rec": "{1,2,3}"}', '$.rec' returning int[]
omit quotes);
without my changes, this returns NULL;
with my changes:
 {1,2,3}
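
The branch described above would look roughly like this (keep_quotes is the
field from my changes on top of v33; NULL handling is omitted for brevity):

    /* Sketch: pick the serialization based on the new keep_quotes flag. */
    Jsonb      *jb = DatumGetJsonbP(res);
    char       *val_string;

    if (coercion->keep_quotes)
        /* keep surrounding quotes: print the jsonb value as-is */
        val_string = JsonbToCString(NULL, &jb->root, VARSIZE(jb));
    else
        /* omit quotes: strip them from scalar strings */
        val_string = JsonbUnquote(jb);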

JSON_VALUE:
main changes:
--- a/src/test/regress/expected/jsonb_sqljson.out
+++ b/src/test/regress/expected/jsonb_sqljson.out
@@ -301,7 +301,11 @@ SELECT JSON_VALUE(jsonb '"2017-02-20"', '$'
RETURNING date) + 9;
 -- Test NULL checks execution in domain types
 CREATE DOMAIN sqljsonb_int_not_null AS int NOT NULL;
 SELECT JSON_VALUE(jsonb 'null', '$' RETURNING sqljsonb_int_not_null);
-ERROR:  domain sqljsonb_int_not_null does not allow null values
+ json_value
+
+
+(1 row)
+
I think the change is correct, given `SELECT JSON_VALUE(jsonb 'null',
'$' RETURNING int4range);` returns NULL.

I also attached a test.sql, without_patch.out (apply v33 only),
with_patch.out (my changes based on v33).
So you can see the difference after 

Re: Synchronizing slots from primary to standby

2024-01-03 Thread Amit Kapila
On Fri, Dec 29, 2023 at 12:32 PM Amit Kapila  wrote:
>
> On Fri, Dec 29, 2023 at 6:59 AM Masahiko Sawada  wrote:
> >
> > On Wed, Dec 27, 2023 at 7:43 PM Amit Kapila  wrote:
> > >
> > > >
> > > > 3) The slotsync worker uses primary_conninfo but also uses a new GUC
> > > > parameter, say slot_sync_dbname, to specify the database to connect.
> > > > The slot_sync_dbname overwrites the dbname if primary_conninfo also
> > > > specifies it. If both don't have a dbname, raise an error.
> > > >
> > >
> > > Would the users prefer to provide a value for a separate GUC instead
> > > of changing primary_conninfo? It is possible that we can have some
> > > users prefer to use one GUC and others prefer a separate GUC but we
> > > should add a new GUC if we are sure that is what users would prefer.
> > > Also, even if have to consider this option, I think we can easily
> > > later add a new GUC to provide a dbname in addition to having the
> > > provision of giving it in primary_conninfo.
> >
> > I think having two separate GUCs is more flexible for example when
> > users want to change the dbname to connect. It makes sense that the
> > slotsync worker wants to use the same connection string as the
> > walreceiver uses. But I guess today most primary_conninfo settings
> > that are set manually or are generated by tools such as pg_basebackup
> > don't have dbname. If we require a dbname in primary_conninfo, many
> > tools will need to be changed. Once the connection string is
> > generated, it would be tricky to change the dbname in it, as Shveta
> > mentioned. The users will have to carefully select the database to
> > connect when taking a base backup.
> >
>
> I see your point and agree that users need to be careful. I was trying
> to compare it with other places like the conninfo used with a
> subscription where no separate dbname needs to be provided. Now, here
> the situation is not the same because the same conninfo is used for
> different purposes (walreceiver doesn't require dbname (dbname is
> ignored even if present) whereas slotsyncworker requires dbname). I
> was just trying to see if we can avoid having a new GUC for this
> purpose. Does anyone else have an opinion on this matter?
>

Bertrand, Dilip, and others involved in this thread or otherwise, see
if you can share an opinion on the above point because it would be
good to get some more opinions before we decide to add a new GUC (for
dbname) for slotsync worker.

-- 
With Regards,
Amit Kapila.




Re: Next step towards 64bit XIDs: Switch to FullTransactionId for PGPROC->xid and XLogRecord->xl_xid

2024-01-03 Thread Pavel Borisov
Hi, Maxim, and Happy New Year!

On Mon, 1 Jan 2024 at 10:15, Maxim Orlov  wrote:

> On Fri, 29 Dec 2023 at 16:36, Matthias van de Meent <
> boekewurm+postg...@gmail.com> wrote:
>
>>
>> I don't think this is an actionable change, as this wastes 4 more bytes
>> (or 8 with alignment) in nearly all WAL records that don't use the
>> HEAP/HEAP2/XLOG rmgrs, which would then be up to 10 (if not 14, when
>> 64bit-aligned) bytes per record. Unless something like [0] gets committed
>> this will add a significant write overhead to all operations, even if they
>> are not doing anything that needs an XID.
>>
>> Also, I don't think we need to support transactions that stay alive and
>> change things for longer than 2^31 concurrently created transactions, so we
>> could well add a fxid to each WAL segment header (and checkpoint record?)
>> and calculate the fxid of each record as a relative fxid off of that.
>>
> Thank you for your reply.  Yes, this is a good idea.  Actually, this is
> exactly what I was trying to do at first.
> But in this case, we have to add more locks on TransamVariables->nextXid,
> and I've abandoned the idea.
> Maybe, I should look in this direction.
>
> On Sun, 31 Dec 2023 at 03:44, Michael Paquier  wrote:
>
>> And FWIW, echoing with Matthias, making these generic structures
>> arbitrary larger is a non-starter.  We should try to make them
>> shorter.
>>
>
> Yeah, obviously, this patch makes WAL bigger.  I'll try to look into the
> idea of fxid calculation, as mentioned above.
> It might in part be a "chicken and the egg" situation.  It is very hard to
> split the overall 64xid patch into smaller pieces with each part being
> meaningful and beneficial for current 32-bit-xid Postgres.
>

The discussion quoted by you as OP [0] shows that there is a wide range of
opinions on whether we need 64-bit XIDs and what is the better way to
implement it. We can also continue discussing implementation details in
that thread [0].

I agree that a step towards something big should not make things worse
right now. Does increasing the WAL record header by several bytes make a
detectable performance difference at all? Maybe it's worth simulating a
workload with a large number of WAL records and seeing how it performs. I
think that concerns only this patchset and could remove or substantiate
the main questions about it.
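
For reference, the relative-fxid idea quoted above could be sketched like
this; it is not from any posted patch, assumes a base fxid stored per WAL
segment, and ignores special XIDs and the rest of the bookkeeping:

    /* Sketch: reconstruct a record's 64-bit xid from a per-segment base
     * fxid plus the record's existing 32-bit xl_xid. */
    static inline FullTransactionId
    reconstruct_record_fxid(FullTransactionId segment_base, TransactionId xl_xid)
    {
        uint32      epoch = EpochFromFullTransactionId(segment_base);

        /* if the 32-bit xid wrapped past the base, it belongs to the next epoch */
        if (xl_xid < XidFromFullTransactionId(segment_base))
            epoch++;

        return FullTransactionIdFromEpochAndXid(epoch, xl_xid);
    }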

[0]
https://www.postgresql.org/message-id/CACG=ezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe=pyyj...@mail.gmail.com

Kind regards,
Pavel Borisov.


Re: XLog size reductions: Reduced XLog record header size for PG17

2024-01-03 Thread Pavel Borisov
Hi and Happy New Year!

I've looked through the patches and the change seems quite small and
justified. But on a second look, some doubt arises about whether this long
patchset really brings enough performance gain. I may be wrong, but it
saves only several bytes, and the gain would show up only in some specific
artificial workload. Did you do any measurements? Do we gain several
percent performance-wise?

Kind regards,
Pavel Borisov


Re: pg_upgrade and logical replication

2024-01-03 Thread Amit Kapila
On Wed, Jan 3, 2024 at 11:33 AM Michael Paquier  wrote:
>
> On Wed, Jan 03, 2024 at 11:24:50AM +0530, Amit Kapila wrote:
> > I think the next possible step here is to document how to upgrade the
> > logical replication nodes as previously discussed in this thread [1].
> > IIRC, there were a few issues with the steps mentioned but if we want
> > to document those we can start a separate thread for it as that
> > involves both publishers and subscribers.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/CALDaNm2pe7SoOGtRkrTNsnZPnaaY%2B2iHC40HBYCSLYmyRg0wSw%40mail.gmail.com
>
> Yep.  A second thing is whether it makes sense to have more automated
> test coverage when it comes to the interferences between subscribers
> and publishers with more complex node structures.
>

I think it would be good to finish the pending patch to improve the
IsBinaryUpgrade check [1] which we decided to do once this patch is
ready. Would you like to take that up or do you want me to finish it?

[1] - https://www.postgresql.org/message-id/ZU2TeVkUg5qEi7Oy%40paquier.xyz
[2] - https://www.postgresql.org/message-id/ZVQtUTdJACnsbbpd%40paquier.xyz

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-01-03 Thread vignesh C
On Wed, 3 Jan 2024 at 14:49, Amit Kapila  wrote:
>
> On Wed, Jan 3, 2024 at 12:09 PM vignesh C  wrote:
> >
> > On Wed, 1 Nov 2023 at 19:28, Ashutosh Bapat
> >  wrote:
> > >
> > > At this stage the standby would have various replication objects like
> > > publications, subscriptions, origins inherited from the upstream
> > > server and possibly very much active. With failover slots, it might
> > > inherit replication slots. Is it intended that the new subscriber also
> > > acts as publisher for source's subscribers OR that the new subscriber
> > > should subscribe to the upstreams of the source? Some use cases like
> > > logical standby might require that but a multi-master multi-node setup
> > > may not. The behaviour should be user configurable.
> >
> > How about we do like this:
> > a) Starting the server in binary upgrade mode(so that the existing
> > subscriptions will not try to connect to the publishers)
> >
>
> Can't we simply do it by starting the server with
> max_logical_replication_workers = 0 or is there some other need to
> start in binary upgrade mode?

I agree, max_logical_replication_workers = 0 is enough for our case.

>  b) Disable
> > the subscriptions
> >
>
> Why not simply drop the subscriptions?

Dropping subscriptions is ok as these subscriptions will not be required.

>  c) Drop the replication slots d) Drop the
> > publications
> >
>
> I am not so sure about dropping publications because, unlike
> subscriptions which can start to pull the data, there is no harm with
> publications. Similar to publications there could be some user-defined
> functions or other other objects which may not be required once the
> standby replica is converted to subscriber. I guess we need to leave
> those to the user.

Yes, that makes sense.

Regards,
Vignesh




Re: speed up a logical replica setup

2024-01-03 Thread Amit Kapila
On Wed, Jan 3, 2024 at 12:09 PM vignesh C  wrote:
>
> On Wed, 1 Nov 2023 at 19:28, Ashutosh Bapat
>  wrote:
> >
> > At this stage the standby would have various replication objects like
> > publications, subscriptions, origins inherited from the upstream
> > server and possibly very much active. With failover slots, it might
> > inherit replication slots. Is it intended that the new subscriber also
> > acts as publisher for source's subscribers OR that the new subscriber
> > should subscribe to the upstreams of the source? Some use cases like
> > logical standby might require that but a multi-master multi-node setup
> > may not. The behaviour should be user configurable.
>
> How about we do like this:
> a) Starting the server in binary upgrade mode(so that the existing
> subscriptions will not try to connect to the publishers)
>

Can't we simply do it by starting the server with
max_logical_replication_workers = 0 or is there some other need to
start in binary upgrade mode?

 b) Disable
> the subscriptions
>

Why not simply drop the subscriptions?

 c) Drop the replication slots d) Drop the
> publications
>

I am not so sure about dropping publications because, unlike
subscriptions which can start to pull the data, there is no harm with
publications. Similar to publications there could be some user-defined
functions or other objects which may not be required once the
standby replica is converted to subscriber. I guess we need to leave
those to the user.

 e) Then restart the server in normal(non-upgrade) mode.
> f) The rest of pg_subscriber work like
> create_all_logical_replication_slots, create_subscription,
> set_replication_progress, enable_subscription, etc
> This will be done by default. There will be an option
> --clean-logical-replication-info provided to allow DBA not to clean
> the objects if DBA does not want to remove these objects.
>

I agree that some kind of switch to control this action would be useful.

-- 
With Regards,
Amit Kapila.




Re: Update docs for default value of fdw_tuple_cost

2024-01-03 Thread Chris Travers
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:tested, passed

Good catch.  This is a trivial fix and so I hope we can just get it in right 
away.

The new status of this patch is: Ready for Committer


Re: verify predefined LWLocks have entries in wait_event_names.txt

2024-01-03 Thread Bertrand Drouvot
Hi,

On Tue, Jan 02, 2024 at 10:49:03PM -0500, Robert Haas wrote:
> On Tue, Jan 2, 2024 at 4:45 PM Nathan Bossart  
> wrote:
> > That seems to date back to commit 14a9101.  I can agree that the suffix is
> > somewhat redundant since these are already marked as type "LWLock", but
> > I'll admit I've been surprised by this before, too.  IMHO it makes this
> > proposed test more important because you can't just grep for a different
> > lock to find all the places you need to update.
> 
> I agree. I am pretty sure that the reason this happened in the first
> place is that I grepped for the name of some other LWLock and adjusted
> things for the new lock at every place where that found a hit.
> 
> > > - Check in both directions instead of just one?
> > >
> > > - Verify ordering?
> >
> > To do those things, I'd probably move the test to one of the scripts that
> > generates the documentation or header file (pg_wait_events doesn't tell us
> > whether a lock is predefined or what order it's listed in).  That'd cause
> > failures at build time instead of during testing, which might be kind of
> > nice, too.
> 
> Yeah, I think that would be better.

+1 to add a test and put it in a place that would produce failures at build
time. I think that having the test in the script that generates the header
file is more appropriate (as building the documentation looks less usual to
me when working on a patch).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com