Re: Weird test mixup

2024-03-17 Thread Andrey M. Borodin



> On 18 Mar 2024, at 06:04, Michael Paquier  wrote:
> 
> new function call injection_points_local() that can
> be added on top of a SQL test to make it concurrent-safe.

Maybe consider a function injection_points_attach_local('point name') instead of a
static switch?
Or even injection_points_attach_global('point name'), while the plain function
injection_points_attach('point name') stays local? This would favour writing
concurrent tests by default…
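A rough sketch of how that alternative API could read in a test (the _global function and the default behaviour are hypothetical, taken only from the suggestion above, not from the patch):

```sql
-- hypothetical default: attach() is local to the attaching backend
SELECT injection_points_attach('some-point', 'error');
-- hypothetical explicit form: the point fires in any backend that runs it
SELECT injection_points_attach_global('some-point', 'error');
```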


Best regards, Andrey Borodin.



Re: pg16: XX000: could not find pathkey item to sort

2024-03-17 Thread Ashutosh Bapat
On Thu, Mar 14, 2024 at 3:45 PM David Rowley  wrote:

> On Thu, 14 Mar 2024 at 18:23, Ashutosh Bapat
>  wrote:
> > I don't understand why root->query_pathkeys has both a and b. "a" is
> there because of GROUP BY and ORDER BY clause. But why "b"?
>
> So that the ORDER BY aggregate function can be evaluated without
> nodeAgg.c having to perform the sort. See
> adjust_group_pathkeys_for_groupagg().
>

Thanks. To me, it looks like we are gathering pathkeys which, if used to
sort the result of the overall join, would avoid sorting in as many
aggregates as possible.

relation_can_be_sorted_early() finds pathkeys which, if used to sort the
given relation, would help sort the overall join. Contrary to what I
said earlier, it might help if the base relation is sorted on "a" and "b".
What I find weird is that the sorting is not pushed down to the partitions,
where it would help most.
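For reference, a minimal schema of the shape implied by the plan below (entirely hypothetical; only the relation names and the two integer columns are taken from the EXPLAIN output):

```sql
CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE tp1 PARTITION OF t FOR VALUES FROM (0) TO (100);
CREATE TABLE td  PARTITION OF t DEFAULT;
```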

#explain verbose SELECT a, sum(b order by b) FROM t GROUP BY a ORDER BY a;
                                     QUERY PLAN
--------------------------------------------------------------------------------------
 GroupAggregate  (cost=362.21..398.11 rows=200 width=12)
   Output: t.a, sum(t.b ORDER BY t.b)
   Group Key: t.a
   ->  Sort  (cost=362.21..373.51 rows=4520 width=8)
         Output: t.a, t.b
         Sort Key: t.a, t.b
         ->  Append  (cost=0.00..87.80 rows=4520 width=8)
               ->  Seq Scan on public.tp1 t_1  (cost=0.00..32.60 rows=2260 width=8)
                     Output: t_1.a, t_1.b
               ->  Seq Scan on public.td t_2  (cost=0.00..32.60 rows=2260 width=8)
                     Output: t_2.a, t_2.b
(11 rows)

and that's the case even without parallel plans

#explain verbose SELECT a, sum(b order by b) FROM t GROUP BY a ORDER BY a;
                                     QUERY PLAN
--------------------------------------------------------------------------------------
 GroupAggregate  (cost=362.21..398.11 rows=200 width=12)
   Output: t.a, sum(t.b ORDER BY t.b)
   Group Key: t.a
   ->  Sort  (cost=362.21..373.51 rows=4520 width=8)
         Output: t.a, t.b
         Sort Key: t.a, t.b
         ->  Append  (cost=0.00..87.80 rows=4520 width=8)
               ->  Seq Scan on public.tp1 t_1  (cost=0.00..32.60 rows=2260 width=8)
                     Output: t_1.a, t_1.b
               ->  Seq Scan on public.td t_2  (cost=0.00..32.60 rows=2260 width=8)
                     Output: t_2.a, t_2.b
(11 rows)

But it could just be because the corresponding plan was not found to be
optimal, maybe because there isn't enough data in those tables.

If the problem you speculate about is different from this one, I am not able to
see it. It might help to give an example query or explain it some more.

-- 
Best Wishes,
Ashutosh Bapat


RE: speed up a logical replica setup

2024-03-17 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for updating the patch. Here are my comments.
I used Grammarly to proofread the sentences.
(The tool strongly recommends using the active voice, but I ignore that for now.)

01.

"After a successful run, the state of the target server is analagous to a fresh
logical replication setup."
s/analagous/analogous

02.

"The main difference between the logical replication setup and 
pg_createsubscriber
is the initial data copy."

Grammarly suggests:
"The initial data copy is the main difference between the logical replication
setup and pg_createsubscriber."

03.

"Only the synchronization phase is done, which ensures each table is brought up
to a synchronized state."

This sentence is not very clear to me. How about:
"pg_createsubscriber does only the synchronization phase, ensuring each table's
replication state is ready."

04.

"The pg_createsubscriber targets large database systems because most of the
execution time is spent making the initial data copy."

Hmm, but the initial data sync by logical replication also spends most of its time
making the initial data copy. IIUC the bottlenecks are a) this application must stop
and start the server several times, and b) only serial copy works. Can you
clarify that?

05.

It is better to describe the internal difference between pg_createsubscriber and the
initial sync done by logical replication. For example:
pg_createsubscriber uses the physical replication mechanism to ensure the standby
catches up to a certain point. Then, it converts the standby to a
subscriber by promoting it and creating subscriptions.

06.

"If these are not met an error will be reported."

Grammarly suggests:
"If these are not met, an error will be reported."

07.

"The given target data directory must have the same system identifier than the
source data directory."

Grammarly suggests:
"The given target data directory must have the same system identifier as the
source data directory."

08.

"If a standby server is running on the target data directory or it is a base
backup from the source data directory, system identifiers are the same."

This line is not needed if bullet style is not used. The line is just a supplement,
not a prerequisite.

09.

"The source server must accept connections from the target server. The source 
server must not be in recovery."

Grammarly suggests:
"The source server must accept connections from the target server and not be in 
recovery."

10.

"Publications cannot be created in a read-only cluster."

Same as 08, this line is not needed if bullet-style is not used.

11.

"pg_createsubscriber usually starts the target server with different connection
settings during the transformation steps. Hence, connections to target server
might fail."
 
Grammarly suggests:
"pg_createsubscriber usually starts the target server with different connection
settings during transformation. Hence, connections to the target server might 
fail."

12.

"During the recovery process,"

Grammarly suggests:
"During recovery,"

13.

"replicated so an error would occur."

Grammarly suggests:
"replicated, so an error would occur."

14.

"It would avoid situations in which WAL files from the source server might be
used by the target server."

Grammarly suggests:
"It would avoid situations in which the target server might use WAL files from
the source server."

15.

"a LSN"

s/a/an

16.

"of write-ahead"

s/of/of the/

17.

"specifies promote"

The word promote could be double-quoted.

18.

"are also added so it avoids"

Grammarly suggests:
"are added to avoid"

19.

"is accepting read-write transactions"

Grammarly suggests:
"accepts read-write transactions"

20.

New options must also be documented. This helps not only users but also
reviewers.
(Sometimes we cannot tell whether the implementation is intentional or not.)

21.

Also, I am not sure the specification is good. I would prefer to specify them by a
format string, because it reduces the number of arguments and I cannot find use cases
where users want to control the names of the objects.

However, your approach has the benefit that users can easily identify the objects
generated by pg_createsubscriber. What do others think?

22.

```
#define BASE_OUTPUT_DIR "pg_createsubscriber_output.d"
```

Nothing refers to this define.

23.
 
```
}   CreateSubscriberOptions;
...
}   LogicalRepInfo;
```

The names after the closing "}" are not needed, because we do not typedef them.

24.

While looking at the function definitions, I found that some pointers are declared
as const, but others are not. E.g., "char *lsn" in setup_recovery() won't be
changed but is not const. Is the const just missing, or is there another rule?

25.

```
static int  num_dbs = 0;
static int  num_pubs = 0;
static int  num_subs = 0;
static int  num_replslots = 0;
```

I think the names are a bit confusing. The number of generated
publications/subscriptions/replication slots
is always the same as the number of databases. They just indicate the 

Re: RFC: Logging plan of the running query

2024-03-17 Thread torikoshia

On 2024-03-14 04:33, Robert Haas wrote:
Thanks for your review!

On Wed, Mar 13, 2024 at 1:28 AM torikoshia  
wrote:

- I saw no way to find the next node to be executed from the planstate
tree, so the patch wraps all the ExecProcNode of the planstate tree at
CHECK_FOR_INTERRUPTS().


I don't think it does this correctly, because some node types have
children other than the left and right node. See /* special child
plans */ in ExplainNode().


Agreed.


But also ... having to wrap the entire plan tree like this seems
pretty awful. I don't really like the idea of a large-scale plan
modification like this in the middle of the query.


Yeah, but I haven't come up with other ideas to select the appropriate 
node to wrap.



Andres, did you have some clever idea for this feature that would
avoid the need to do this?




On 2024-03-14 10:02, jian he wrote:
On Wed, Mar 13, 2024 at 1:28 PM torikoshia  
wrote:


On Fri, Feb 16, 2024 at 11:42 PM torikoshia 


wrote:
> I'm not so sure about the implementation now, i.e. finding the next
> node
> to be executed from the planstate tree, but I'm going to try this
> approach.

Attached a patch which takes this approach.



one minor issue.
I understand that the regression test compares the expected outcome with
the test-run outcome,
but can you enlighten me about how you checked that the change you made in
contrib/auto_explain/t/001_auto_explain.pl is correct?
(I am not familiar with Perl.)


$pg_log_plan_query_output saves the output plan of pg_log_query_plan() 
and $auto_explain_output saves the output plan of auto_explain.

The test checks that both are the same using cmp_ok().

Did I answer your question?

diff --git a/src/include/commands/explain.h 
b/src/include/commands/explain.h

index cf195f1359..2d06bf297e 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -17,6 +17,8 @@
 #include "lib/stringinfo.h"
 #include "parser/parse_node.h"
+extern PGDLLIMPORT bool ProcessLogQueryPlanInterruptActive;

diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 054dd2bf62..3c6bd1ea7c 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -167,6 +167,7 @@ struct Node;
 extern bool message_level_is_interesting(int elevel);
+extern PGDLLIMPORT bool ProcessLogQueryPlanInterruptActive;

utils/elog.h is already included in src/include/postgres.h.
So you don't need to declare ProcessLogQueryPlanInterruptActive in
include/commands/explain.h?
Generally, a variable should only be declared once?


Yeah, this declaration is not necessary, and we should instead include
commands/explain.h in src/backend/access/transam/xact.c.


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-17 Thread Michael Paquier
On Mon, Mar 18, 2024 at 03:49:24AM +, Hayato Kuroda (Fujitsu) wrote:
> I think the entry can be closed:
> 
> ```
> pg_upgrade with --link mode failing upgrade of publishers
> Commit: 29d0a77fa660
> Owner: Amit Kapila
> ```
> 
> The reported failure was only related to the test script, not the feature.
> The issue has already been analyzed and the fix patch was pushed as 
> f17529b710.

If you think that this is OK, and as far as I can see this looks OK on
the thread, then this open item should be moved under "resolved before
17beta1", mentioning the commit involved in the fix.  Please see [1]
for examples.

[1]: 
https://wiki.postgresql.org/wiki/PostgreSQL_16_Open_Items#resolved_before_16beta1
--
Michael




Re: Add new error_action COPY ON_ERROR "log"

2024-03-17 Thread Michael Paquier
On Mon, Mar 18, 2024 at 12:05:17PM +0900, Kyotaro Horiguchi wrote:
> And I believe that CONTEXT, if it exists, is augmentation information
> to the primary messages. The objective of the precedent for the use of
> relname_only was somewhat different, but this use also seems legit.
> 
> In short, I think the distribution between message types (primary and
> context) is fine as it is in the latest patch.

Okay, thanks for the feedback.
--
Michael




Re: Fix for timestamp lag issue from emit_log_hook when GUC log_line_prefix has '%m'

2024-03-17 Thread Michael Paquier
On Sun, Mar 17, 2024 at 07:35:57PM +0530, Kambam Vinay wrote:
> Thanks Michael for the review. Agree with your comment on the patch.
> updated the patch with recommended change.

That should be fine.  I would suggest documenting why the reset is
done at this location, i.e. to ensure that the same formatted
timestamp is used across the board, for all log destinations as well
as hook callers that rely on log_line_prefix.

While reviewing, I have noticed that a comment was not correct: JSON
logs also use the formatted timestamp via get_formatted_log_time().

I may be able to get this one committed just before the feature freeze
of v17, as timestamp consistency for hooks that call
log_status_format() is narrow.  For now, I have added an entry in the
CF app to keep track of it:
https://commitfest.postgresql.org/48/4901/
--
Michael
From db20e006f16de889ceb884d63dd25828ec711185 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 18 Mar 2024 14:08:45 +0900
Subject: [PATCH v2] set saved_timeval_set to false before executing
 emit_log_hook()

---
 src/backend/utils/error/elog.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 52bc01058c..ea96047352 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -154,7 +154,7 @@ static int	recursion_depth = 0;	/* to detect actual recursion */
 
 /*
  * Saved timeval and buffers for formatted timestamps that might be used by
- * both log_line_prefix and csv logs.
+ * log_line_prefix, csv logs and json logs.
  */
 static struct timeval saved_timeval;
 static bool saved_timeval_set = false;
@@ -1678,6 +1678,14 @@ EmitErrorReport(void)
 	CHECK_STACK_DEPTH();
 	oldcontext = MemoryContextSwitchTo(edata->assoc_context);
 
+	/*
+	 * Reset the formatted timestamp fields before emitting any logs.  This
+	 * includes all the log destinations and the hook, as the latter may use
+	 * Log_line_prefix.
+	 */
+	saved_timeval_set = false;
+	formatted_log_time[0] = '\0';
+
 	/*
 	 * Call hook before sending message to log.  The hook function is allowed
 	 * to turn off edata->output_to_server, so we must recheck that afterward.
@@ -2948,8 +2956,6 @@ log_status_format(StringInfo buf, const char *format, ErrorData *edata)
 	appendStringInfo(buf, "%ld", log_line_number);
 break;
 			case 'm':
-/* force a log timestamp reset */
-formatted_log_time[0] = '\0';
 (void) get_formatted_log_time();
 
 if (padding != 0)
@@ -3151,9 +3157,6 @@ send_message_to_server_log(ErrorData *edata)
 
 	initStringInfo(&buf);
 
-	saved_timeval_set = false;
-	formatted_log_time[0] = '\0';
-
 	log_line_prefix(&buf, edata);
 	appendStringInfo(&buf, "%s:  ", _(error_severity(edata->elevel)));
 
-- 
2.43.0





Re: speed up a logical replica setup

2024-03-17 Thread Amit Kapila
On Mon, Mar 18, 2024 at 9:37 AM Amit Kapila  wrote:
>
> On Sat, Mar 16, 2024 at 9:13 PM Euler Taveira  wrote:
> >
> > On Fri, Mar 15, 2024, at 3:34 AM, Amit Kapila wrote:
> >
> > Did you consider adding options for publication/subscription/slot
> > names as mentioned in my previous email? As discussed in a few emails
> > above, it would be quite confusing for users to identify the logical
> > replication objects once the standby is converted to subscriber.
> >
> >
> > Yes. I was wondering to implement after v1 is pushed. I started to write a 
> > code
> > for it but I wasn't sure about the UI. The best approach I came up with was
> > multiple options in the same order. (I don't provide short options to avoid
> > possible portability issues with the order.) It means if I have 3 databases 
> > and
> > the following command-line:
> >
> > pg_createsubscriber ... --database pg1 --database pg2 --database pg3 
> > --publication
> > pubx --publication puby --publication pubz
> >
> > pubx, puby and pubz are created in the database pg1, pg2, and pg3 
> > respectively.
> >
>
> With this syntax, you want to give the user the option to specify
> publication/subscription/slot name for each database? If so, won't it
> be too verbose?
>
> > It seems we care only for publications created on the primary. Isn't
> > it possible that some of the publications have been replicated to
> > standby by that time, for example, in case failure happens after
> > creating a few publications? IIUC, we don't care for standby cleanup
> > after failure because it can't be used for streaming replication
> > anymore. So, the only choice the user has is to recreate the standby
> > and start the pg_createsubscriber again. This sounds questionable to
> > me as to whether users would like this behavior. Does anyone else have
> > an opinion on this point?
> >
> >
> > If it happens after creating a publication and before promotion, the cleanup
> > routine will drop the publications on primary and it will eventually be 
> > applied
> > to the standby via replication later.
> >
>
> Do you mean to say that the next time the user uses
> pg_createsubscriber, it will be ensured that all the prior WAL will be
> replicated? I think we need to ensure that after the restart of this
> tool and before it attempts to create the publications again, the WAL
> of the previous drop has to be replayed.
>

On further thinking, the WAL for such dropped publications should get
replayed eventually before the WAL for new publications (the
publications which will be created after restart) unless the required
WAL is removed on primary due to some reason.

-- 
With Regards,
Amit Kapila.




Re: remaining sql/json patches

2024-03-17 Thread Amit Langote
On Wed, Mar 13, 2024 at 5:47 AM Alvaro Herrera  wrote:
> About 0002:
>
> I think we should just drop it.  Look at the changes it produces in the
> plans for aliases XMLTABLE:
>
> > @@ -1556,7 +1556,7 @@ SELECT f.* FROM xmldata, LATERAL 
> > xmltable('/ROWS/ROW[COUNTRY_NAME="Japan" or COU
> > Output: f."COUNTRY_NAME", f."REGION_ID"
> > ->  Seq Scan on public.xmldata
> >   Output: xmldata.data
> > -   ->  Table Function Scan on "xmltable" f
> > +   ->  Table Function Scan on "XMLTABLE" f
> >   Output: f."COUNTRY_NAME", f."REGION_ID"
> >   Table Function Call: XMLTABLE(('/ROWS/ROW[COUNTRY_NAME="Japan" or 
> > COUNTRY_NAME="India"]'::text) PASSING (xmldata.data) COLUMNS "COUNTRY_NAME" 
> > text, "REGION_ID" integer)
> >   Filter: (f."COUNTRY_NAME" = 'Japan'::text)
>
> Here in text-format EXPLAIN, we already have the alias next to the
> "xmltable" moniker, when an alias is present.  This matches the
> query itself as well as the labels used in the "Output:" display.
> If an alias is not present, then this says just 'Table Function Scan on 
> "xmltable"'
> and the rest of the plans refers to this as "xmltable", so it's also
> fine.
>
> > @@ -1591,7 +1591,7 @@ SELECT f.* FROM xmldata, LATERAL 
> > xmltable('/ROWS/ROW[COUNTRY_NAME="Japan" or COU
> > "Parent Relationship": "Inner", 
> > 
> >  +
> > "Parallel Aware": false,
> > 
> >  +
> > "Async Capable": false, 
> > 
> >  +
> > -   "Table Function Name": "xmltable",  
> > 
> >  +
> > +   "Table Function Name": "XMLTABLE",  
> > 
> >  +
> > "Alias": "f",   
> > 
> >  +
> > "Output": ["f.\"COUNTRY_NAME\"", "f.\"REGION_ID\""],
> > 
> >  +
> > "Table Function Call": 
> > "XMLTABLE(('/ROWS/ROW[COUNTRY_NAME=\"Japan\" or 
> > COUNTRY_NAME=\"India\"]'::text) PASSING (xmldata.data) COLUMNS 
> > \"COUNTRY_NAME\" text, \"REGION_ID\" integer)",+
>
> This is the JSON-format explain.  Notice that the "Alias" member already
> shows the alias "f", so the only thing this change is doing is
> uppercasing the "xmltable" to "XMLTABLE".  We're not really achieving
> anything here.
>
> I think the only salvageable piece from this, **if anything**, is making
> the "xmltable" literal string into uppercase.  That might bring a little
> clarity to the fact that this is a keyword and not a user-introduced
> name.
>
>
> In your 0003 I think this would only have relevance in this query,
>
> +-- JSON_TABLE() with alias
> +EXPLAIN (COSTS OFF, VERBOSE)
> +SELECT * FROM
> +   JSON_TABLE(
> +   jsonb 'null', 'lax $[*]' PASSING 1 + 2 AS a, json '"foo"' AS "b c"
> +   COLUMNS (
> +   id FOR ORDINALITY,
> +   "int" int PATH '$',
> +   "text" text PATH '$'
> +   )) json_table_func;
> + 
>   QUERY PLAN
>
> +--
> --
> + Table Function Scan on "JSON_TABLE" json_table_func
> +   Output: id, "int", text
> +   Table Function Call: JSON_TABLE('null'::jsonb, '$[*]' AS 
> json_table_path_0 PASSING 3 AS a, '"foo"'::jsonb AS "b c" COLUMNS (id FOR 
> ORDINALITY, "int" integer PATH '$', text text PATH '$') PLAN 
> (json_table_path_0))
> +(3 rows)
>
> and I'm curious to see what this would output if this was to be run
> without the 0002 patch.  If I understand things correctly, the alias
> would be displayed anyway, meaning 0002 doesn't get us anything.

Patch 0002 came about because old versions of json_table patch were
changing ExplainTargetRel() incorrectly to use rte->tablefunc to get
the function type to set objectname, but rte->tablefunc is NULL
because add_rte_to_flat_rtable() zaps it.  You pointed that out in
[1].

However, 

Re: SQL:2011 application time

2024-03-17 Thread jian he
one more minor issue related to error reporting.
I've only applied v28, 0001 to 0005.

-- (parent_id, valid_at) REFERENCES [implicit]
-- FOREIGN KEY part should specify PERIOD
CREATE TABLE temporal_fk_rng2rng (
id int4range,
valid_at daterange,
parent_id int4range,
CONSTRAINT temporal_fk_rng2rng_pk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS),
CONSTRAINT temporal_fk_rng2rng_fk FOREIGN KEY (parent_id, valid_at)
REFERENCES temporal_rng
);
ERROR:  number of referencing and referenced columns for foreign key disagree

-- (parent_id, PERIOD valid_at) REFERENCES (id)
CREATE TABLE temporal_fk_rng2rng (
id int4range,
valid_at daterange,
parent_id int4range,
CONSTRAINT temporal_fk_rng2rng_pk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS),
CONSTRAINT temporal_fk_rng2rng_fk FOREIGN KEY (parent_id, PERIOD valid_at)
REFERENCES temporal_rng (id)
);
ERROR:  foreign key uses PERIOD on the referencing table but not the
referenced table

These error messages seem somewhat inconsistent with the comments above?


+ else
+ {
+ /*
+ * Check it's a btree; currently this can never fail since no other
+ * index AMs support unique indexes.  If we ever did have other types
+ * of unique indexes, we'd need a way to determine which operator
+ * strategy number is equality.  (Is it reasonable to insist that
+ * every such index AM use btree's number for equality?)
+ */
+ if (amid != BTREE_AM_OID)
+ elog(ERROR, "only b-tree indexes are supported for foreign keys");
+ eqstrategy = BTEqualStrategyNumber;
+ }

The comment says it can never fail,
but it actually failed. See:

+-- (parent_id) REFERENCES [implicit]
+-- This finds the PK (omitting the WITHOUT OVERLAPS element),
+-- but it's not a b-tree index, so it fails anyway.
+-- Anyway it must fail because the two sides have a different
definition of "unique".
+CREATE TABLE temporal_fk_rng2rng (
+ id int4range,
+ valid_at daterange,
+ parent_id int4range,
+ CONSTRAINT temporal_fk_rng2rng_pk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS),
+ CONSTRAINT temporal_fk_rng2rng_fk FOREIGN KEY (parent_id)
+ REFERENCES temporal_rng
+);
+ERROR:  only b-tree indexes are supported for foreign keys

because in transformFkeyGetPrimaryKey
we have `if (indexStruct->indisexclusion && i == indexStruct->indnatts - 1)`,
and we have pk_with_period and fk_with_period in the Constraint struct.

Maybe we can add a bool argument to transformFkeyGetPrimaryKey to
indicate that this primary key is a conperiod constraint.
Then we can check the condition: the primary key is a conperiod constraint
and fk_with_period or pk_with_period is false.

I've made a patch to make this error reporting more accurate;
you can refine it further.


v28-0001-refactor-transformFkeyGetPrimaryKey-for-bette.no-cfbot
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-17 Thread Amit Kapila
On Mon, Mar 18, 2024 at 9:58 AM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 16, 2024 at 3:55 PM Amit Kapila  wrote:
> >
> > > What's proposed with max_slot_xid_age is that during checkpoint we
> > > look at slot's xmin and catalog_xmin, and the current system txn id.
> > > Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses
> > > max_slot_xid_age, we invalidate the slot.
> > >
> >
> > I can see that in your patch (in function
> > InvalidatePossiblyObsoleteSlot()). As per my understanding, we need
> > something similar for slot xids in ComputeXidHorizons() as we are
> > doing WAL in KeepLogSeg(). In KeepLogSeg(), we compute the minimum LSN
> > location required by slots and then adjust it for
> > 'max_slot_wal_keep_size'. On similar lines, currently in
> > ComputeXidHorizons(), we compute the minimum xid required by slots
> > (procArray->replication_slot_xmin and
> > procArray->replication_slot_catalog_xmin) but then don't adjust it for
> > 'max_slot_xid_age'. I could be missing something in this but it is
> > better to keep discussing this
>
> After invalidating slots because of max_slot_xid_age, the
> procArray->replication_slot_xmin and
> procArray->replication_slot_catalog_xmin are recomputed immediately in
> InvalidateObsoleteReplicationSlots->ReplicationSlotsComputeRequiredXmin->ProcArraySetReplicationSlotXmin.
> And, later the XID horizons in ComputeXidHorizons are computed before
> the vacuum on each table via GetOldestNonRemovableTransactionId.
> Aren't these enough?
>

IIUC, this will be delayed by one cycle in the vacuum rather than
doing it when the slot's xmin age is crossed and it can be
invalidated.

> Do you want the XID horizons recomputed
> immediately, something like the below?
>

I haven't thought of the exact logic but we can try to mimic the
handling similar to WAL.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-17 Thread Bharath Rupireddy
On Sat, Mar 16, 2024 at 3:55 PM Amit Kapila  wrote:
>
> > What's proposed with max_slot_xid_age is that during checkpoint we
> > look at slot's xmin and catalog_xmin, and the current system txn id.
> > Then, if the XID age of (xmin, catalog_xmin) and current_xid crosses
> > max_slot_xid_age, we invalidate the slot.
> >
>
> I can see that in your patch (in function
> InvalidatePossiblyObsoleteSlot()). As per my understanding, we need
> something similar for slot xids in ComputeXidHorizons() as we are
> doing WAL in KeepLogSeg(). In KeepLogSeg(), we compute the minimum LSN
> location required by slots and then adjust it for
> 'max_slot_wal_keep_size'. On similar lines, currently in
> ComputeXidHorizons(), we compute the minimum xid required by slots
> (procArray->replication_slot_xmin and
> procArray->replication_slot_catalog_xmin) but then don't adjust it for
> 'max_slot_xid_age'. I could be missing something in this but it is
> better to keep discussing this

After invalidating slots because of max_slot_xid_age, the
procArray->replication_slot_xmin and
procArray->replication_slot_catalog_xmin are recomputed immediately in
InvalidateObsoleteReplicationSlots->ReplicationSlotsComputeRequiredXmin->ProcArraySetReplicationSlotXmin.
And, later the XID horizons in ComputeXidHorizons are computed before
the vacuum on each table via GetOldestNonRemovableTransactionId.
Aren't these enough? Do you want the XID horizons recomputed
immediately, something like the below?

/* Invalidate replication slots based on xmin or catalog_xmin age */
if (max_slot_xid_age > 0)
{
	if (InvalidateObsoleteReplicationSlots(RS_INVAL_XID_AGE,
										   0, InvalidOid,
										   InvalidTransactionId))
	{
		ComputeXidHorizonsResult horizons;

		/*
		 * Some slots have been invalidated; update the XID horizons
		 * as a side-effect.
		 */
		ComputeXidHorizons(&horizons);
	}
}

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-17 Thread Masahiko Sawada
On Sun, Mar 17, 2024 at 11:46 AM John Naylor  wrote:
>
> On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada  wrote:
> >
> > On Fri, Mar 15, 2024 at 4:36 PM John Naylor  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada  
> > > wrote:
>
> > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e.
> > > > overwrite) the value, I think we should not expect that found is
> > > > always false.
> > >
> > > I find that a puzzling statement, since 1) it was designed for
> > > insert-only workloads, not actual overwrite IIRC and 2) the tests will
> > > now fail if the same block is set twice, since we just switched the
> > > tests to use a remnant of vacuum's old array. Having said that, I
> > > don't object to removing artificial barriers to using it for purposes
> > > not yet imagined, as long as test_tidstore.sql warns against that.
> >
> > I think that if it supports only insert-only workload and expects the
> > same block is set only once, it should raise an error rather than an
> > assertion. It's odd to me that the function fails only in an
> > assertion build even though it actually works fine even in
> > that case.
>
> After thinking some more, I think you're right -- it's too
> heavy-handed to throw an error/assert and a public function shouldn't
> make assumptions about the caller. It's probably just a matter of
> documenting the function (and it's lack of generality), and the tests
> (which are based on the thing we're replacing).

Removed 'found' in 0003 patch.

>
> > As for test_tidstore you're right that the test code doesn't handle
> > the case where setting the same block twice. I think that there is no
> > problem in the fixed-TIDs tests, but we would need something for
> > random-TIDs tests so that we don't set the same block twice. I guess
> > it could be trivial since we can use SQL queries to generate TIDs. I'm
> > not sure how the random-TIDs tests would be like, but I think we can
> > use SELECT DISTINCT to eliminate the duplicates of block numbers to
> > use.
>
> Also, I don't think we need random blocks, since the radix tree tests
> excercise that heavily already.
>
> Random offsets is what I was thinking of (if made distinct and
> ordered), but even there the code is fairly trivial, so I don't have a
> strong feeling about it.

Agreed.

>
> > > Given the above two things, I think this function's comment needs
> > > stronger language about its limitations. Perhaps even mention that
> > > it's intended for, and optimized for, vacuum. You and I have long
> > > known that tidstore would need a separate, more complex, function to
> > > add or remove individual tids from existing entries, but it might be
> > > good to have that documented.
> >
> > Agreed.
>
> How about this:
>
>  /*
> - * Set the given TIDs on the blkno to TidStore.
> + * Create or replace an entry for the given block and array of offsets
>   *
> - * NB: the offset numbers in offsets must be sorted in ascending order.
> + * NB: This function is designed and optimized for vacuum's heap scanning
> + * phase, so has some limitations:
> + * - The offset numbers in "offsets" must be sorted in ascending order.
> + * - If the block number already exists, the entry will be replaced --
> + *   there is no way to add or remove offsets from an entry.
>   */
>  void
>  TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> *offsets,

Looks good.

>
> I think we can stop including the debug-tid-store patch for CI now.
> That would allow getting rid of some unnecessary variables.

Agreed.

>
> + * Prepare to iterate through a TidStore. Since the radix tree is locked 
> during
> + * the iteration, TidStoreEndIterate() needs to be called when finished.
>
> + * Concurrent updates during the iteration will be blocked when inserting a
> + * key-value to the radix tree.
>
> This is outdated. Locking is optional. The remaining real reason now
> is that TidStoreEndIterate needs to free memory. We probably need to
> say something about locking, too, but not this.

Fixed.

>
> + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs
> + * in one block. We return the block numbers in ascending order and the 
> offset
> + * numbers in each result is also sorted in ascending order.
> + */
> +TidStoreIterResult *
> +TidStoreIterateNext(TidStoreIter *iter)
>
> The wording is a bit awkward.

Fixed.

>
> +/*
> + * Finish an iteration over TidStore. This needs to be called after finishing
> + * or when existing an iteration.
> + */
>
> s/existing/exiting/ ?
>
> It seems to say we need to finish after finishing. Maybe more precise wording.

Fixed.

>
> +/* Extract TIDs from the given key-value pair */
> +static void
> +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
> BlocktableEntry *page)
>
> This is a leftover from the old encoding scheme. This should really
> take a "BlockNumber blockno" not a "key", and the only call site
> should probably cast the uint64 to 

Re: speed up a logical replica setup

2024-03-17 Thread Amit Kapila
On Sat, Mar 16, 2024 at 9:13 PM Euler Taveira  wrote:
>
> On Fri, Mar 15, 2024, at 3:34 AM, Amit Kapila wrote:
>
> Did you consider adding options for publication/subscription/slot
> names as mentioned in my previous email? As discussed in a few emails
> above, it would be quite confusing for users to identify the logical
> replication objects once the standby is converted to subscriber.
>
>
> Yes. I was wondering to implement after v1 is pushed. I started to write a 
> code
> for it but I wasn't sure about the UI. The best approach I came up with was
> multiple options in the same order. (I don't provide short options to avoid
> possible portability issues with the order.) It means if I have 3 databases 
> and
> the following command-line:
>
> pg_createsubscriber ... --database pg1 --database pg2 --database pg3 
> --publication
> pubx --publication puby --publication pubz
>
> pubx, puby and pubz are created in the database pg1, pg2, and pg3 
> respectively.
>

With this syntax, you want to give the user the option to specify
publication/subscription/slot name for each database? If so, won't it
be too verbose?

> It seems we care only for publications created on the primary. Isn't
> it possible that some of the publications have been replicated to
> standby by that time, for example, in case failure happens after
> creating a few publications? IIUC, we don't care for standby cleanup
> after failure because it can't be used for streaming replication
> anymore. So, the only choice the user has is to recreate the standby
> and start the pg_createsubscriber again. This sounds questionable to
> me as to whether users would like this behavior. Does anyone else have
> an opinion on this point?
>
>
> If it happens after creating a publication and before promotion, the cleanup
> routine will drop the publications on primary and it will eventually be 
> applied
> to the standby via replication later.
>

Do you mean to say that the next time the user uses
pg_createsubscriber, it will be ensured that all the prior WAL will be
replicated? I think we need to ensure that after the restart of this
tool and before it attempts to create the publications again, the WAL
of the previous drop has to be replayed.

-- 
With Regards,
Amit Kapila.




RE: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-17 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> We are pleased to announce the Release Management Team (RMT) (cc'd)
> for the PostgreSQL 17 release:
> - Robert Haas
> - Heikki Linnakangas
> - Michael Paquier

Thanks for managing the release of PostgreSQL so that it proceeds in the right direction!

> You can track open items for the PostgreSQL 17 release here:
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

I think the entry can be closed:

```
pg_upgrade with --link mode failing upgrade of publishers
Commit: 29d0a77fa660
Owner: Amit Kapila
```

The reported failure was only related to the test script, not the feature.
The issue has already been analyzed and the fix patch was pushed as f17529b710.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





Re: Statistics Import and Export

2024-03-17 Thread Corey Huinker
>
>
> * pg_set_relation_stats() needs to do an ACL check so you can't set the
> stats on someone else's table. I suggest honoring the new MAINTAIN
> privilege as well.
>

Added.


>
> * If possible, reading from pg_stats (instead of pg_statistic) would be
> ideal because pg_stats already does the right checks at read time, so a
> non-superuser can export stats, too.
>

Done. That was sorta how it was originally, so returning to that wasn't too
hard.


>
> * If reading from pg_stats, should you change the signature of
> pg_set_relation_stats() to have argument names matching the columns of
> pg_stats (e.g. most_common_vals instead of stakind/stavalues)?
>

Done.


>
> In other words, make this a slightly higher level: conceptually
> exporting/importing pg_stats rather than pg_statistic. This may also
> make the SQL export queries simpler.
>

Eh, about the same.


> Also, I'm wondering about error handling. Is some kind of error thrown
> by pg_set_relation_stats() going to abort an entire restore? That might
> be easy to prevent with pg_restore, because it can just omit the stats,
> but harder if it's in a SQL file.
>

Aside from the oid being invalid, there's not a whole lot that can go wrong
in set_relation_stats(). The error checking I did closely mirrors that in
analyze.c.

Aside from the changes you suggested, as well as the error reporting change
you suggested for pg_dump, I also filtered out attempts to dump stats on
views.

A few TAP tests are still failing and I haven't been able to diagnose why,
though the failures in parallel dump seem to be that it tries to import
stats on indexes that haven't been created yet, which is odd because I set
the dependency.

All those changes are available in the patches attached.
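For reference, given the argument names in the pg_proc.dat entry in the attached patch (relation, reltuples, relpages, relallvisible), a relation-stats call would look roughly like this (the table name and the values are made up):

```sql
-- hypothetical table; arguments follow the signature in the attached patch
SELECT pg_set_relation_stats('public.some_table'::regclass, 1000000, 12345, 12000);
```
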
From ba411ce31c25193cf05bc4206dcb8f2bf32af0c7 Mon Sep 17 00:00:00 2001
From: Corey Huinker 
Date: Mon, 11 Mar 2024 14:18:39 -0400
Subject: [PATCH v10 1/2] Create pg_set_relation_stats, pg_set_attribute_stats.

These functions will be used by pg_dump/restore and pg_upgrade to convey
relation and attribute statistics from the source database to the
target. This would be done instead of vacuumdb --analyze-in-stages.

Both functions take an oid to identify the target relation that will
receive the statistics. There is nothing requiring that relation to be
the same one as the one exported, though the statistics would have to
make sense in the context of the new relation.

The parameters of pg_set_attribute_stats intentionally mirror the columns
in the view pg_stats, with the ANYARRAY types casted to TEXT. Those
values will be cast to arrays of the basetype of the attribute, and that
operation may fail if the attribute type has changed.

The statistics imported by pg_set_attribute_stats are imported
transactionally like any other operation. However, pg_set_relation_stats
does its update in-place, which is to say non-transactionally. This is
in line with what ANALYZE does to avoid table bloat in pg_class.

These functions also allow tweaking of table statistics in place,
allowing the user to inflate rowcounts, skew histograms, etc., to see
what those changes will evoke from the query planner.
---
 src/include/catalog/pg_proc.dat   |  15 +
 src/include/statistics/statistics.h   |   2 +
 src/backend/statistics/Makefile   |   3 +-
 src/backend/statistics/meson.build|   1 +
 src/backend/statistics/statistics.c   | 521 ++
 .../regress/expected/stats_export_import.out  | 211 +++
 src/test/regress/parallel_schedule|   2 +-
 src/test/regress/sql/stats_export_import.sql  | 188 +++
 doc/src/sgml/func.sgml|  83 +++
 9 files changed, 1024 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/statistics/statistics.c
 create mode 100644 src/test/regress/expected/stats_export_import.out
 create mode 100644 src/test/regress/sql/stats_export_import.sql

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf7b..3070d72d7a 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12170,4 +12170,19 @@
   proargtypes => 'int2',
   prosrc => 'gist_stratnum_identity' },
 
+# Statistics Import
+{ oid => '8048',
+  descr => 'set statistics on relation',
+  proname => 'pg_set_relation_stats', provolatile => 'v', proisstrict => 't',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid float4 int4 int4',
+  proargnames => '{relation,reltuples,relpages,relallvisible}',
+  prosrc => 'pg_set_relation_stats' },
+{ oid => '8049',
+  descr => 'set statistics on attribute',
+  proname => 'pg_set_attribute_stats', provolatile => 'v', proisstrict => 'f',
+  proparallel => 'u', prorettype => 'bool',
+  proargtypes => 'oid name bool float4 int4 float4 text _float4 text float4 text _float4 _float4',
+  proargnames => 

Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-17 Thread Amit Kapila
On Sun, Mar 17, 2024 at 2:03 PM Bharath Rupireddy
 wrote:
>
> On Sat, Mar 16, 2024 at 3:55 PM Amit Kapila  wrote:
> >
> > procArray->replication_slot_catalog_xmin) but then don't adjust it for
> > 'max_slot_xid_age'. I could be missing something in this but it is
> > better to keep discussing this and try to move with another parameter
> > 'inactive_replication_slot_timeout' which according to me can be kept
> > at slot level instead of a GUC but OTOH we need to see the arguments
> > on both side and then decide which makes more sense.
>
> Hm. Are you suggesting inactive_timeout to be a slot level parameter
> similar to 'failover' property added recently by
> c393308b69d229b664391ac583b9e07418d411b6 and
> 73292404370c9900a96e2bebdc7144f7010339cf? With this approach, one can
> set inactive_timeout while creating the slot either via
> pg_create_physical_replication_slot() or
> pg_create_logical_replication_slot() or CREATE_REPLICATION_SLOT or
> ALTER_REPLICATION_SLOT command, and postgres tracks the
> last_inactive_at for every slot based on which the slot gets
> invalidated. If this understanding is right, I can go ahead and work
> towards it.
>

Yeah, I have something like that in mind. You can prepare the patch
but it would be good if others involved in this thread can also share
their opinion.

> Alternatively, we can go the route of making GUC a list of key-value
> pairs of {slot_name, inactive_timeout}, but this kind of GUC for
> setting slot level parameters is going to be the first of its kind, so
> I'd prefer the above approach.
>

I would prefer a slot-level parameter in this case rather than a GUC.
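For illustration only, a slot-level parameter might end up being used roughly like this (the inactive_timeout argument below is hypothetical; nothing like it exists yet):

```sql
-- hypothetical: timeout supplied when the slot is created
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput',
                                           inactive_timeout => 86400);
-- hypothetical: or adjusted later through the replication protocol, e.g.
-- ALTER_REPLICATION_SLOT my_slot (inactive_timeout = 86400);
```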

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-17 Thread Amit Kapila
On Sun, Mar 17, 2024 at 7:40 PM Alexander Korotkov  wrote:
>
> On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
>  wrote:
> > On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila  wrote:
> > > In general, it seems this patch has been stuck for a long time on the
> > > decision to choose an appropriate UI (syntax), and we thought of
> > > moving it further so that the other parts of the patch can be
> > > reviewed/discussed. So, I feel before pushing this we should see
> > > comments from a few (at least two) other senior members who earlier
> > > shared their opinion on the syntax. I know we don't have much time
> > > left but OTOH pushing such a change (where we didn't have a consensus
> > > on syntax) without much discussion at this point of time could lead to
> > > discussions after commit.
> >
> > +1 to gain consensus first on the syntax changes. With this, we might
> > be violating the SQL standard for explicit txn commands (I stand for
> > correction about the SQL standard though).
>
> Thank you for your feedback.  Generally, I agree it's correct to get
> consensus on syntax first.  And yes, this patch has been here since
> 2016.  We didn't get consensus for syntax for 8 years.  Frankly
> speaking, I don't see a reason why this shouldn't take another 8
> years.  At the same time the ability to wait on standby given LSN is
> replayed seems like pretty basic and simple functionality.  Thus, it's
> quite frustrating it already took that long and still unclear when/how
> this could be finished.
>
> My current attempt was to commit minimal implementation as less
> invasive as possible.  A new clause for BEGIN doesn't require
> additional keywords and doesn't introduce additional statements.  But
> yes, this is still a new qual.  And, yes, Amit you're right that even
> if I had committed that, there was still a high risk of further
> debates and revert.
>
> Given my skepticism about agreement over syntax, I'd like to check
> another time if we could go without new syntax at all.  There was an
> attempt to implement waiting for an LSN as a function.  But a function
> holds a snapshot, which could prevent WAL records from being replayed.
> Releasing a snapshot could break the parent query.  But now we have
> procedures, which need a dedicated statement for the call and can even
> control transactions.  Could we implement a waitlsn in a procedure
> that:
>
> 1. First, check that it was called with non-atomic context (that is,
> it's not called within a transaction). Trigger error if called with
> atomic context.
> 2. Release a snapshot to be able to wait without the risk of WAL replay
> getting stuck.  The procedure is still called within the snapshot.  It's a bit of
> a hack to release a snapshot, but Vacuum statements already do so.
>
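For context, a procedure along those lines would presumably be used on a standby roughly as follows (the procedure name and behaviour are hypothetical; nothing like this is implemented):

```sql
-- hypothetical: block until the standby has replayed the given LSN,
-- erroring out if called inside an explicit transaction block
CALL pg_wait_for_lsn('0/3000028');
SELECT * FROM my_table;  -- sees all changes up to that LSN
```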

Can you please provide a bit more details, with some example, of what
the existing problem with functions is and how using procedures will
resolve it? How will this address the implicit transaction case,
or do we have any other workaround for those cases?

-- 
With Regards,
Amit Kapila.




Re: Add new error_action COPY ON_ERROR "log"

2024-03-17 Thread Kyotaro Horiguchi
At Fri, 15 Mar 2024 16:57:25 +0900, Michael Paquier  wrote 
in 
> On Wed, Mar 13, 2024 at 07:32:37PM +0530, Bharath Rupireddy wrote:
> > I'm attaching the v8 patch set implementing the above idea. With this,
> > [1] is sent to the client, [2] is sent to the server log. This
> > approach not only reduces the duplicate info in the NOTICE and CONTEXT
> > messages, but also makes it easy for users to get all the necessary
> > info in the NOTICE message without having to set extra parameters to
> > get CONTEXT message.
> 
> In terms of eliminating the information duplication while allowing
> clients to know the value, the attribute and the line involved in the
> conversion failure, the approach of tweaking the context information
> has merits, I guess.
> 
> How do others feel about this kind of tuning?

If required, I think that we have already included the minimum
information necessary for the primary diagnosis, including locations,
within the primary messages. Here is an example:

> LOG:  0: invalid record length at 0/18049F8: expected at least 24, got 0

And I believe that CONTEXT, if it exists, is augmentation information
to the primary messages. The objective of the precedent for the use of
relname_only was somewhat different, but this use also seems legit.

In short, I think the distribution between message types (primary and
context) is fine as it is in the latest patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-17 Thread Michael Paquier
Hi,

We are pleased to announce the Release Management Team (RMT) (cc'd)
for the PostgreSQL 17 release:
- Robert Haas
- Heikki Linnakangas
- Michael Paquier

You can find information about the responsibilities of the RMT here:
https://wiki.postgresql.org/wiki/Release_Management_Team

Additionally, the RMT has set the feature freeze to be **April 8, 2024
at 0:00 AoE** (see [1]).  This is the last time to commit features for
PostgreSQL 17.  In other words, no new PostgreSQL 17 feature can be
committed after April 8, 2024 at 0:00 AoE.  As mentioned last year in
[2], this uses the "standard" feature freeze date/time.

You can track open items for the PostgreSQL 17 release here:
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

Please let us know if you have any questions.

On behalf of the PG17 RMT,

Michael

[1]: https://en.wikipedia.org/wiki/Anywhere_on_Earth
[2]: 
https://www.postgresql.org/message-id/9fbe60ec-fd1b-6ee0-240d-af7fc4442...@postgresql.org
--
Michael




Re: Autogenerate some wait events code and documentation

2024-03-17 Thread Noah Misch
On Mon, Mar 18, 2024 at 08:02:24AM +0900, Michael Paquier wrote:
> > 1. Don't back-patch wait events to v17+.  Use the closest existing event.
> > 2. Let wait_event_names.txt back-patches control the enum order.  For 
> > example,
> >a line could have an annotation that controls its position relative to 
> > the
> >auto-sorted lines.  For another example, the generator could stop 
> > sorting.
> > 3. Accept the renumbering, because the consequence isn't that horrible.

> I see an option (4), similar to your (2) without the per-line
> annotation: add a new magic keyword like the existing "Section" that
> is used in the first lines of generate-wait_event_types.pl where we
> generate tab-separated lines with the section name as prefix of each
> line.  So I can think of something like:
> Section: ClassName - WaitEventFoo
> FOO_1 "Waiting in foo1"
> FOO_2 "Waiting in foo2"
> Backpatch:
> BAR_1 "Waiting in bar1"
> BAR_2 "Waiting in bar2"
> 
> Then force the ordering for the docs and keep the elements in the
> backpatch section at the end of the enums in the order in the txt.
> One thing that could make sense is to enforce that "Backpatch" is at
> the end of a section, meaning that we would need a second keyword like
> a "Section: EndBackpatch" or something like that.  That's not strictly
> necessary IMO as the format of the txt is easy to follow.

Works for me, with or without the trailing keyword line.




Re: Weird test mixup

2024-03-17 Thread Michael Paquier
On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
> Linking all the points to a PID with a injection_points_attach_local()
> that switches a static flag while registering a before_shmem_exit() to
> do an automated cleanup sounds like the simplest approach to me based
> on what I'm reading on this thread.

Please find a patch to do exactly that, without touching the backend
APIs.  0001 adds a new function call injection_points_local() that can
be added on top of a SQL test to make it concurrent-safe.  0002 is the
fix for the GIN tests.

I am going to add an open item to not forget about all that.

Comments are welcome.
--
Michael
From a4c219f0142a36b2a009af1f9e9affdeab6ab383 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 18 Mar 2024 09:53:05 +0900
Subject: [PATCH v2 1/2] Introduce runtime conditions in module
 injection_points

This adds a new SQL function injection_points_local() that can be used
to force injection points to be run only on the process where they are
attached.  This is handy for SQL tests to:
- detach automatically local injection points when the process exits.
- allow tests with injection points to run concurrently with other test
suites.
---
 .../expected/injection_points.out |  77 
 .../injection_points--1.0.sql |  11 ++
 .../injection_points/injection_points.c   | 176 ++
 .../injection_points/sql/injection_points.sql |  20 ++
 src/tools/pgindent/typedefs.list  |   1 +
 5 files changed, 285 insertions(+)

diff --git a/src/test/modules/injection_points/expected/injection_points.out b/src/test/modules/injection_points/expected/injection_points.out
index 827456ccc5..f24a602def 100644
--- a/src/test/modules/injection_points/expected/injection_points.out
+++ b/src/test/modules/injection_points/expected/injection_points.out
@@ -115,4 +115,81 @@ NOTICE:  notice triggered for injection point TestInjectionLog2
  
 (1 row)
 
+SELECT injection_points_detach('TestInjectionLog2');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Runtime conditions
+SELECT injection_points_attach('TestConditionError', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+-- Any follow-up injection point attached will be local to this process.
+SELECT injection_points_local();
+ injection_points_local 
+
+ 
+(1 row)
+
+SELECT injection_points_attach('TestConditionLocal1', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_attach('TestConditionLocal2', 'notice');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionLocal1'); -- error
+ERROR:  error triggered for injection point TestConditionLocal1
+SELECT injection_points_run('TestConditionLocal2'); -- notice
+NOTICE:  notice triggered for injection point TestConditionLocal2
+ injection_points_run 
+--
+ 
+(1 row)
+
+-- reload, local injection points should be gone.
+\c
+SELECT injection_points_run('TestConditionLocal1'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionLocal2'); -- nothing
+ injection_points_run 
+--
+ 
+(1 row)
+
+SELECT injection_points_run('TestConditionError'); -- error
+ERROR:  error triggered for injection point TestConditionError
+SELECT injection_points_detach('TestConditionError');
+ injection_points_detach 
+-
+ 
+(1 row)
+
+-- Attaching injection points that use the same name as one defined locally
+-- previously should work.
+SELECT injection_points_attach('TestConditionLocal1', 'error');
+ injection_points_attach 
+-
+ 
+(1 row)
+
+SELECT injection_points_detach('TestConditionLocal1');
+ injection_points_detach 
+-
+ 
+(1 row)
+
 DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/injection_points--1.0.sql b/src/test/modules/injection_points/injection_points--1.0.sql
index 0a2e59aba7..1813a4d6d8 100644
--- a/src/test/modules/injection_points/injection_points--1.0.sql
+++ b/src/test/modules/injection_points/injection_points--1.0.sql
@@ -34,6 +34,17 @@ RETURNS void
 AS 'MODULE_PATHNAME', 'injection_points_wakeup'
 LANGUAGE C STRICT PARALLEL UNSAFE;
 
+--
+-- injection_points_local()
+--
+-- Trigger switch to link any future injection points attached to the
+-- current process, useful to make SQL tests concurrently-safe.
+--
+CREATE FUNCTION injection_points_local()
+RETURNS void
+AS 'MODULE_PATHNAME', 'injection_points_local'
+LANGUAGE C STRICT PARALLEL UNSAFE;
+
 --
 -- injection_points_detach()
 --
diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index 0730254d54..8be081db99 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ 

Re: Regarding the order of the header file includes

2024-03-17 Thread Richard Guo
On Wed, Mar 13, 2024 at 10:07 PM Peter Eisentraut 
wrote:

> On 12.03.24 12:47, Richard Guo wrote:
> >
> > On Wed, Mar 6, 2024 at 5:32 PM Richard Guo  > > wrote:
> >
> > Please note that this patch only addresses the order of header file
> > includes in backend modules (and might not be thorough).  It is
> possible
> > that other modules may have a similar issue, but I have not evaluated
> > them yet.
> >
> >
> > Attached is v2, which also includes the 0002 patch that addresses the
> > order of header file includes in non-backend modules.
>
> committed (as one patch)


Thanks for pushing!

Thanks
Richard


Re: Support json_errdetail in FRONTEND builds

2024-03-17 Thread Tom Lane
Daniel Gustafsson  writes:
> I took another look at this tonight and committed it with some mostly cosmetic
> changes.  Since then, tamandua failed the SSL test on this commit but I am
> unable to reproduce any error both on older OpenSSL and matching the 3.1
> version on tamandua, so not sure if its related.

That failure looks like just a random buildfarm burp:

2024-03-17 23:11:30.989 UTC [106988][postmaster][:0] LOG:  starting PostgreSQL 
17devel on x86_64-linux, compiled by gcc-12.3.0, 64-bit
2024-03-17 23:11:30.989 UTC [106988][postmaster][:0] LOG:  could not bind IPv4 
address "127.0.0.1": Address already in use
2024-03-17 23:11:30.989 UTC [106988][postmaster][:0] HINT:  Is another 
postmaster already running on port 54588? If not, wait a few seconds and retry.
2024-03-17 23:11:30.990 UTC [106988][postmaster][:0] WARNING:  could not create 
listen socket for "127.0.0.1"
2024-03-17 23:11:30.990 UTC [106988][postmaster][:0] FATAL:  could not create 
any TCP/IP sockets
2024-03-17 23:11:30.993 UTC [106988][postmaster][:0] LOG:  database system is 
shut down

regards, tom lane




Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread Erik Wienhold
On 2024-03-17 20:50 +0100, David E. Wheeler wrote:
> On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:
> > So I think it makes sense to reword the entire backslash part of the
> > paragraph and remove references to JSON entirely.  The attached patch
> > does that and also formats the backslash escapes as a bulleted list for
> > readability.
> 
> Ah, it’s JavaScript format, not JSON! This does clarify things quite
> nicely, thank you. Happy to add my review once it’s in a commit fest.

Thanks.  https://commitfest.postgresql.org/48/4899/

> > The first case ($.$foo) is in line with the restriction on member
> > accessors that you quoted first.
> 
> Huh, that’s not how I read it. Here it is again:
> 
> >> Member accessor that returns an object member with the specified
> >> key. If the key name matches some named variable starting with $ or
> >> does not meet the JavaScript rules for an identifier, it must be
> >> enclosed in double quotes to make it a string literal.
> 
> 
> Note that in my example `$foo` does not match a variable. I mean it
> looks like a variable, but none is used here. I guess it’s being
> conservative because it might be used in one of the functions, like
> jsonb_path_exists(), to which variables might be passed.

I had the same reasoning while writing my first reply but scrapped that
part because I found it obvious:  That jsonpath is parsed before calling
jsonb_path_exists() and therefore the parser has no context about any
variables, which might not even be hardcoded but may result from a
query.

> > The error message 'syntax error at or near "$oo" of jsonpath input' for
> > the second case ($.f$oo), however, looks as if the scanner identifies
> > '$oo' as a variable instead of continuing the scan of the identifier (f$oo)
> > for the member accessor.  Looks like a bug to me because a variable
> > doesn't even make sense in that place.
> 
> Right. Maybe the docs should be updated to say that a literal dollar
> sign isn’t supported in identifiers, unlike in JavaScript, except
> through escapes like this:

Unfortunately, I don't have access to that part of the SQL spec.  So I
don't know how the jsonpath grammar is specified.

I had a look into Oracle, MySQL, and SQLite docs to see what they
implement:

* Oracle requires the unquoted field names to match [A-Za-z][A-Za-z0-9]*
  (see "object steps").  It also supports variables.
  
https://docs.oracle.com/en/database/oracle/oracle-database/23/adjsn/json-path-expressions.html

* MySQL refers to ECMAScript identifiers but does not say anything about
  variables: https://dev.mysql.com/doc/refman/8.3/en/json.html#json-path-syntax

* SQLite skimps on details and does not document a grammar:
  https://sqlite.org/json1.html#path_arguments
  But it looks as if it strives for compatibility with MySQL and our dear
  Postgres: https://sqlite.org/src/doc/json-in-core/doc/json-enhancements.md

Also checked git log src/backend/utils/adt/jsonpath_scan.l for some
insights but haven't found any yet.

-- 
Erik




Re: Add pg_basetype() function to obtain a DOMAIN base type

2024-03-17 Thread jian he
Looking at it again, I found out we can simply do
`
Datum
pg_basetype(PG_FUNCTION_ARGS)
{
Oid oid;

oid =  get_fn_expr_argtype(fcinfo->flinfo, 0);
PG_RETURN_OID(getBaseType(oid));
}
`

if the type is not a domain, it works the same as pg_typeof().
if the type is a domain, pg_typeof() returns the domain type as is, while
pg_basetype() returns the base type.
so they only diverge when the argument is of a domain type.

the doc:
pg_basetype ( "any" )
regtype
   
   
   Returns the OID of the base type of a domain. If the argument
is not a type of domain,
   return the OID of the data type of the argument just like pg_typeof().
   If there's a chain of domain dependencies, it will recurse
until finding the base type.
   


also, I think this way, we only do one syscache lookup.
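
For illustration, a minimal session showing where the two functions diverge
(a sketch assuming the patch's described behavior; the chained domains mirror
the regression test):

create domain mytext as text;
create domain mytext_child_1 as mytext;
create domain mytext_child_2 as mytext_child_1;

select pg_typeof('x'::mytext_child_2);    -- mytext_child_2
select pg_basetype('x'::mytext_child_2);  -- text (recurses through the chain)
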
From d50593e7a25f3e5f05139597d7be14f9dbfe48b9 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Sun, 17 Mar 2024 10:35:52 +0800
Subject: [PATCH v4 1/1] Add pg_basetype("any") function for querying basetype
 of a domain

Currently, obtaining the base type of a domain involves a complex SQL query,
especially in the case of recursive domain types.

To solve this, use the already available getBaseType() function,
and expose it as the pg_basetype SQL function.

if the argument is not a domain type, return the type of the argument as is.
if the argument is a domain type, pg_basetype will recurse through the domain
dependency chain and return the real inner base type of the domain.

discussion:  https://postgr.es/m/CAGRrpzZSX8j=MQcbCSEisFA=ic=K3bknVfnFjAv1diVJxFHJvg@mail.gmail.com
---
 doc/src/sgml/func.sgml   | 28 +-
 src/backend/utils/adt/misc.c | 12 ++
 src/include/catalog/pg_proc.dat  |  3 +++
 src/test/regress/expected/domain.out | 36 
 src/test/regress/sql/domain.sql  | 16 +
 5 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 72c5175e..cf440b92 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -24777,6 +24777,32 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);

   
 
+  
+   
+
+ pg_basetype
+
+pg_basetype ( "any" )
+regtype
+   
+   
+   Returns the OID of the base type of a domain. If the argument is not a type of domain,
+   return the OID of the data type of the argument just like pg_typeof().
+   If there's a chain of domain dependencies, it will recurse until finding the base type.
+   
+   
+For example:
+
+CREATE DOMAIN mytext as text;
+
+SELECT pg_basetype('mytext'::mytext);
+ pg_basetype
+---
+ text
+
+   
+  
+
   

 
@@ -25263,7 +25289,7 @@ SELECT currval(pg_get_serial_sequence('sometable', 'id'));
 
   

-
+
  pg_typeof
 
 pg_typeof ( "any" )
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index d4a92d0b..2272d953 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -565,6 +565,18 @@ pg_typeof(PG_FUNCTION_ARGS)
 	PG_RETURN_OID(get_fn_expr_argtype(fcinfo->flinfo, 0));
 }
 
+/*
+ * Return the base type of the argument.
+ * iff the argument is not a type of domain, return the argument's type as is.
+ */
+Datum
+pg_basetype(PG_FUNCTION_ARGS)
+{
+	Oid			oid;
+
+	oid =  get_fn_expr_argtype(fcinfo->flinfo, 0);
+	PG_RETURN_OID(getBaseType(oid));
+}
 
 /*
  * Implementation of the COLLATE FOR expression; returns the collation
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 700f7daf..b9079be2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3877,6 +3877,9 @@
 { oid => '1619', descr => 'type of the argument',
   proname => 'pg_typeof', proisstrict => 'f', provolatile => 's',
   prorettype => 'regtype', proargtypes => 'any', prosrc => 'pg_typeof' },
+{ oid => '6312', descr => 'get the base type of a domain',
+  proname => 'pg_basetype', proisstrict => 'f', provolatile => 's',
+  prorettype => 'regtype', proargtypes => 'any', prosrc => 'pg_basetype' },
 { oid => '3162',
   descr => 'collation of the argument; implementation of the COLLATION FOR expression',
   proname => 'pg_collation_for', proisstrict => 'f', provolatile => 's',
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 6d94e844..13bf7877 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -1207,3 +1207,39 @@ create domain testdomain1 as int constraint unsigned check (value > 0);
 alter domain testdomain1 rename constraint unsigned to unsigned_foo;
 alter domain testdomain1 drop constraint unsigned_foo;
 drop domain testdomain1;
+--
+-- Get the base type of a domain
+--
+create domain mytext as text;
+create domain mytext_child_1 as mytext;
+create domain mytext_child_2 as mytext_child_1;

Re: Support json_errdetail in FRONTEND builds

2024-03-17 Thread Daniel Gustafsson
> On 17 Mar 2024, at 00:00, Daniel Gustafsson  wrote:
> 
>> On 16 Mar 2024, at 00:59, Andrew Dunstan  wrote:
>>> On Mar 16, 2024, at 8:53 AM, Daniel Gustafsson  wrote:
 On 15 Mar 2024, at 21:56, Andrew Dunstan  wrote:
> 
 I'd like to get this done ASAP so I can rebase my incremental parse 
 patchset. Daniel, do you want to commit it? If not, I can.
>>> 
>>> Sure, I can commit these patches.  It won't be until tomorrow though since I
>>> prefer to have ample time to monitor the buildfarm, if you are in a bigger 
>>> rush
>>> than that then feel free to go ahead.
>> 
>> tomorrow will be fine, thanks 
> 
> Sorry, I ran into some unforeseen scheduling issues and had less time 
> available
> than planned.  I have pushed the 0001 StringInfo patch to reduce the work for
> tomorrow when I will work on 0002 unless beaten to it.

I took another look at this tonight and committed it with some mostly cosmetic
changes.  Since then, tamandua failed the SSL test on this commit but I am
unable to reproduce any error both on older OpenSSL and matching the 3.1
version on tamandua, so not sure if its related.  Other animals have cleared
sslcheck after this, but looking at this highlights just how rare it is for a
buildfarm animal to run sslcheck =/

--
Daniel Gustafsson





Re: SQL:2011 application time

2024-03-17 Thread jian he
Hi, minor issues for patches 0001 to 0005.
+  
+   referencedagg
+   aggregates referenced rows' WITHOUT OVERLAPS
+part
+   13
+  
Comparing with surrounding items, maybe we need to add `(optional)`?
I think the explanation is not as good as the one in the referencedagg entry below:
  
   An aggregate function. Given values of this opclass,
   it returns a value combining them all. The return value
   need not be the same type as the input, but it must be a
   type that can appear on the right hand side of the "contained by"
   operator. For example the built-in range_ops
   opclass uses range_agg here, so that foreign
   keys can check fkperiod @> range_agg(pkperiod).
  


+  In other words, the reference must have a referent for its
entire duration.
+  This column must be a column with a range type.
+  In addition the referenced table must have a primary key
+  or unique constraint declared with WITHOUT PORTION.
+ 
seems you missed replacing this one.


in v28-0002, the function name is FindFKPeriodOpers,
but then v28-0005 renames it to FindFKPeriodOpersAndProcs.
Renaming the function within the same patch set seems like not a good idea?


+  
+   This is used for temporal foreign key constraints.
+   If you omit this support function, your type cannot be used
+   as the PERIOD part of a foreign key.
+  
in v28-0004, I think here "your type"  should change to "your opclass"?

+bool
+check_amproc_is_aggregate(Oid funcid)
+{
+ bool result;
+ HeapTuple tp;
+ Form_pg_proc procform;
+
+ tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for function %u", funcid);
+ procform = (Form_pg_proc) GETSTRUCT(tp);
+ result = procform->prokind == 'a';
+ ReleaseSysCache(tp);
+ return result;
+}
maybe change
`
procform->prokind == 'a';
`
to
`
procform->prokind == PROKIND_AGGREGATE;
`
or we could put the whole function into cache/lsyscache.c
and name it just proc_is_aggregate.
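
For reference, a minimal sketch of what that lsyscache-style helper could look
like (the name proc_is_aggregate and its placement are suggestions from this
review, not part of the posted patch):

bool
proc_is_aggregate(Oid funcid)
{
    HeapTuple   tp;
    bool        result;

    tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
    if (!HeapTupleIsValid(tp))
        elog(ERROR, "cache lookup failed for function %u", funcid);
    result = ((Form_pg_proc) GETSTRUCT(tp))->prokind == PROKIND_AGGREGATE;
    ReleaseSysCache(tp);
    return result;
}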


-  Added pg_dump support.
- Show the correct syntax in psql \d output for foreign keys.
In v28-0002, there seems to be no work corresponding to these two items in
the commit message?


@@ -12335,7 +12448,8 @@ validateForeignKeyConstraint(char *conname,
  Relation rel,
  Relation pkrel,
  Oid pkindOid,
- Oid constraintOid)
+ Oid constraintOid,
+ bool temporal)
do you need to change the last argument of this function to "is_period"?


+ sprintf(paramname, "$%d", riinfo->nkeys);
+ sprintf(paramname, "$%d", riinfo->nkeys);
do you think it's worth the trouble to change these to snprintf? I found a
related post at [1].

[1] https://stackoverflow.com/a/7316500/15603477




Re: Combine Prune and Freeze records emitted by vacuum

2024-03-17 Thread Heikki Linnakangas

On 15/03/2024 02:56, Melanie Plageman wrote:

Okay, so I was going to start using xl_heap_prune for vacuum here too,
but I realized it would be bigger because of the
snapshotConflictHorizon. Do you think there is a non-terrible way to
make the snapshotConflictHorizon optional? Like with a flag?


Yeah, another flag would do the trick.


I introduced a few sub-record types similar to what you suggested --
they help a bit with alignment, so I think they are worth keeping. There
are comments around them, but perhaps a larger diagram of the layout of
a the new XLOG_HEAP2_PRUNE record would be helpful.


I started doing this, but I can't find a way of laying out the diagram
that pgindent doesn't destroy. I thought I remember other diagrams in
the source code showing the layout of something (something with pages
somewhere?) that don't get messed up by pgindent, but I couldn't find
them.


See src/tools/pgindent/README, section "Cleaning up in case of failure 
or ugly output":


/*--
 * Text here will not be touched by pgindent.
 */


There is a bit of duplicated code between heap_xlog_prune() and
heap2_desc() since they both need to deserialize the record. Before the
code to do this was small and it didn't matter, but it might be worth
refactoring it that way now.


I have added a helper function to do the deserialization,
heap_xlog_deserialize_prune_and_freeze(). But I didn't start using it in
heap2_desc() because of the way the pg_waldump build file works. Do you
think the helper belongs in any of waldump's existing sources?

pg_waldump_sources = files(
'compat.c',
'pg_waldump.c',
'rmgrdesc.c',
)

pg_waldump_sources += rmgr_desc_sources
pg_waldump_sources += xlogreader_sources
pg_waldump_sources += files('../../backend/access/transam/xlogstats.c')

Otherwise, I assume I am supposed to avoid adding some big new include to
waldump.


Didn't look closely at that, but there's some precedent with 
commit/prepare/abort records. See ParseCommitRecord, xl_xact_commit, 
xl_parsed_commit et al.


Note that we don't provide WAL compatibility across major versions. You 
can fully remove the old xl_heap_freeze_page format. (We should bump 
XLOG_PAGE_MAGIC when this is committed though)



On the point of removing the freeze-only code path from
heap_page_prune() (now heap_page_prune_and_freeze()): while doing this,
I realized that heap_pre_freeze_checks() was not being called in the
case that we decide to freeze because we emitted an FPI while setting
the hint bit. I've fixed that, however, I've done so by moving
heap_pre_freeze_checks() into the critical section. I think that is not
okay? I could move it earlier and not do call it when the hint bit FPI
leads us to freeze tuples. But, I think that would lead to us doing a
lot less validation of tuples being frozen when checksums are enabled.
Or, I could make two critical sections?


I found another approach and just do the pre-freeze checks if we are
considering freezing except for the FPI criteria.


Hmm, I think you can make this simpler if you use 
XLogCheckBufferNeedsBackup() to check if the hint-bit update will 
generate a full-page-image before entering the critical section. Like 
you did to check if pruning will generate a full-page-image. I included 
that change in the attached patches.


I don't fully understand this:


/*
 * If we will freeze tuples on the page or they were all already frozen
 * on the page, if we will set the page all-frozen in the visibility 
map,
 * we can advance relfrozenxid and relminmxid to the values in
 * pagefrz->FreezePageRelfrozenXid and pagefrz->FreezePageRelminMxid.
 */
if (presult->all_frozen || presult->nfrozen > 0)
{
presult->new_relfrozenxid = pagefrz->FreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->FreezePageRelminMxid;
}
else
{
presult->new_relfrozenxid = pagefrz->NoFreezePageRelfrozenXid;
presult->new_relminmxid = pagefrz->NoFreezePageRelminMxid;
}


Firstly, the comment is outdated, because we have already done the 
freezing at this point. But more importantly, I don't understand the 
logic here. Could we check just presult->nfrozen > 0 here and get the 
same result?



I think it is probably worse to add both of them as additional optional
arguments, so I've just left lazy_scan_prune() with the job of
initializing them.


Ok.


Here are some patches on top of your patches for some further 
refactorings. Some notable changes in heap_page_prune_and_freeze():


- I moved the heap_prepare_freeze_tuple() work from the 2nd loop to the 
1st one. It seems more clear and efficient that way.


- I extracted the code to generate the WAL record to a separate function.

- Refactored the "will setting hint bit generate FPI" check as discussed 
above


These 

Re: Autogenerate some wait events code and documentation

2024-03-17 Thread Michael Paquier
On Sun, Mar 17, 2024 at 11:31:14AM -0700, Noah Misch wrote:
> I like the new developer experience of adding a wait event.  After release of
> v17, how should we approach back-patching an event, like was done in commits
> 8fa4a1a 1396b5c 78c0f85?  Each of those commits put the new event at the end
> of its released-branch wait_event.h enum.  In v17,
> generate-wait_event_types.pl sorts events to position them.

Indeed, that would be a bad idea.

> Adding an event
> will renumber others, which can make an extension report the wrong event until
> recompiled.  Extensions citus, pg_bulkload, and vector refer to static events.
> If a back-patch added WAIT_EVENT_MESSAGE_QUEUE_SOMETHING_NEW, an old-build
> pg_bulkload report of WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN would show up in
> pg_stat_activity as WAIT_EVENT_PARALLEL_BITMAP_SCAN.  (WAIT_EVENT_EXTENSION is
> not part of a generated enum, fortunately.)  Some options:
> 
> 1. Don't back-patch wait events to v17+.  Use the closest existing event.
> 2. Let wait_event_names.txt back-patches control the enum order.  For example,
>a line could have an annotation that controls its position relative to the
>auto-sorted lines.  For another example, the generator could stop sorting.
> 3. Accept the renumbering, because the consequence isn't that horrible.
> 
> Option (3) is worse than (1), but I don't have a recommendation between (1)
> and (2).  I tend to like (1), a concern being the ease of accidental
> violations.  If we had the ABI compliance checking that
> https://postgr.es/m/flat/cah2-wzk7tvglxzoz8a22af-gmo5ghojwtyrvak5zgovtrce...@mail.gmail.com
> explored, (1) would be plenty safe.  Should anything change here, or not?

(1) would be annoying, we have backpatched scaling problems in the
past even if that does not happen often.  And in some cases I can
understand why one would want to add a new wait event to track that
a patch does what is expected of it.  (2) to stop the automated
sorting would bring back the problems that this thread has spent time
to solve: people tend to not add wait events correctly, so I would
suspect issues on HEAD.  I've seen that too many times on older
branches.

I see an option (4), similar to your (2) without the per-line
annotation: add a new magic keyword like the existing "Section" that
is used in the first lines of generate-wait_event_types.pl where we
generate tab-separated lines with the section name as prefix of each
line.  So I can think of something like:
Section: ClassName - WaitEventFoo
FOO_1   "Waiting in foo1"
FOO_2   "Waiting in foo2"
Backpatch:
BAR_1   "Waiting in bar1"
BAR_2   "Waiting in bar2"

Then force the ordering for the docs and keep the elements in the
backpatch section at the end of the enums in the order in the txt.
One thing that could make sense is to enforce that "Backpatch" is at
the end of a section, meaning that we would need a second keyword like
a "Section: EndBackpatch" or something like that.  That's not strictly
necessary IMO as the format of the txt is easy to follow.
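
To make the effect concrete, a hypothetical sketch of the enum that
generate-wait_event_types.pl could then emit for the txt fragment above
(WaitEventFoo and PG_WAIT_FOO are made-up names following the existing
pattern): the sorted entries keep their numbering and the back-patched ones
are simply appended.

typedef enum
{
    WAIT_EVENT_FOO_1 = PG_WAIT_FOO,
    WAIT_EVENT_FOO_2,
    /* Backpatch section: always kept at the end, in txt order */
    WAIT_EVENT_BAR_1,
    WAIT_EVENT_BAR_2
} WaitEventFoo;

Existing enum values then never move in a released branch, which is the
property extensions rely on.
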
--
Michael


signature.asc
Description: PGP signature


Re: Built-in CTYPE provider

2024-03-17 Thread Tom Lane
Jeff Davis  writes:
> New series attached.

Coverity thinks there's something wrong with builtin_validate_locale,
and as far as I can tell it's right: the last ereport is unreachable,
because required_encoding is never changed from its initial -1 value.
It looks like there's a chunk of logic missing there, or else that
the code could be simplified further.

/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/pg_locale.c: 2519 
in builtin_validate_locale()
>>> CID 1594398:  Control flow issues  (DEADCODE)
>>> Execution cannot reach the expression "encoding != required_encoding" 
>>> inside this statement: "if (required_encoding >= 0 ...".
2519if (required_encoding >= 0 && encoding != required_encoding)
2520ereport(ERROR,
2521(errcode(ERRCODE_WRONG_OBJECT_TYPE),
2522 errmsg("encoding \"%s\" does not match 
locale \"%s\"",
2523
pg_encoding_to_char(encoding), locale)));

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-17 Thread David Rowley
On Sat, 16 Mar 2024 at 04:06, Nathan Bossart  wrote:
> I ran John Naylor's test_popcount module [0] with the following command on
> an i7-1195G7:
>
> time psql postgres -c 'select drive_popcount(1000, 1024)'
>
> Without your patches, this seems to take somewhere around 8.8 seconds.
> With your patches, it takes 0.6 seconds.  (I re-compiled and re-ran the
> tests a couple of times because I had a difficult time believing the amount
> of improvement.)
>
> [0] 
> https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO%2Bb7AEWHRFANxR1h1kxveEV%3DghLQ%40mail.gmail.com

I think most of that will come from getting rid of the indirect
function that currently exists in pg_popcount().

Using the attached quick hack, the performance using John's test
module goes from:

-- master
postgres=# select drive_popcount(1000, 1024);
Time: 9832.845 ms (00:09.833)
Time: 9844.460 ms (00:09.844)
Time: 9858.608 ms (00:09.859)

-- with attached hacky and untested patch
postgres=# select drive_popcount(1000, 1024);
Time: 2539.029 ms (00:02.539)
Time: 2598.223 ms (00:02.598)
Time: 2611.435 ms (00:02.611)

--- and with the avx512 patch on an AMD 7945HX CPU:
postgres=# select drive_popcount(1000, 1024);
Time: 564.982 ms
Time: 556.540 ms
Time: 554.032 ms

The following comment seems like it could do with some improvements.

 * Use AVX-512 Intrinsics for supported Intel CPUs or fall back the the software
 * loop in pg_bunutils.c and use the best 32 or 64 bit fast methods. If no fast
 * methods are used this will fall back to __builtin_* or pure software.

There's nothing much specific to Intel here.  AMD Zen4 has AVX512.
Plus "pg_bunutils.c" should be "pg_bitutils.c" and "the the"

How about just:

 * Use AVX-512 Intrinsics on supported CPUs. Fall back to the software loop in
 * pg_popcount_slow() when AVX-512 is unavailable.

Maybe it's worth exploring something along the lines of the attached
before doing the AVX512 stuff.  It seems like a pretty good speed-up
and will apply for CPUs without AVX512 support.

David
diff --git a/src/port/pg_bitutils.c b/src/port/pg_bitutils.c
index 640a89561a..85e45cee9b 100644
--- a/src/port/pg_bitutils.c
+++ b/src/port/pg_bitutils.c
@@ -305,7 +305,18 @@ pg_popcount(const char *buf, int bytes)
 
while (bytes >= 8)
{
-   popcnt += pg_popcount64(*words++);
+#ifdef _MSC_VER
+   popcnt += __popcnt64(*words++);
+#else
+   uint64 res;
+
+   __asm__ __volatile__(" popcntq %1,%0\n"
+: "=q"(res)
+: "rm"(*words)
+: "cc");
+   popcnt += (int) res;
+   words++;
+#endif
bytes -= 8;
}
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index f1d18a1b29..ae880db64c 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -26,6 +26,7 @@ subdir('test_misc')
 subdir('test_oat_hooks')
 subdir('test_parser')
 subdir('test_pg_dump')
+subdir('test_popcount')
 subdir('test_predtest')
 subdir('test_radixtree')
 subdir('test_rbtree')


Re: broken JIT support on Fedora 40

2024-03-17 Thread Dmitry Dolgov
> On Fri, Mar 15, 2024 at 01:54:38PM +1300, Thomas Munro wrote:
> For me it seems that the LLVMRunPasses() call, new in
>
> commit 76200e5ee469e4a9db5f9514b9d0c6a31b496bff
> Author: Thomas Munro 
> Date:   Wed Oct 18 22:15:54 2023 +1300
>
> jit: Changes for LLVM 17.
>
> is reaching code that segfaults inside libLLVM, specifically in
> llvm::InlineFunction(llvm::CallBase&, llvm::InlineFunctionInfo&, bool,
> llvm::AAResults*, bool, llvm::Function*).  First obvious question
> would be: is that NULL argument still acceptable?  Perhaps it wants
> our LLVMTargetMachineRef there:
>
>err = LLVMRunPasses(module, passes, NULL, options);
>
> But then when we see what is does with that argument, it arrives at a
> place that apparently accepts nullptr.
>
> https://github.com/llvm/llvm-project/blob/6b2bab2839c7a379556a10287034bd55906d7094/llvm/lib/Passes/PassBuilderBindings.cpp#L56
> https://github.com/llvm/llvm-project/blob/6b2bab2839c7a379556a10287034bd55906d7094/llvm/include/llvm/Passes/PassBuilder.h#L124
>
> Hrmph.  Might need an assertion build to learn more.  I'll try to look
> again next week or so.

Looks like I can reproduce this as well, libLLVM crashes when reaching
AddReturnAttributes inside InlineFunction, when trying to access
operands of the return instruction. I think, for whatever reason, the
latest LLVM doesn't like (i.e. does not expect them when performing the
inlining pass) return instructions that do not have a return value, and
this is what happens in the outblock of deform function we generate
(slot_compile_deform).

For verification, I've modified the deform.outblock to call LLVMBuildRet
instead of LLVMBuildRetVoid and this seems to help -- inline and deform
stages are still performed as before, but nothing crashes. But of course
it doesn't sound right that the inlining pass cannot process such code.
Unfortunately I don't see any obvious change in the recent LLVM history
that would justify this outcome, might be a genuine bug, will
investigate further.




Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread David E. Wheeler
On Mar 17, 2024, at 15:12, Erik Wienhold  wrote:

> Hi David,

Hey Erik. Thanks for the detailed reply and patch!

> So I think it makes sense to reword the entire backslash part of the
> paragraph and remove references to JSON entirely.  The attached patch
> does that and also formats the backslash escapes as a bulleted list for
> readability.

Ah, it’s JavaScript format, not JSON! This does clarify things quite nicely, 
thank you. Happy to add my review once it’s in a commit fest.

> The first case ($.$foo) is in line with the restriction on member
> accessors that you quoted first.

Huh, that’s not how I read it. Here it is again:

>> Member accessor that returns an object member with the specified
>> key. If the key name matches some named variable starting with $ or
>> does not meet the JavaScript rules for an identifier, it must be
>> enclosed in double quotes to make it a string literal.


Note that in my example `$foo` does not match a variable. I mean it looks like 
a variable, but none is used here. I guess it’s being conservative because it 
might be used in one of the functions, like jsonb_path_exists(), to which 
variables might be passed.

> The error message 'syntax error at or near "$oo" of jsonpath input' for
> the second case ($.f$oo), however, looks as if the scanner identifies
> '$oo' as a variable instead of contiuing the scan of identifier (f$oo)
> for the member accessor.  Looks like a bug to me because a variable
> doesn't even make sense in that place.

Right. Maybe the docs should be updated to say that a literal dollar sign isn’t 
supported in identifiers, unlike in JavaScript, except through escapes like 
this:

> What works though, besides double quoting, is escaping the dollar sign:
> 
> regress=# select '$.\u0024foo'::jsonpath;
> jsonpath
> --
> $."$foo"
> (1 row)
> 
> And we've come full circle :)



Best,

David






Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-17 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> Overall, I think this is heading in the right direction. I think we
>> just need a good way to say "the n'th output column of the subplan",
>> that can't be confused with anything else in the output.

> We could consider notations like "(SubPlan 1 column 2)", which
> couldn't be confused with anything else, if only because a name
> like that would have to be double-quoted.  It's a little more
> verbose but not that much.  I fear "(SubPlan 1 col 2)" is too
> short to be clear.

Here's a cleaned-up version that seems to successfully resolve
PARAM_EXEC references in all cases.  I haven't changed the
"(plan_name).colN" notation, but that's an easy fix once we have
consensus on the spelling.

There are two other loose ends bothering me:

1.  I see that Gather nodes sometimes print things like

   ->  Gather (actual rows=N loops=N)
 Workers Planned: 2
 Params Evaluated: $0, $1
 Workers Launched: N

This now sticks out like a sore thumb, because there's no other
reference to $0 or $1 in the EXPLAIN output.  We could possibly
adjust the code to print something like

Params Evaluated: (InitPlan 1).col1, (InitPlan 2).col1

but I think that's pretty silly.  This looks to me like a code
debugging aid that shouldn't have survived past initial development.
It's of zero use to end users, and it doesn't correspond to anything
we bother to mention in EXPLAIN output in any other case: initplans
just magically get evaluated at the right times.  I propose we
nuke the "Params Evaluated" output altogether.

2.  MULTIEXPR_SUBLINK subplans now result in EXPLAIN output like

explain (verbose, costs off)
update inhpar i set (f1, f2) = (select i.f1, i.f2 || '-' from int4_tbl limit 1);
  ...
   ->  Result
 Output: (SubPlan 1).col1, (SubPlan 1).col2, (SubPlan 1), i.tableoid, 
i.ctid

The undecorated reference to (SubPlan 1) is fairly confusing, since
it doesn't correspond to anything that will actually get output.
I suggest that perhaps instead this should read

 Output: (SubPlan 1).col1, (SubPlan 1).col2, IGNORE(SubPlan 1), 
i.tableoid, i.ctid

or

 Output: (SubPlan 1).col1, (SubPlan 1).col2, RESET(SubPlan 1), 
i.tableoid, i.ctid

the point of "RESET()" being that what the executor actually does
there is to re-arm the SubPlan to be evaluated again on the next
pass through the targetlist.  I'm not greatly in love with either
of those ideas, though.  Any thoughts?

regards, tom lane

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 58a603ac56..2609255da6 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3062,10 +3062,10 @@ select exists(select 1 from pg_enum), sum(c1) from ft1;
 QUERY PLAN
 --
  Foreign Scan
-   Output: $0, (sum(ft1.c1))
+   Output: (InitPlan 1).col1, (sum(ft1.c1))
Relations: Aggregate on (public.ft1)
Remote SQL: SELECT sum("C 1") FROM "S 1"."T 1"
-   InitPlan 1 (returns $0)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
 (6 rows)
 
@@ -3080,8 +3080,8 @@ select exists(select 1 from pg_enum), sum(c1) from ft1 group by 1;
 QUERY PLAN 
 ---
  GroupAggregate
-   Output: $0, sum(ft1.c1)
-   InitPlan 1 (returns $0)
+   Output: (InitPlan 1).col1, sum(ft1.c1)
+   InitPlan 1
  ->  Seq Scan on pg_catalog.pg_enum
->  Foreign Scan on public.ft1
  Output: ft1.c1
@@ -3305,10 +3305,10 @@ select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
 
 explain (verbose, costs off)
 select sum(c2) filter (where c2 in (select c2 from ft1 where c2 < 5)) from ft1;
-QUERY PLAN 

+  QUERY PLAN   
+---
  Aggregate
-   Output: sum(ft1.c2) FILTER (WHERE (hashed SubPlan 1))
+   Output: sum(ft1.c2) FILTER (WHERE (ANY (ft1.c2 = (hashed SubPlan 1).col1)))
->  Foreign Scan on public.ft1
  Output: ft1.c2
  Remote SQL: SELECT c2 FROM "S 1"."T 1"
@@ -6171,9 +6171,9 @@ UPDATE ft2 AS target SET (c2, c7) = (
  Update on public.ft2 target
Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2, c7 = $3 WHERE ctid = $1
->  Foreign Scan on public.ft2 target
- Output: $1, $2, (SubPlan 1 (returns $1,$2)), target.ctid, target.*
+ Output: (SubPlan 1).col1, (SubPlan 1).col2, (SubPlan 1), target.ctid, target.*
  Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM "S 1"."T 1" WHERE (("C 1" > 1100)) FOR UPDATE
- SubPlan 1 (returns $1,$2)
+ SubPlan 1
->  

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-17 Thread Tomas Vondra



On 3/17/24 17:38, Andres Freund wrote:
> Hi,
> 
> On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
>> On 3/16/24 20:12, Andres Freund wrote:
>>> That would address some of the worst behaviour, but it doesn't really seem 
>>> to
>>> address the underlying problem of the two iterators being modified
>>> independently. ISTM the proper fix would be to protect the state of the
>>> iterators with a single lock, rather than pushing down the locking into the
>>> bitmap code. OTOH, we'll only need one lock going forward, so being economic
>>> in the effort of fixing this is also important.
>>>
>>
>> Can you share some details about how you identified the problem, counted
>> the prefetches that happen too late, etc? I'd like to try to reproduce
>> this to understand the issue better.
> 
> There's two aspects. Originally I couldn't reliably reproduce the regression
> with Melanie's repro on my laptop. I finally was able to do so after I
> a) changed the block device's read_ahead_kb to 0
> b) used effective_io_concurrency=1
> 
> That made the difference between the BitmapAdjustPrefetchIterator() locations
> very significant, something like 2.3s vs 12s.
> 

Interesting. I haven't thought about read_ahead_kb, but in hindsight it
makes sense it affects these cases. OTOH I did not set it to 0 on either
machine (the 6xSATA RAID0 has it at 12288, for example) and yet that's
how we found the regressions.

For eic it makes perfect sense that setting it to 1 is particularly
vulnerable to this issue - it only takes a small "desynchronization" of
the two iterators for the prefetch to "fall behind" and frequently
prefetch blocks we already read.

> Besides a lot of other things, I finally added debugging fprintfs printing the
> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
> large amount of output that generates shows that two iterators were out of
> sync.
> 

Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
was hoping you came up with some smart way to trace this king of stuff.
For example I was wondering if ebpf would be a more convenient way.

> 
>> If I understand correctly, what may happen is that a worker reads blocks
>> from the "prefetch" iterator, but before it manages to issue the
>> posix_fadvise, some other worker already did pread. Or can the iterators
>> get "out of sync" in a more fundamental way?
> 
> I agree that the current scheme of two shared iterators being used has some
> fairly fundamental raciness. But I suspect there's more than that going on
> right now.
> 
> Moving BitmapAdjustPrefetchIterator() to later drastically increases the
> raciness because it means table_scan_bitmap_next_block() happens between
> increasing the "real" and the "prefetch" iterators.
> 
> An example scenario that, I think, leads to the iterators being out of sync,
> without there being races between iterator advancement and completing
> prefetching:
> 
> start:
>   real -> block 0
>   prefetch -> block 0
>   prefetch_pages = 0
>   prefetch_target = 1
> 
> W1: tbm_shared_iterate(real) -> block 0
> W2: tbm_shared_iterate(real) -> block 1
> W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
> W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
> W1: read block 0
> W2: read block 1
> W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) 
> -> 2, prefetch block 2
> W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target
> 
> W1: tbm_shared_iterate(real) -> block 2
> W2: tbm_shared_iterate(real) -> block 3
> 
> W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
> W2: read block 3
> W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, 
> prefetch block 3
> 
> So afaict here we end up prefetching a block that the *same process* just had
> read.
> 

Uh, that's very weird. I'd understand it if there were some cross-process
issue, but if this happens in a single process ... strange.

> ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
> separately from advancing the "real" iterator, is pretty ugly for non-parallel
> BHS and just straight up broken in the parallel case.
> 

Yeah, I agree with the feeling it's an ugly fix. Definitely seems more
like fixing symptoms than the actual problem.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Make attstattarget nullable

2024-03-17 Thread Nathan Bossart
On Sun, Mar 17, 2024 at 01:51:39PM +0100, Peter Eisentraut wrote:
> I have committed this patch series.  Thanks.

My compiler is worried that "newtarget" might be getting used
uninitialized.  AFAICT there's no actual risk here, so I think initializing
it to 0 is sufficient.  I'll commit the attached patch shortly.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 5f49479832..1db3ef69d2 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -606,7 +606,7 @@ AlterStatistics(AlterStatsStmt *stmt)
 	bool		repl_null[Natts_pg_statistic_ext];
 	bool		repl_repl[Natts_pg_statistic_ext];
 	ObjectAddress address;
-	int			newtarget;
+	int			newtarget = 0;
 	bool		newtarget_default;
 
 	/* -1 was used in previous versions for the default setting */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ae6719329e..3ed0618b4e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8711,7 +8711,7 @@ ATExecDropExpression(Relation rel, const char *colName, bool missing_ok, LOCKMOD
 static ObjectAddress
 ATExecSetStatistics(Relation rel, const char *colName, int16 colNum, Node *newValue, LOCKMODE lockmode)
 {
-	int			newtarget;
+	int			newtarget = 0;
 	bool		newtarget_default;
 	Relation	attrelation;
 	HeapTuple	tuple,


Re: Q: Escapes in jsonpath Idents

2024-03-17 Thread Erik Wienhold
Hi David,

On 2024-03-16 19:39 +0100, David E. Wheeler wrote:
> The jsonpath doc[1] has an excellent description of the format of
> strings, but for unquoted path keys, it simply says:
> 
> > Member accessor that returns an object member with the specified
> > key. If the key name matches some named variable starting with $ or
> > does not meet the JavaScript rules for an identifier, it must be
> > enclosed in double quotes to make it a string literal.
> 
> I went looking for the JavaScript rules for an identifier and found
> this in the MDN docs[2]:
> 
> > In JavaScript, identifiers can contain Unicode letters, $, _, and
> > digits (0-9), but may not start with a digit. An identifier differs
> > from a string in that a string is data, while an identifier is part
> > of the code. In JavaScript, there is no way to convert identifiers
> > to strings, but sometimes it is possible to parse strings into
> > identifiers.
> 
> 
> However, the Postgres parsing of jsonpath keys appears to follow the
> same rules as strings, allowing backslash escapes:
> 
> david=# select '$.fo\u00f8 == $x'::jsonpath;
>       jsonpath
> -------------------
>  ($."foø" == $"x")
> 
> This would seem to contradict the documentation. Is this behavior
> required by the SQL standard? Do the docs need updating? Or should the
> code actually follow the JSON identifier behavior?

That quoted MDN page does not give the whole picture.  ECMAScript and JS
do allow Unicode escape sequences in identifier names:

https://262.ecma-international.org/#sec-identifier-names
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#identifiers

> PS: Those excellent docs on strings mentions support for \v, but the
> grammar in the right nav of https://www.json.org/json-en.html does
> not. Another bonus feature?

You refer to that sentence: "Other special backslash sequences include
those recognized in JSON strings: \b, \f, \n, \r, \t, \v for various
ASCII control characters, and \u for a Unicode character identified
by its 4-hex-digit code point."

Mentioning JSON and \v in the same sentence is wrong: JavaScript allows
that escape in strings but JSON doesn't.  I think the easiest is to just
replace "JSON" with "JavaScript" in that sentence to make it right.  The
paragraph also already says "embedded string literals follow JavaScript/
ECMAScript conventions", so mentioning JSON seems unnecessary to me.

The last sentence also mentions backslash escapes \xNN and \u{N...} as
deviations from JSON when in fact those are valid escape sequences from
ECMA-262: https://262.ecma-international.org/#prod-HexEscapeSequence
So I think it makes sense to reword the entire backslash part of the
paragraph and remove references to JSON entirely.  The attached patch
does that and also formats the backslash escapes as a bulleted list for
readability.

> [1]: https://www.postgresql.org/docs/16/datatype-json.html#DATATYPE-JSONPATH
> [2]: https://developer.mozilla.org/en-US/docs/Glossary/Identifier

On 2024-03-16 21:33 +0100, David E. Wheeler wrote:
> On Mar 16, 2024, at 14:39, David E. Wheeler 
> wrote:
> 
> > I went looking for the JavaScript rules for an identifier and found
> > this in the MDN docs[2]:
> > 
> >> In JavaScript, identifiers can contain Unicode letters, $, _, and
> >> digits (0-9), but may not start with a digit. An identifier differs
> >> from a string in that a string is data, while an identifier is part
> >> of the code. In JavaScript, there is no way to convert identifiers
> >> to strings, but sometimes it is possible to parse strings into
> >> identifiers.
> 
> Coda: Dollar signs don’t work at all outside double-quoted string
> identifiers:
> 
> david=# select '$.$foo'::jsonpath;
> ERROR:  syntax error at or near "$foo" of jsonpath input
> LINE 1: select '$.$foo'::jsonpath;
>^
> 
> david=# select '$.f$oo'::jsonpath;
> ERROR:  syntax error at or near "$oo" of jsonpath input
> LINE 1: select '$.f$oo'::jsonpath;
>^
> 
> david=# select '$."$foo"'::jsonpath;
>  jsonpath 
> --
>  $."$foo"
> 
> This, too, contradicts the MDN definition of an identifier (and some
> quick browser tests).

The first case ($.$foo) is in line with the restriction on member
accessors that you quoted first.

The error message 'syntax error at or near "$oo" of jsonpath input' for
the second case ($.f$oo), however, looks as if the scanner identifies
'$oo' as a variable instead of continuing the scan of the identifier (f$oo)
for the member accessor.  Looks like a bug to me because a variable
doesn't even make sense in that place.

What works though, besides double quoting, is escaping the dollar sign:

regress=# select '$.\u0024foo'::jsonpath;
 jsonpath
--
 $."$foo"
(1 row)

And we've come full circle :)

-- 
Erik
>From a2bade71867aecbea90c7c03f0295cecca0c215d Mon Sep 17 00:00:00 2001
From: Erik Wienhold 
Date: Sun, 17 Mar 2024 19:28:07 +0100
Subject: [PATCH v1] Simplify docs on 

Re: Autogenerate some wait events code and documentation

2024-03-17 Thread Noah Misch
On Wed, Jul 05, 2023 at 10:57:19AM +0900, Michael Paquier wrote:
> I have applied it.

I like the new developer experience of adding a wait event.  After release of
v17, how should we approach back-patching an event, like was done in commits
8fa4a1a 1396b5c 78c0f85?  Each of those commits put the new event at the end
of its released-branch wait_event.h enum.  In v17,
generate-wait_event_types.pl sorts events to position them.  Adding an event
will renumber others, which can make an extension report the wrong event until
recompiled.  Extensions citus, pg_bulkload, and vector refer to static events.
If a back-patch added WAIT_EVENT_MESSAGE_QUEUE_SOMETHING_NEW, an old-build
pg_bulkload report of WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN would show up in
pg_stat_activity as WAIT_EVENT_PARALLEL_BITMAP_SCAN.  (WAIT_EVENT_EXTENSION is
not part of a generated enum, fortunately.)  Some options:

1. Don't back-patch wait events to v17+.  Use the closest existing event.
2. Let wait_event_names.txt back-patches control the enum order.  For example,
   a line could have an annotation that controls its position relative to the
   auto-sorted lines.  For another example, the generator could stop sorting.
3. Accept the renumbering, because the consequence isn't that horrible.

Option (3) is worse than (1), but I don't have a recommendation between (1)
and (2).  I tend to like (1), a concern being the ease of accidental
violations.  If we had the ABI compliance checking that
https://postgr.es/m/flat/cah2-wzk7tvglxzoz8a22af-gmo5ghojwtyrvak5zgovtrce...@mail.gmail.com
explored, (1) would be plenty safe.  Should anything change here, or not?




Re: pg_upgrade failing for 200+ million Large Objects

2024-03-17 Thread Laurenz Albe
On Sat, 2024-03-16 at 18:46 -0400, Tom Lane wrote:
> > Without the patch:
> > Runtime: 74.5 minutes
> 
> > With the patch:
> > Runtime: 70 minutes
> 
> Hm, I'd have hoped for a bit more runtime improvement.

I did a second run with the patch, and that finished in 66 minutes,
so there is some jitter there.

I think the reduced memory footprint and the reduced transaction ID
consumption alone make this patch worthwhile.

Yours,
Laurenz Albe




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-17 Thread Andres Freund
Hi,

On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
> On 3/16/24 20:12, Andres Freund wrote:
> > That would address some of the worst behaviour, but it doesn't really seem 
> > to
> > address the underlying problem of the two iterators being modified
> > independently. ISTM the proper fix would be to protect the state of the
> > iterators with a single lock, rather than pushing down the locking into the
> > bitmap code. OTOH, we'll only need one lock going forward, so being economic
> > in the effort of fixing this is also important.
> >
>
> Can you share some details about how you identified the problem, counted
> the prefetches that happen too late, etc? I'd like to try to reproduce
> this to understand the issue better.

There's two aspects. Originally I couldn't reliably reproduce the regression
with Melanie's repro on my laptop. I finally was able to do so after I
a) changed the block device's read_ahead_kb to 0
b) used effective_io_concurrency=1

That made the difference between the BitmapAdjustPrefetchIterator() locations
very significant, something like 2.3s vs 12s.

Besides a lot of other things, I finally added debugging fprintfs printing the
pid, (prefetch, read), block number. Even looking at tiny excerpts of the
large amount of output that generates shows that two iterators were out of
sync.


> If I understand correctly, what may happen is that a worker reads blocks
> from the "prefetch" iterator, but before it manages to issue the
> posix_fadvise, some other worker already did pread. Or can the iterators
> get "out of sync" in a more fundamental way?

I agree that the current scheme of two shared iterators being used has some
fairly fundamental raciness. But I suspect there's more than that going on
right now.

Moving BitmapAdjustPrefetchIterator() to later drastically increases the
raciness because it means table_scan_bitmap_next_block() happens between
increasing the "real" and the "prefetch" iterators.

An example scenario that, I think, leads to the iterators being out of sync,
without there being races between iterator advancement and completing
prefetching:

start:
  real -> block 0
  prefetch -> block 0
  prefetch_pages = 0
  prefetch_target = 1

W1: tbm_shared_iterate(real) -> block 0
W2: tbm_shared_iterate(real) -> block 1
W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
W1: read block 0
W2: read block 1
W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 
2, prefetch block 2
W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target

W1: tbm_shared_iterate(real) -> block 2
W2: tbm_shared_iterate(real) -> block 3

W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
W2: read block 3
W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, 
prefetch block 3

So afaict here we end up prefetching a block that the *same process* just had
read.

ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
separately from advancing the "real" iterator, is pretty ugly for non-parallel
BHS and just straight up broken in the parallel case.

Greetings,

Andres Freund




Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-17 Thread Tom Lane
Dean Rasheed  writes:
> On Sat, 16 Mar 2024 at 17:25, Tom Lane  wrote:
>> After looking at your draft some more, it occurred to me that we're
>> not that far from getting rid of showing "$n" entirely in this
>> context.  Instead of (subplan_name).$n, we could write something like
>> (subplan_name).colN or (subplan_name).columnN or (subplan_name).fN,

> I think it would be confusing if there were tables whose columns are
> named "colN". In that case, a SQL qual like "t1.col2 = t2.col2" might
> be output as something like "t1.col2 = (SubPlan 1).col3", since the
> subplan's targetlist wouldn't necessarily just output the table
> columns in order.

Perhaps.  I think people who are using columns named like that are
already accustomed to having to pay close attention to which table
the column is shown as qualified by.  So I'm not sure there'd really
be much problem in practice.

> I actually think "$n" is not so bad (especially if we make n the
> column number). The fact that it's qualified by the subplan name ought
> to be sufficient to avoid it being confused with an external
> parameter.

That's an interesting compromise position, but I'm not sure that
it buys much compared to the approach shown in your draft (that
is, continuing to use the real param IDs).  The real IDs at least
have the advantage of being unique.

> Whatever name is chosen, I think we should still output "(returns
> ...)" on the subplan nodes.

We should do that if we continue to show real param IDs, but
if we change to using column numbers then I think it's pointless.
Output like

SubPlan 1 (returns $1,$2)
  ...
SubPlan 2 (returns $1,$2,$3)

seems to me that it'd be more confusing not less so.  Does SubPlan 2
return the same values as SubPlan 1 plus more?

> Overall, I think this is heading in the right direction. I think we
> just need a good way to say "the n'th output column of the subplan",
> that can't be confused with anything else in the output.

We could consider notations like "(SubPlan 1 column 2)", which
couldn't be confused with anything else, if only because a name
like that would have to be double-quoted.  It's a little more
verbose but not that much.  I fear "(SubPlan 1 col 2)" is too
short to be clear.

regards, tom lane




Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-03-17 Thread Alexander Korotkov
Hi Amit,
Hi Bharath,

On Sat, Mar 16, 2024 at 5:05 PM Bharath Rupireddy
 wrote:
> On Sat, Mar 16, 2024 at 4:26 PM Amit Kapila  wrote:
> > In general, it seems this patch has been stuck for a long time on the
> > decision to choose an appropriate UI (syntax), and we thought of
> > moving it further so that the other parts of the patch can be
> > reviewed/discussed. So, I feel before pushing this we should see
> > comments from a few (at least two) other senior members who earlier
> > shared their opinion on the syntax. I know we don't have much time
> > left but OTOH pushing such a change (where we didn't have a consensus
> > on syntax) without much discussion at this point of time could lead to
> > discussions after commit.
>
> +1 to gain consensus first on the syntax changes. With this, we might
> be violating the SQL standard for explicit txn commands (I stand for
> correction about the SQL standard though).

Thank you for your feedback.  Generally, I agree it's correct to get
consensus on syntax first.  And yes, this patch has been here since
2016.  We didn't get consensus for syntax for 8 years.  Frankly
speaking, I don't see a reason why this shouldn't take another 8
years.  At the same time the ability to wait on standby given LSN is
replayed seems like pretty basic and simple functionality.  Thus, it's
quite frustrating that it has already taken that long and it is still unclear
when/how this could be finished.

My current attempt was to commit a minimal implementation that is as
non-invasive as possible.  A new clause for BEGIN doesn't require
additional keywords and doesn't introduce additional statements.  But
yes, this is still a new qual.  And, yes, Amit you're right that even
if I had committed that, there was still a high risk of further
debates and revert.

Given my skepticism about reaching agreement on syntax, I'd like to check
one more time whether we could go without new syntax at all.  There was an
attempt to implement waiting for LSN as a function.  But a function
holds a snapshot, which could prevent WAL records from being replayed.
Releasing a snapshot could break the parent query.  But now we have
procedures, which need a dedicated statement for the call and can even
control transactions.  Could we implement a waitlsn in a procedure
that:

1. First, check that it was called in a non-atomic context (that is,
it's not called within a transaction block). Raise an error if called in an
atomic context.
2. Release the snapshot to be able to wait without the risk of WAL replay
getting stuck.  The procedure is still called within a snapshot.  It's a bit
of a hack to release it, but VACUUM statements already do so.
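
A minimal sketch of the C entry point such a procedure could use (pg_wait_lsn
is a hypothetical name, argument handling is omitted; the snapshot handling
and the wait loop are only indicated, and whether releasing the snapshot here
is actually safe is exactly the open question):

#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "utils/snapmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_wait_lsn);

Datum
pg_wait_lsn(PG_FUNCTION_ARGS)
{
    CallContext *context = (CallContext *) fcinfo->context;

    /* 1. refuse to run in an atomic context (e.g. inside a transaction block) */
    if (context == NULL || !IsA(context, CallContext) || context->atomic)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("pg_wait_lsn() must be called in a non-atomic context")));

    /* 2. release the snapshot so it cannot hold back WAL replay while we wait */
    if (ActiveSnapshotSet())
        PopActiveSnapshot();

    /* ... wait here until the target LSN has been replayed ... */

    PG_RETURN_VOID();
}

It would be exposed with CREATE PROCEDURE ... LANGUAGE C and invoked with
CALL outside an explicit transaction block, which is what supplies the
non-atomic CallContext.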

Amit, Bharath, what do you think about this approach?  Is this a way to go?

--
Regards,
Alexander Korotkov




Re: Fix for timestamp lag issue from emit_log_hook when GUC log_line_prefix has '%m'

2024-03-17 Thread Kambam Vinay
Thanks Michael for the review. I agree with your comment on the patch and
have updated the patch with the recommended change.
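
The recommended change, sketched (hypothetical placement; saved_timeval_set
and formatted_log_time are the existing static variables in elog.c), is to
reset the cached timestamp state once at the top of EmitErrorReport(), before
emit_log_hook runs:

    /* in EmitErrorReport(), before calling any emit_log_hook */
    saved_timeval_set = false;
    formatted_log_time[0] = '\0';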

Thanks,
Vinay

On Mon, Mar 11, 2024 at 1:13 PM Michael Paquier  wrote:

> On Sat, Mar 09, 2024 at 09:09:39PM +0530, Kambam Vinay wrote:
> > We observed a slight lag in timestamp for a few logs from the
> emit_log_hook
> > hook implementation when the log_line_prefix GUC has '%m'.
> >
> > Upon debugging, we found that the saved_timeval_set variable is set to
> > 'true' in get_formatted_log_time() but is not reset to 'false' until the
> > next call to send_message_to_server_log(). Due to this, saved_timeval_set
> > will be true during the execution of hook emit_log_hook() which prefixes
> > the saved timestamp 'saved_timeval' from the previous log line (our hook
> > implementation calls log_line_prefix()).
> >
> > Attached patch sets the saved_timeval_set to false before executing the
> > emit_log_hook()
>
> Indeed.  If you rely on log_line_prefix() in your hook before the
> server side elog, the saved timestamp would not match with what could
> be computed in the follow-up send_message_to_server_log or
> send_message_to_frontend.
>
> Hmm.  Shouldn't we remove the forced reset of formatted_log_time for
> the 'm' case in log_status_format() and remove the reset done at the
> beginning of send_message_to_server_log()?  One problem with your
> patch is that we would still finish with a different saved_timeval in
> the hook and in send_message_to_server_log(), but we should do things
> so as the timestamps are the same for the whole duration of
> EmitErrorReport(), no?  It seems to me that a more correct solution
> would be to reset saved_timeval_set and formatted_log_time[0] once
> before the hook, at the beginning of EmitErrorReport().
> --
> Michael
>


0001-set-saved_timeval_set-to-false-before-executing-emit.patch
Description: Binary data


Re: Simplify backtrace_functions GUC code

2024-03-17 Thread Alvaro Herrera
On 2024-Mar-17, Bharath Rupireddy wrote:

> I think the code can be simplified a bit by using
> SplitIdentifierString like in the attached patch. With this,
> backtrace_function_list variable and assign_backtrace_functions() will
> go away.

Did you read the discussion that led to the current coding?  What part
of it is no longer valid, in such a way that you want to make the code
look like an approach that was rejected back then?

https://www.postgresql.org/message-id/flat/35beac83-bf15-9d79-05c4-2dccd0834993%402ndquadrant.com#4dc9ccec753c0d99369be9d53bf24476

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Make attstattarget nullable

2024-03-17 Thread Peter Eisentraut

On 14.03.24 15:46, Tomas Vondra wrote:

2) The newtarget handling in AlterStatistics seems rather confusing.
Why
does it get set to -1 just to ignore the value later? For a while I was
99% sure ALTER STATISTICS ... SET STATISTICS DEFAULT will set the field
to -1. Maybe ditching the first if block and directly checking
stmt->stxstattarget before setting repl_val/repl_null would be better?


But we also need to continue accepting -1 for default on input.  The
current code achieves that, the proposed variant would not.


OK, I did not realize that. But then maybe this should be explained in a
comment before the new "if" block, because people won't realize why it
needs to be this way.


In the new version, I tried to write this more explicitly, and updated
tablecmds.c to match.


WFM. It still seems a bit hard to read, but I don't know how to do it
better. I guess that's how it has to be to deal with multiple default
values in a backwards-compatible way. The good thing is that it's
localized in two places.


I have committed this patch series.  Thanks.
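
For context, both of the following request the default statistics target;
continuing to accept the historical -1 on input is the backwards-compatibility
requirement discussed above (the statistics object name is illustrative):

ALTER STATISTICS stats_name SET STATISTICS DEFAULT;
ALTER STATISTICS stats_name SET STATISTICS -1;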





Re: Improving EXPLAIN's display of SubPlan nodes

2024-03-17 Thread Dean Rasheed
On Sat, 16 Mar 2024 at 17:25, Tom Lane  wrote:
>
> After looking at your draft some more, it occurred to me that we're
> not that far from getting rid of showing "$n" entirely in this
> context.  Instead of (subplan_name).$n, we could write something like
> (subplan_name).colN or (subplan_name).columnN or (subplan_name).fN,
> depending on your taste for verboseness.  "fN" would correspond to the
> names we assign to columns of anonymous record types, but it hasn't
> got much else to recommend it.  In the attached I used "colN";
> "columnN" would be my next choice.

Using the column number rather than the parameter index looks a lot
neater, especially in output with multiple subplans. Of those choices,
"colN" looks nicest, however...

I think it would be confusing if there were tables whose columns are
named "colN". In that case, a SQL qual like "t1.col2 = t2.col2" might
be output as something like "t1.col2 = (SubPlan 1).col3", since the
subplan's targetlist wouldn't necessarily just output the table
columns in order.

I actually think "$n" is not so bad (especially if we make n the
column number). The fact that it's qualified by the subplan name ought
to be sufficient to avoid it being confused with an external
parameter. Maybe there are other options, but I think it's important
to choose something that's unlikely to be confused with a real column
name.

Whatever name is chosen, I think we should still output "(returns
...)" on the subplan nodes. In a complex query there might be a lot of
output columns, and the expressions might be quite complex, making it
hard to see how many columns the subplan is returning. Besides,
without that, it might not be obvious to everyone what "colN" (or
whatever we settle on) means in places that refer to the subplan.

> You could also imagine trying to use the sub-SELECT's actual output
> column names, but I fear that would be ambiguous --- too often it'd
> be "?column?" or some other non-unique name.

Yeah, I think that's a non-starter, because the output column names
aren't necessarily unique.
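
For instance, expression columns without aliases all get the same default
name, so a sub-select's output names can easily collide:

SELECT * FROM (SELECT 1+1, 2+2) AS s;
 ?column? | ?column?
----------+----------
        2 |        4
(1 row)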

> The attached proof-of-concept is incomplete: it fails to replace some
> $n occurrences with subplan references, as is visible in some of the
> test cases.  I believe your quick hack in get_parameter() is not
> correct in detail, but for the moment I didn't bother to debug it.

Yeah, that's exactly what it was, a quick hack. I just wanted to get
some output to see what it would look like in a few real cases.

Overall, I think this is heading in the right direction. I think we
just need a good way to say "the n'th output column of the subplan",
that can't be confused with anything else in the output.

Regards,
Dean




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-17 Thread Bharath Rupireddy
On Sat, Mar 16, 2024 at 3:55 PM Amit Kapila  wrote:
>
> procArray->replication_slot_catalog_xmin) but then don't adjust it for
> 'max_slot_xid_age'. I could be missing something in this but it is
> better to keep discussing this and try to move with another parameter
> 'inactive_replication_slot_timeout' which according to me can be kept
> at slot level instead of a GUC but OTOH we need to see the arguments
> on both side and then decide which makes more sense.

Hm. Are you suggesting making inactive_timeout a slot-level parameter,
similar to the 'failover' property added recently by
c393308b69d229b664391ac583b9e07418d411b6 and
73292404370c9900a96e2bebdc7144f7010339cf? With that approach, one could
set inactive_timeout while creating the slot, either via
pg_create_physical_replication_slot(),
pg_create_logical_replication_slot(), CREATE_REPLICATION_SLOT, or the
ALTER_REPLICATION_SLOT command; postgres would then track
last_inactive_at for every slot and invalidate a slot based on it. If
this understanding is right, I can go ahead and work towards it.

Alternatively, we could make the GUC a list of key-value pairs of
{slot_name, inactive_timeout}, but a GUC for setting slot-level
parameters would be the first of its kind, so I'd prefer the slot-level
approach.
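
A purely hypothetical sketch of the two options, just to make them concrete
(neither the named parameter nor the GUC exists yet; names and signatures are
assumptions):

-- slot-level parameter, set at creation time:
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput',
                                           inactive_timeout => '1h');

-- versus a GUC carrying per-slot key-value pairs:
-- inactive_replication_slot_timeout = 'my_slot=1h, other_slot=30min'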

Thoughts?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-17 Thread Thomas Munro
On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman
 wrote:
> I've rebased the attached v10 over top of the changes to
> lazy_scan_heap() Heikki just committed and over the v6 streaming read
> patch set. I started testing them and see that you are right, we no
> longer pin too many buffers. However, the uncached example below is
> now slower with streaming read than on master -- it looks to be
> because it is doing twice as many WAL writes and syncs. I'm still
> investigating why that is.

That makes sense to me.  We have 256kB of buffers in our ring, but now
we're trying to read ahead 128kB at a time, so it works out that we
can only flush the WAL accumulated while dirtying half the blocks at a
time, so we flush twice as often.

If I change the ring size to 384kB, allowing for that read-ahead
window, I see approximately the same WAL flushes.  Surely we'd never
be able to get the behaviour to match *and* keep the same ring size?
We simply need those 16 extra buffers to have a chance of accumulating
32 dirty buffers, and the associated WAL.  Do you see the same result,
or do you think something more than that is wrong here?
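
Spelling out the arithmetic (assuming the default 8kB block size):

256kB ring                = 32 buffers
128kB read-ahead window   = 16 buffers
=> only ~16 buffers can be dirty before the oldest must be written out,
   so WAL is flushed for ~16 pages at a time instead of 32 (twice as often)
256kB + 128kB = 384kB     = 48 buffers, which again leaves room for 32
   dirty buffers alongside the 16-buffer read-ahead window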

Here are some system call traces using your test that helped me see
the behaviour:

1. Unpatched (i.e., no streaming read): we flush 90kB of WAL generated by
32 pages before writing them out one at a time, just before we read in
their replacements.  One flush covers the LSNs of all the pages that
will be written, even though it's only called for the first page to be
written.  That's because XLogFlush(lsn), if it decides to do anything,
flushes as far as it can... IOW when we hit the *oldest* dirty block,
that's when we write out the WAL up to where we dirtied the *newest*
block, which covers the 32 pwrite() calls here:

pwrite(30,...,90112,0xf9) = 90112 (0x16000)
fdatasync(30) = 0 (0x0)
pwrite(27,...,8192,0x0) = 8192 (0x2000)
pread(27,...,8192,0x4) = 8192 (0x2000)
pwrite(27,...,8192,0x2000) = 8192 (0x2000)
pread(27,...,8192,0x42000) = 8192 (0x2000)
pwrite(27,...,8192,0x4000) = 8192 (0x2000)
pread(27,...,8192,0x44000) = 8192 (0x2000)
pwrite(27,...,8192,0x6000) = 8192 (0x2000)
pread(27,...,8192,0x46000) = 8192 (0x2000)
pwrite(27,...,8192,0x8000) = 8192 (0x2000)
pread(27,...,8192,0x48000) = 8192 (0x2000)
pwrite(27,...,8192,0xa000) = 8192 (0x2000)
pread(27,...,8192,0x4a000) = 8192 (0x2000)
pwrite(27,...,8192,0xc000) = 8192 (0x2000)
pread(27,...,8192,0x4c000) = 8192 (0x2000)
pwrite(27,...,8192,0xe000) = 8192 (0x2000)
pread(27,...,8192,0x4e000) = 8192 (0x2000)
pwrite(27,...,8192,0x1) = 8192 (0x2000)
pread(27,...,8192,0x5) = 8192 (0x2000)
pwrite(27,...,8192,0x12000) = 8192 (0x2000)
pread(27,...,8192,0x52000) = 8192 (0x2000)
pwrite(27,...,8192,0x14000) = 8192 (0x2000)
pread(27,...,8192,0x54000) = 8192 (0x2000)
pwrite(27,...,8192,0x16000) = 8192 (0x2000)
pread(27,...,8192,0x56000) = 8192 (0x2000)
pwrite(27,...,8192,0x18000) = 8192 (0x2000)
pread(27,...,8192,0x58000) = 8192 (0x2000)
pwrite(27,...,8192,0x1a000) = 8192 (0x2000)
pread(27,...,8192,0x5a000) = 8192 (0x2000)
pwrite(27,...,8192,0x1c000) = 8192 (0x2000)
pread(27,...,8192,0x5c000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e000) = 8192 (0x2000)
pread(27,...,8192,0x5e000) = 8192 (0x2000)
pwrite(27,...,8192,0x2) = 8192 (0x2000)
pread(27,...,8192,0x6) = 8192 (0x2000)
pwrite(27,...,8192,0x22000) = 8192 (0x2000)
pread(27,...,8192,0x62000) = 8192 (0x2000)
pwrite(27,...,8192,0x24000) = 8192 (0x2000)
pread(27,...,8192,0x64000) = 8192 (0x2000)
pwrite(27,...,8192,0x26000) = 8192 (0x2000)
pread(27,...,8192,0x66000) = 8192 (0x2000)
pwrite(27,...,8192,0x28000) = 8192 (0x2000)
pread(27,...,8192,0x68000) = 8192 (0x2000)
pwrite(27,...,8192,0x2a000) = 8192 (0x2000)
pread(27,...,8192,0x6a000) = 8192 (0x2000)
pwrite(27,...,8192,0x2c000) = 8192 (0x2000)
pread(27,...,8192,0x6c000) = 8192 (0x2000)
pwrite(27,...,8192,0x2e000) = 8192 (0x2000)
pread(27,...,8192,0x6e000) = 8192 (0x2000)
pwrite(27,...,8192,0x3) = 8192 (0x2000)
pread(27,...,8192,0x7) = 8192 (0x2000)
pwrite(27,...,8192,0x32000) = 8192 (0x2000)
pread(27,...,8192,0x72000) = 8192 (0x2000)
pwrite(27,...,8192,0x34000) = 8192 (0x2000)
pread(27,...,8192,0x74000) = 8192 (0x2000)
pwrite(27,...,8192,0x36000) = 8192 (0x2000)
pread(27,...,8192,0x76000) = 8192 (0x2000)
pwrite(27,...,8192,0x38000) = 8192 (0x2000)
pread(27,...,8192,0x78000) = 8192 (0x2000)
pwrite(27,...,8192,0x3a000) = 8192 (0x2000)
pread(27,...,8192,0x7a000) = 8192 (0x2000)
pwrite(27,...,8192,0x3c000) = 8192 (0x2000)
pread(27,...,8192,0x7c000) = 8192 (0x2000)
pwrite(27,...,8192,0x3e000) = 8192 (0x2000)
pread(27,...,8192,0x7e000) = 8192 (0x2000)

(Digression: this alternative tail-write-head-read pattern defeats the
read-ahead and write-behind on a bunch of OSes, but not Linux because
it only seems to worry about the reads, while other Unixes have
write-behind detection too, and I believe at least some are confused
by this pattern of tiny writes following along some distance behind
tiny reads; Andrew Gierth figured that out 

Simplify backtrace_functions GUC code

2024-03-17 Thread Bharath Rupireddy
Hi,

While working on [1], I noticed that the backtrace_functions GUC code
does its own string parsing and uses an extra variable,
backtrace_function_list, to store the processed form of the
backtrace_functions GUC.

I think the code can be simplified a bit by using
SplitIdentifierString like in the attached patch. With this,
backtrace_function_list variable and assign_backtrace_functions() will
go away.
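
As a reminder, backtrace_functions takes a comma-separated list of C function
names, e.g. (illustrative values):

SET backtrace_functions = 'ExecAggInitGroup, pg_re_throw';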

Thoughts?

[1] 
https://www.postgresql.org/message-id/CALj2ACXG1wzrb8%2B5HPNd8Qjr1h8GYkW-ijWhMYr2Y8_DzOB-%3Dg%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 0e07942e20076ad0f6b6ae62f9cc82a48d6971ba Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 17 Mar 2024 05:42:26 +
Subject: [PATCH v1] Simplify backtrace_functions GUC code

---
 src/backend/utils/error/elog.c  | 124 ++--
 src/backend/utils/misc/guc_tables.c |   4 +-
 src/include/utils/guc_hooks.h   |   1 -
 3 files changed, 65 insertions(+), 64 deletions(-)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 52bc01058c..40b2dfc00e 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -114,9 +114,6 @@ char	   *Log_destination_string = NULL;
 bool		syslog_sequence_numbers = true;
 bool		syslog_split_messages = true;
 
-/* Processed form of backtrace_functions GUC */
-static char *backtrace_function_list;
-
 #ifdef HAVE_SYSLOG
 
 /*
@@ -830,22 +827,42 @@ set_stack_entry_location(ErrorData *edata,
 static bool
 matches_backtrace_functions(const char *funcname)
 {
-	const char *p;
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
+	bool		is_parasing_okay PG_USED_FOR_ASSERTS_ONLY;
+
 
-	if (!backtrace_function_list || funcname == NULL || funcname[0] == '\0')
+	if (backtrace_functions == NULL || backtrace_functions[0] == '\0' ||
+		funcname == NULL || funcname[0] == '\0')
 		return false;
 
-	p = backtrace_function_list;
-	for (;;)
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(backtrace_functions);
+
+	/* Parse string into list of identifiers */
+	is_parasing_okay = SplitIdentifierString(rawstring, ',', );
+
+	/*
+	 * We already have parsed the list in check_backtrace_functions, just
+	 * assert the fact here.
+	 */
+	Assert(is_parasing_okay);
+
+	foreach(l, elemlist)
 	{
-		if (*p == '\0')			/* end of backtrace_function_list */
-			break;
+		char	   *tok = (char *) lfirst(l);
 
-		if (strcmp(funcname, p) == 0)
+		if (strcmp(tok, funcname) == 0)
+		{
+			pfree(rawstring);
+			list_free(elemlist);
 			return true;
-		p += strlen(p) + 1;
+		}
 	}
 
+	pfree(rawstring);
+	list_free(elemlist);
 	return false;
 }
 
@@ -2124,68 +2141,53 @@ DebugFileOpen(void)
 bool
 check_backtrace_functions(char **newval, void **extra, GucSource source)
 {
-	int			newvallen = strlen(*newval);
-	char	   *someval;
-	int			validlen;
-	int			i;
-	int			j;
+	char	   *rawstring;
+	List	   *elemlist;
+	ListCell   *l;
 
-	/*
-	 * Allow characters that can be C identifiers and commas as separators, as
-	 * well as some whitespace for readability.
-	 */
-	validlen = strspn(*newval,
-	  "0123456789_"
-	  "abcdefghijklmnopqrstuvwxyz"
-	  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
-	  ", \n\t");
-	if (validlen != newvallen)
+	/* Need a modifiable copy of string */
+	rawstring = pstrdup(*newval);
+
+	/* Parse string into list of identifiers */
+	if (!SplitIdentifierString(rawstring, ',', ))
 	{
-		GUC_check_errdetail("Invalid character");
+		/* syntax error in list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawstring);
+		list_free(elemlist);
 		return false;
 	}
 
-	if (*newval[0] == '\0')
+	foreach(l, elemlist)
 	{
-		*extra = NULL;
-		return true;
-	}
+		char	   *tok = (char *) lfirst(l);
+		int			toklen = strlen(tok);
+		int			validlen;
 
-	/*
-	 * Allocate space for the output and create the copy.  We could discount
-	 * whitespace chars to save some memory, but it doesn't seem worth the
-	 * trouble.
-	 */
-	someval = guc_malloc(ERROR, newvallen + 1 + 1);
-	for (i = 0, j = 0; i < newvallen; i++)
-	{
-		if ((*newval)[i] == ',')
-			someval[j++] = '\0';	/* next item */
-		else if ((*newval)[i] == ' ' ||
- (*newval)[i] == '\n' ||
- (*newval)[i] == '\t')
-			;	/* ignore these */
-		else
-			someval[j++] = (*newval)[i];	/* copy anything else */
-	}
+		/*
+		 * Allow characters that can be C identifiers, commas as separators,
+		 * the wildcard symbol, as well as some whitespace for readability.
+		 */
+		validlen = strspn(tok,
+		  "0123456789_"
+		  "abcdefghijklmnopqrstuvwxyz"
+		  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
+		  ", \n\t");
 
-	/* two \0s end the setting */
-	someval[j] = '\0';
-	someval[j + 1] = '\0';
+		if (validlen != toklen)
+		{
+			GUC_check_errdetail("Invalid character");
+			pfree(rawstring);
+			list_free(elemlist);
+			return false;
+		}
+	}
 
-	*extra = someval;
+	pfree(rawstring);
+	

Re: clarify equalTupleDescs()

2024-03-17 Thread Peter Eisentraut

On 13.03.24 19:43, Tomas Vondra wrote:

I looked at this patch today. I went through all the calls switched to
equalRowTypes, and AFAIK all of them are correct - all the places
switched to equalRowTypes() only need the weaker checks.

There are only two places still calling equalTupleDescs() - the relcache
certainly needs it, and so does the assert in execReplication().

As for attndims, I agree equalRowTypes() should not check that. We're
not really checking that anywhere, it'd be quite weird to start with it
here. Especially if the plan is to remove it entirely.


Thanks for checking this again.  I have committed the patch as it was 
presented then.






Re: Switching XLog source from archive to streaming when primary available

2024-03-17 Thread Bharath Rupireddy
On Wed, Mar 6, 2024 at 9:49 PM Nathan Bossart  wrote:
>
> > I played with that idea and it came out very nice. Please see the
> > attached v22 patch. Note that personally I didn't like "FORCE" being
> > there in the names, so I've simplified them a bit.
>
> Thanks.  I'd like to spend some time testing this, but from a glance, the
> code appears to be in decent shape.

Rebase needed after 071e3ad59d6fd2d6d1277b2bd9579397d10ded28 due to a
conflict in meson.build. Please see the attached v23 patch.
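
As a reminder of how the proposed GUC would be used (the parameter exists
only in this patch; the value is illustrative):

# postgresql.conf on the standby
streaming_replication_retry_interval = 1min    # default 0 disables the switch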

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From b451e18b7b1f91eb3a6857efe552e7fe97cc7f39 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 17 Mar 2024 05:48:38 +
Subject: [PATCH v23] Allow standby to switch WAL source from archive to
 streaming

A standby typically switches to streaming replication (get WAL
from primary), only when receive from WAL archive finishes (no
more WAL left there) or fails for any reason. Reading WAL from
archive may not always be as efficient and fast as reading from
primary. This can be due to the differences in disk types, IO
costs, network latencies etc.. All of these can impact the
recovery performance on standby and increase the replication lag
on primary. In addition, the primary keeps accumulating WAL needed
for the standby while the standby reads WAL from archive because
the standby replication slot stays inactive. To avoid these
problems, one can use this parameter to make standby switch to
stream mode sooner.

This feature adds a new GUC that specifies amount of time after
which standby attempts to switch WAL source from WAL archive to
streaming replication (getting WAL from primary). However, standby
exhausts all the WAL present in pg_wal before switching. If
standby fails to switch to stream mode, it falls back to archive
mode.

Author: Bharath Rupireddy
Reviewed-by: Cary Huang, Nathan Bossart
Reviewed-by: Kyotaro Horiguchi, SATYANARAYANA NARLAPURAM
Discussion: https://www.postgresql.org/message-id/CAHg+QDdLmfpS0n0U3U+e+dw7X7jjEOsJJ0aLEsrtxs-tUyf5Ag@mail.gmail.com
---
 doc/src/sgml/config.sgml  |  49 
 doc/src/sgml/high-availability.sgml   |  15 ++-
 src/backend/access/transam/xlogrecovery.c | 117 --
 src/backend/utils/misc/guc_tables.c   |  12 ++
 src/backend/utils/misc/postgresql.conf.sample |   3 +
 src/include/access/xlogrecovery.h |   1 +
 src/test/recovery/meson.build |   1 +
 src/test/recovery/t/043_wal_source_switch.pl  | 113 +
 8 files changed, 296 insertions(+), 15 deletions(-)
 create mode 100644 src/test/recovery/t/043_wal_source_switch.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 65a6e6c408..40c2ae93d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5050,6 +5050,55 @@ ANY num_sync ( 
+  streaming_replication_retry_interval (integer)
+  
+   streaming_replication_retry_interval configuration parameter
+  
+  
+  
+   
+Specifies amount of time after which standby attempts to switch WAL
+source from archive to streaming replication (i.e., getting WAL from
+primary). However, the standby exhausts all the WAL present in
+pg_wal directory before switching. If the standby
+fails to switch to stream mode, it falls back to archive mode. If this
+parameter value is specified without units, it is taken as seconds.
+With a lower value for this parameter, the standby makes frequent WAL
+source switch attempts. To avoid this, it is recommended to set a
+value depending on the rate of WAL generation on the primary. If the
+WAL files grow faster, the disk runs out of space sooner, so setting a
+value to make frequent WAL source switch attempts can help. The default
+is zero, disabling this feature. When disabled, the standby typically
+switches to stream mode only after receiving WAL from archive finishes
+(i.e., no more WAL left there) or fails for any reason. This parameter
+can only be set in the postgresql.conf file or on
+the server command line.
+   
+   
+
+ Standby may not always attempt to switch source from WAL archive to
+ streaming replication at exact
+ streaming_replication_retry_interval intervals. For
+ example, if the parameter is set to 1min and
+ fetching WAL file from archive takes about 2min,
+ then the source switch attempt happens for the next WAL file after
+ the current WAL file fetched from archive is fully applied.
+
+   
+   
+Reading WAL from archive may not always be as efficient and fast as
+reading from primary. This can be due to the differences in disk types,
+IO costs, network latencies etc. All of these can impact the recovery
+