Re: Typo about subxip in comments

2022-11-10 Thread Amit Kapila
On Fri, Nov 11, 2022 at 12:16 PM Richard Guo  wrote:
>
> On Fri, Nov 11, 2022 at 11:26 AM Japin Li  wrote:
>>
>> Recently, while reading XidInMVCCSnapshot(), I found some typos in the
>> comments.
>
>
> Hmm, it seems to me 'the subxact array' is just another way to refer to
> snapshot->subxip.  I'm not sure this is a typo.  But I have no objection to
> this change, as it is more consistent with the 'xip array' wording used
> later.
>

Agreed, it is more about being consistent with the xip array.

-- 
With Regards,
Amit Kapila.




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-10 Thread Amit Kapila
On Fri, Nov 11, 2022 at 7:57 AM houzj.f...@fujitsu.com
 wrote:
>
> On Monday, November 7, 2022 6:18 PM Masahiko Sawada  
> wrote:
> >
> > Here are comments on v42-0001:
> >
> > We have the following three similar name functions regarding to
> > starting a new parallel apply worker:
> > ---
> >  /*
> >   * Exit if any parameter that affects the remote connection was changed.
> > - * The launcher will start a new worker.
> > + * The launcher will start a new worker, but note that the parallel apply
> > + * worker may or may not restart depending on the value of the streaming
> > + * option and whether there will be a streaming transaction.
> >
> > In which case doesn't the parallel apply worker restart even if the
> > streaming option has been changed?
>
> Sorry, I forgot to reply to this comment. If the user changes the streaming
> option from 'parallel' to 'on' or 'off', the parallel apply workers won't be
> restarted.
>

How about something like the below so as to be more explicit about
this in the comments?
diff --git a/src/backend/replication/logical/worker.c
b/src/backend/replication/logical/worker.c
index bfe326bf0c..74cd5565bd 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -3727,9 +3727,10 @@ maybe_reread_subscription(void)

/*
 * Exit if any parameter that affects the remote connection was changed.
-* The launcher will start a new worker, but note that the parallel apply
-* worker may or may not restart depending on the value of the streaming
-* option and whether there will be a streaming transaction.
+* The launcher will start a new worker but note that the parallel apply
+* worker won't restart if the streaming option's value is changed from
+* 'parallel' to any other value or the server decides not to stream the
+* in-progress transaction.
 */

-- 
With Regards,
Amit Kapila.




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread Michael Paquier
On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> On Fri, Nov 11, 2022 at 11:14 AM John Naylor  
> wrote:
>> Was there supposed to be an attachment here?
> 
> Nope. The patches have already been committed -
> 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> 28cc2976a9cf0ed661dbc55f49f669192cce1c89.

The committed patches are pretty much the same as the last version
sent on this thread, except that the changes have been split across
the files they locally impact, with a few simplification tweaks to
the comments.  Hence, I did not see any need to send a new version for
this case.
--
Michael




Re: Typo about subxip in comments

2022-11-10 Thread Richard Guo
On Fri, Nov 11, 2022 at 11:26 AM Japin Li  wrote:

> Recently, while reading XidInMVCCSnapshot(), I found some typos in the
> comments.


Hmm, it seems to me 'the subxact array' is just another way to refer to
snapshot->subxip.  I'm not sure this is a typo.  But I have no objection to
this change, as it is more consistent with the 'xip array' wording used
later.

Thanks
Richard


Re: Reviving lost replication slots

2022-11-10 Thread Bharath Rupireddy
On Thu, Nov 10, 2022 at 4:12 PM sirisha chamarthi
 wrote:
>
> On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi  
> wrote:
>>
>> I don't think walsenders fetching segments from the archive is totally
>> stupid. With that feature, we can use fast and expensive but small
>> storage for pg_wal, while avoiding replication from dying even in an
>> emergency.
>
> Thanks! If there is general agreement on this in this forum, I would like
> to start working on this patch.

I think starting with establishing/summarizing the problem, design
approaches, implications etc. is a better idea than a patch. It might
invite more thoughts from the hackers.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread Bharath Rupireddy
On Fri, Nov 11, 2022 at 11:14 AM John Naylor
 wrote:
>
> On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier  wrote:
> >
> > So, I have looked at that, and at the end concluded that Andres'
> > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> > in the long run.  Thomas has mentioned upthread that some of the
> > comments don't need to be that long, so I have tweaked these to be
> > minimal, and updated a few more areas.  Note that this has been split
> > into two commits: one to introduce the new routine in file_utils.c and
> > a second for the switch in walmethods.c.
>
> Was there supposed to be an attachment here?

Nope. The patches have already been committed -
3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
28cc2976a9cf0ed661dbc55f49f669192cce1c89.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: Data is copied twice when specifying both child and parent table in publication

2022-11-10 Thread wangw.f...@fujitsu.com
On Tues, Nov 8, 2022 at 12:12 PM Osumi, Takamichi/大墨 昂道 
 wrote:
> On Monday, October 17, 2022 2:49 PM Wang, Wei/王 威
>  wrote:
> > Attach the new patch set.
> Hi, thank you for posting the new patches.
> 
> 
> Here are minor comments on the HEAD_v13-0002.

Thanks for your comments.

> (1) Suggestion for the document description
> 
> + 
> +  If a root partitioned table is published by any subscribed publications
> +  which set publish_via_partition_root = true, changes on this root
> +  partitioned table (or on its partitions) will be published using the
> +  identity and schema of this root partitioned table rather than that of
> +  the individual partitions.
> + 
> +
> 
> I suppose this sentence looks quite similar to the one in the previous 
> paragraph
> and can be adjusted.
> 
> IIUC the main value of the patch is to clarify what happens when
> we mix publications of different publish_via_partition_root settings for one
> partition hierarchy.
> If this is true, how about below sentence instead of the one above ?
> 
> "
> There can be a case where a subscription combines publications with
> different publish_via_partition_root values for the same partition hierarchy
> (e.g. subscribing to two publications that publish the root partitioned
> table and its child table respectively).
> In this case, the identity and schema of the root partitioned table take
> priority.
> "

Thanks for your suggestion.
I agree that we should mention that this description is for a case where one
subscription subscribes to multiple publications. And I think it would be
better if we mentioned that the option publish_via_partition_root is specified
on a publication that publishes a root partitioned table. So I added the
description of this case as you suggested.

> (2) Better documentation alignment
> 
> I think we need to wrap publish_via_partition_root in a "literal" tag
> in the documentation create_publication.sgml.

Improved.

The new patch set was attached in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275FB5397C6A647F262A3A69E009%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


RE: Data is copied twice when specifying both child and parent table in publication

2022-11-10 Thread wangw.f...@fujitsu.com
On Fri, Oct 21, 2022 at 17:02 PM Peter Smith  wrote:
> Here are my review comments for HEAD patches v13*

Thanks for your comments.
 
> Patch HEAD_v13-0002
> 
> 1. Commit message
> 
> The following usage scenarios are not described in detail in the manual:
> If one subscription subscribes multiple publications, and these publications
> publish a partitioned table and its partitions respectively. When we specify
> this parameter on one or more of these publications, which identity and schema
> should be used to publish the changes?
> 
> In these cases, I think the parameter publish_via_partition_root behave as
> follows:
> 
> ~
> 
> It seemed worded a bit strangely. Also, you said "on one or more of
> these publications" but the examples show only one publication
> having 'publish_via_partition_root'.
> 
> SUGGESTION (I've modified the wording slightly but the examples are
> unchanged).
> 
> Assume a subscription is subscribing to multiple publications, and
> these publications publish a partitioned table and its partitions
> respectively:
> 
> [publisher-side]
> create table parent (a int primary key) partition by range (a);
> create table child partition of parent default;
> 
> create publication pub1 for table parent;
> create publication pub2 for table child;
> 
> [subscriber-side]
> create subscription sub connection '' publication pub1, pub2;
> 
> The manual does not clearly describe the behaviour when the user had
> specified the parameter 'publish_via_partition_root' on just one of
> the publications. This patch modifies documentation to clarify the
> following rules:
> 
> - If the parameter publish_via_partition_root is specified only in pub1,
> changes will be published using the identity and schema of the table 'parent'.
> 
> - If the parameter publish_via_partition_root is specified only in pub2,
> changes will be published using the identity and schema of the table 'child'.

Improved as suggested.

> ~~~
> 
> 2.
> 
> - If the parameter publish_via_partition_root is specified only in pub2,
> changes will be published using the identity and schema of the table child.
> 
> ~
> 
> Is that right though? This rule seems 100% contrary to the meaning of
> 'publish_via_partition_root=true'.

Yes, I think this behaviour fits the meaning of publish_via_partition_root.
Please refer to this description in the document:
```
This parameter determines whether changes in a partitioned table (or on its
partitions) contained in the publication will be published ...
```

So I think for 'publish_via_partition_root' to work, the partitioned table must
be specified in this publication.

Since only one member (partition table 'child') of this partition tree
('parent', 'child') is specified in 'pub2', even if 'pub2' specifies the
parameter 'publish_via_partition_root', 'pub2' will publish changes using the
identity and schema of the table 'child'.

> --
> 
> 3. doc/src/sgml/ref/create_publication.sgml
> 
> + 
> +  If a root partitioned table is published by any subscribed publications
> +  which set publish_via_partition_root = true, changes on this root
> +  partitioned table (or on its partitions) will be published using the
> +  identity and schema of this root partitioned table rather than that of
> +  the individual partitions.
> + 
> 
> This seems to only describe the first example from the commit message.
> What about some description to explain the second example?

I think the second example is already described in the pg-doc (please see the
reply to #2). I am not quite sure if additional modifications are required. Do
you have any suggestions?

The new patch set was attached in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275FB5397C6A647F262A3A69E009%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread John Naylor
On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier  wrote:
>
> So, I have looked at that, and at the end concluded that Andres'
> suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> in the long run.  Thomas has mentioned upthread that some of the
> comments don't need to be that long, so I have tweaked these to be
> minimal, and updated a few more areas.  Note that this has been split
> into two commits: one to introduce the new routine in file_utils.c and
> a second for the switch in walmethods.c.

Was there supposed to be an attachment here?

--
John Naylor
EDB: http://www.enterprisedb.com


RE: Data is copied twice when specifying both child and parent table in publication

2022-11-10 Thread wangw.f...@fujitsu.com
On Fri, Oct 21, 2022 at 17:02 PM Peter Smith  wrote:
>

Thanks for your comments. Sorry for not replying sooner.

> On Mon, Oct 17, 2022 at 4:49 PM wangw.f...@fujitsu.com
>  wrote:
> >
> > On Wed, Oct 5, 2022 at 11:08 AM Peter Smith 
> wrote:
> > > Hi Wang-san. Here are my review comments for HEAD_v12-0001 patch.
> >
> ...
> > >
> > > 3. QUESTION - pg_get_publication_tables / fetch_table_list
> > >
> > > When the same table is published by different publications (but there
> > > are other differences like row-filters/column-lists in each
> > > publication) the result tuple of this function does not include the
> > > pubid. Maybe the SQL of pg_publication_tables/fetch_table_list() is OK
> > > as-is but how does it manage to associate each table with the correct
> > > tuple?
> > >
> > > I know it apparently all seems to work but I'm not sure how that
> > > happens. Can you explain why a puboid is not needed for the result
> > > tuple of this function?
> >
> > Sorry, I am not sure I understand your question.
> > I'll try to answer your question by explaining the two functions you mentioned:
> >
> > First, the function pg_get_publication_tables gets the list (see
> > table_infos) that includes the published tables and the corresponding
> > publications. Then based on this list, the function
> > pg_get_publication_tables returns information (schema, relname, row filter
> > and column list) about the published tables in the publications list. It
> > just doesn't return the pubid.
> >
> > Then, the SQL in the function fetch_table_list will get the columns in the
> > column list from pg_attribute. (This is to return all columns when the 
> > column
> > list is not specified)
> >
> 
> I meant, for example, if the different publications specified
> different col-lists for the same table, then IIUC
> fetch_table_list() is going to return 2 list elements
> (schema, rel_name, row_filter, col_list). But when the schema/rel_name
> are the same for 2 elements, then (without also a pubid) how are you
> going to know where each list element came from, and why is that
> not important?
> 
> > > ~~
> > >
> > > test_pub=# create table t1(a int, b int, c int);
> > > CREATE TABLE
> > > test_pub=# create publication pub1 for table t1(a) where (a > 99);
> > > CREATE PUBLICATION
> > > test_pub=# create publication pub2 for table t1(a,b) where (b < 33);
> > > CREATE PUBLICATION
> > >
> > > Following seems OK when I swap orders of publication names...
> > >
> > > test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
> > > gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
> > > ARRAY['pub2','pub1']) gpt(relid, attrs, qual);
> > >  relid | attrs | rowfilter
> > > ---+---+---
> > >  16385 | 1 2   | (b < 33)
> > >  16385 | 1 | (a > 99)
> > > (2 rows)
> > >
> > > test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
> > > gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
> > > ARRAY['pub1','pub2']) gpt(relid, attrs, qual);
> > >  relid | attrs | rowfilter
> > > ---+---+---
> > >  16385 | 1 | (a > 99)
> > >  16385 | 1 2   | (b < 33)
> > > (2 rows)
> > >
> > > But what about this (this is similar to the SQL fragment from
> > > fetch_table_list); I swapped the pub names but the results are the
> > > same...
> > >
> > > test_pub=# SELECT pg_get_publication_tables(VARIADIC
> > > array_agg(p.pubname)) from pg_publication p where pubname
> > > IN('pub2','pub1');
> > >
> > >  pg_get_publication_tables
> > >
> > > ----------------------------------------------------------------------
> > >  (16385,1,"{OPEXPR :opno 521 :opfuncid 147 :opresulttype 16 :opretset
> > > false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 1
> > > :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnosyn 1
> > > :varattnosyn 1 :location 47} {CONST :consttype 23 :consttypmod -1
> > > :constcollid 0 :constlen 4 :constbyval true :constisnull false
> > > :location 51 :constvalue 4 [ 99 0 0 0 0 0 0 0 ]}) :location 49}")
> > >  (16385,"1 2","{OPEXPR :opno 97 :opfuncid 66 :opresulttype 16
> > > :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1
> > > :varattno 2 :vartype 23 :vartypmod -1 :varcollid 0 :varlevelsup 0
> > > :varnosyn 1 :varattnosyn 2 :location 49} {CONST :consttype 23
> > > :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true
> > > :constisnull false :location 53 :constvalue 4 [ 33 0 0 0 0 0 0 0 ]})
> > > :location 51}")
> > > (2 rows)
> > >
> > > test_pub=# SELECT pg_get_publication_tables(VARIADIC
> > > array_agg(p.pubname)) from pg_publication p where pubname
>

Re: Reviving lost replication slots

2022-11-10 Thread Amit Kapila
On Thu, Nov 10, 2022 at 4:07 PM sirisha chamarthi
 wrote:
>
> On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila  wrote:
>>
>> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>>  wrote:
>> >
>> Was the intent of setting restart_lsn to InvalidXLogRecPtr to
>> disallow reviving the slot?
>> >
>>
>> I think the intent is to compute the correct value for
>> replicationSlotMinLSN as we use restart_lsn for it and using the
>> invalidated slot's restart_lsn value for it doesn't make sense.
>
>
>  Correct. If a slot is invalidated (lost), then shouldn't we ignore the slot
> when computing the catalog_xmin?  I don't see it being set to
> InvalidTransactionId in ReplicationSlotsComputeRequiredXmin. I attached a
> small patch to address this; the output after the patch is shown below.
>

I think you forgot to attach the patch. However, I suggest you start a
separate thread for this because the patch you are talking about here
seems to be for an existing problem.

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2022-11-10 Thread Peter Smith
On Fri, Nov 11, 2022 at 4:17 PM Peter Smith  wrote:
>
> On Fri, Nov 11, 2022 at 4:09 PM Peter Smith  wrote:
> >
> > On Fri, Nov 11, 2022 at 3:47 PM Peter Smith  wrote:
> > >
> > > Here are more review comments for the v32-0001 file ddl_deparse.c
> > >
> > > *** NOTE - my review post became too big, so I split it into smaller 
> > > parts.
> >
>

THIS IS PART 4 OF 4.

===

src/backend/commands/ddl_deparse.c

99. deparse_CreateUserMappingStmt

+ /*
+ * Lookup up object in the catalog, so we don't have to deal with
+ * current_user and such.
+ */

Typo "Lookup up"

~

100.

+ createStmt = new_objtree("CREATE USER MAPPING ");
+
+ append_object_object(createStmt, "FOR %{role}R",
new_objtree_for_role_id(form->umuser));
+
+ append_string_object(createStmt, "SERVER %{server}I", server->servername);

All this can be combined into a single VA() function call.
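
For example, a rough sketch using the same VA() conventions as elsewhere in
this patch (untested, and reusing the variables from the quoted code):

createStmt = new_objtree_VA("CREATE USER MAPPING FOR %{role}R SERVER %{server}I", 2,
"role", ObjTypeObject, new_objtree_for_role_id(form->umuser),
"server", ObjTypeString, server->servername);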

~

101.

+ /* add an OPTIONS clause, if any */

Uppercase comment.

--

102. deparse_AlterUserMappingStmt

+ /*
+ * Lookup up object in the catalog, so we don't have to deal with
+ * current_user and such.
+ */
+
+ tp = SearchSysCache1(USERMAPPINGOID, ObjectIdGetDatum(objectId));

102a.
Typo "Lookup up"

~

102b.
Unnecessary blank line.

~

103.

+ alterStmt = new_objtree("ALTER USER MAPPING");
+
+ append_object_object(alterStmt, "FOR %{role}R",
new_objtree_for_role_id(form->umuser));
+
+ append_string_object(alterStmt, "SERVER %{server}I", server->servername);

Can be combined into a single VA() function call.

~

104.
+ /* add an OPTIONS clause, if any */

Uppercase comment

--

105. deparse_AlterStatsStmt

+ alterStat = new_objtree("ALTER STATISTICS");
+
+ /* Lookup up object in the catalog */
+ tp = SearchSysCache1(STATEXTOID, ObjectIdGetDatum(objectId));
+ if (!HeapTupleIsValid(tp))
+ elog(ERROR, "cache lookup failed for statistic %u", objectId);
+
+ statform = (Form_pg_statistic_ext) GETSTRUCT(tp);
+
+ append_object_object(alterStat, "%{identity}D",
+ new_objtree_for_qualname(statform->stxnamespace,
+   NameStr(statform->stxname)));
+
+ append_float_object(alterStat, "SET STATISTICS %{target}n",
node->stxstattarget);

105a.
It was a bit unusual to put the new_objtree() call even before the catalog lookup.

~

105b.
All the new_objtree and append_XXX calls can be combined into a single VA()
function call here.

--

106. deparse_RefreshMatViewStmt

+ refreshStmt = new_objtree_VA("REFRESH MATERIALIZED VIEW", 0);
+
+ /* Add a CONCURRENTLY clause */
+ append_string_object(refreshStmt, "%{concurrently}s",
+ node->concurrent ? "CONCURRENTLY" : "");
+ /* Add the matview name */
+ append_object_object(refreshStmt, "%{identity}D",
+ new_objtree_for_qualname_id(RelationRelationId,
+ objectId));
+ /* Add a WITH NO DATA clause */
+ tmp = new_objtree_VA("WITH NO DATA", 1,
+ "present", ObjTypeBool,
+ node->skipData ? true : false);
+ append_object_object(refreshStmt, "%{with_no_data}s", tmp);

106a.
Don't use VA() function for 0 args.

~

106b.
Huh? There are 2 different implementation styles here for the optional clauses
- CONCURRENTLY just replaces with an empty string
- WITH NO DATA - has a new ObjTree that is either present/not present

~

106c.
Most/all of this can be combined into a single VA call.

--

107. deparse_DefElem

+ set = new_objtree("");
+ optname = new_objtree("");
+
+ if (elem->defnamespace != NULL)
+ append_string_object(optname, "%{schema}I.", elem->defnamespace);
+
+ append_string_object(optname, "%{label}I", elem->defname);
+
+ append_object_object(set, "%{label}s", optname);

The set should be created *after* the optname, and then it can be done
something like:

set = new_objtree_VA("%{label}s", 1, "label", ObjTypeObject, optname);

~

108.

+ append_string_object(set, " = %{value}L",
+ elem->arg ? defGetString(elem) :
+ defGetBoolean(elem) ? "TRUE" : "FALSE");

The calling code does not need to prefix the format with spaces like
this. The append_XXX will handle space separators automatically.

--

109. deparse_drop_command

+ stmt = new_objtree_VA(fmt, 1, "objidentity", ObjTypeString, identity);
+ stmt2 = new_objtree_VA("CASCADE", 1,
+"present", ObjTypeBool, behavior == DROP_CASCADE);
+
+ append_object_object(stmt, "%{cascade}s", stmt2);

109a.
'stmt2' is a poor name. "CASCADE" is not a statement. Even 'tmpobj'
would be better here.

~

109b.
The 2 lines of cascade should be grouped together -- i.e. the blank
line should be *above* the "CASCADE", not below it.

--

110. deparse_FunctionSet

+ obj = new_objtree("RESET");
+ append_string_object(obj, "%{set_name}I", name);

This can be combined as a single VA() call with a format "RESET %{set_name}I".

~

111.

+ if (kind == VAR_RESET_ALL)
+ {
+ obj = new_objtree("RESET ALL");
+ }
+ else if (value != NULL)


It seems a bit strange that the decision is sometimes judged by the
*value*. Why isn't this just decided according to the different
VariableSetKind values (e.g. VAR_SET_VALUE)?

--

112. deparse_IndexStmt

+ indexStmt = new_objtree("CREATE");
+
+ append_string_object(indexStmt, "%{unique}s",
+ 

Re: Typo about subxip in comments

2022-11-10 Thread Julien Rouhaud
On Fri, Nov 11, 2022 at 10:39:06AM +0530, Amit Kapila wrote:
> On Fri, Nov 11, 2022 at 8:56 AM Japin Li  wrote:
> >
> > Recently, while reading XidInMVCCSnapshot(), I found some typos in the
> > comments.
> >
> > diff --git a/src/backend/storage/ipc/procarray.c 
> > b/src/backend/storage/ipc/procarray.c
> > index 207c4b27fd..9e8b6756fe 100644
> > --- a/src/backend/storage/ipc/procarray.c
> > +++ b/src/backend/storage/ipc/procarray.c
> > @@ -2409,7 +2409,7 @@ GetSnapshotData(Snapshot snapshot)
> >  * We could try to store xids into xip[] first and then 
> > into subxip[]
> >  * if there are too many xids. That only works if the 
> > snapshot doesn't
> >  * overflow because we do not search subxip[] in that case. 
> > A simpler
> > -* way is to just store all xids in the subxact array 
> > because this is
> > +* way is to just store all xids in the subxip array 
> > because this is
> >  * by far the bigger array. We just leave the xip array 
> > empty.
> >  *
> >  * Either way we need to change the way XidInMVCCSnapshot() 
> > works
> > diff --git a/src/backend/utils/time/snapmgr.c 
> > b/src/backend/utils/time/snapmgr.c
> > index f1f2ddac17..2524b1c585 100644
> > --- a/src/backend/utils/time/snapmgr.c
> > +++ b/src/backend/utils/time/snapmgr.c
> > @@ -2345,7 +2345,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot 
> > snapshot)
> > else
> > {
> > /*
> > -* In recovery we store all xids in the subxact array 
> > because it is by
> > +* In recovery we store all xids in the subxip array 
> > because it is by
> >  * far the bigger array, and we mostly don't know which 
> > xids are
> >  * top-level and which are subxacts. The xip array is empty.
> >  *
> >
> 
> LGTM.

+1




Re: Support logical replication of DDLs

2022-11-10 Thread Peter Smith
On Fri, Nov 11, 2022 at 4:09 PM Peter Smith  wrote:
>
> On Fri, Nov 11, 2022 at 3:47 PM Peter Smith  wrote:
> >
> > Here are more review comments for the v32-0001 file ddl_deparse.c
> >
> > *** NOTE - my review post became too big, so I split it into smaller parts.
>

THIS IS PART 3 OF 4.

===

src/backend/commands/ddl_deparse.c

50. deparse_DefineStmt_Operator

+/*
+ * Deparse a DefineStmt (CREATE OPERATOR)
+ *
+ * Given a trigger OID and the parse tree that created it, return an ObjTree
+ * representing the creation command.
+ */
+static ObjTree *
+deparse_DefineStmt_Operator(Oid objectId, DefineStmt *define)

"trigger OID" ?? Is that right?

~

51.

/* MERGES */
tmpobj = new_objtree_VA("MERGES", 1,
"clause", ObjTypeString, "merges");
if (!oprForm->oprcanmerge)
append_bool_object(tmpobj, "present", false);
list = lappend(list, new_object_object(tmpobj));

/* HASHES */
tmpobj = new_objtree_VA("HASHES", 1,
"clause", ObjTypeString, "hashes");
if (!oprForm->oprcanhash)
append_bool_object(tmpobj, "present", false);
list = lappend(list, new_object_object(tmpobj));


Maybe HASHES and MERGES should be done in a different order, just to
be consistent with the PG documentation [2].

--

52. deparse_DefineStmt_Type

+ /* Shortcut processing for shell types. */
+ if (!typForm->typisdefined)
+ {
+ stmt = new_objtree_VA("CREATE TYPE", 0);
+ append_object_object(stmt, "%{identity}D",
+ new_objtree_for_qualname(typForm->typnamespace,
+   NameStr(typForm->typname)));
+ append_bool_object(stmt, "present", true);
+ ReleaseSysCache(typTup);
+ return stmt;
+ }
+
+ stmt = new_objtree_VA("CREATE TYPE", 0);
+ append_object_object(stmt, "%{identity}D",
+ new_objtree_for_qualname(typForm->typnamespace,
+   NameStr(typForm->typname)));
+ append_bool_object(stmt, "present", true);

52a.
This code looked strange because everything is the same except the
Release/return, so IMO it should be refactored to use the common code.

~

52b.
The VA(0 args) should be combined with the subsequent appends to use
fewer append_XXX calls.

~

53.
Is it necessary to say append_bool_object(stmt, "present", true); ? --
I'd assumed that is the default unless it explicitly says false.

~

54.

/* INPUT */
tmp = new_objtree_VA("(INPUT=", 1,
 "clause", ObjTypeString, "input");
append_object_object(tmp, "%{procedure}D",
 new_objtree_for_qualname_id(ProcedureRelationId,
 typForm->typinput));
list = lappend(list, new_object_object(tmp));

/* OUTPUT */
tmp = new_objtree_VA("OUTPUT=", 1,
 "clause", ObjTypeString, "output");
append_object_object(tmp, "%{procedure}D",
 new_objtree_for_qualname_id(ProcedureRelationId,
 typForm->typoutput));
list = lappend(list, new_object_object(tmp));

These could each be simplified into single VA() function calls, the
same as was done in deparse_DefineStmt_Operator PROCEDURE.

And the same comment applies to other parts. e.g.:
- /* CATEGORY */
- /* ALIGNMENT */
- STORAGE
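
For instance, the INPUT case might collapse to something like this (sketch
only, reusing the variables from the quoted code):

tmp = new_objtree_VA("(INPUT=%{procedure}D", 2,
"clause", ObjTypeString, "input",
"procedure", ObjTypeObject,
new_objtree_for_qualname_id(ProcedureRelationId, typForm->typinput));
list = lappend(list, new_object_object(tmp));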

~

55.

+ tmp = new_objtree_VA("STORAGE=", 1,
+ "clause", ObjTypeString, "storage");

Missing comment above this to say /* STORAGE */

~

56.

+ /* INTERNALLENGTH */
+ if (typForm->typlen == -1)
+ {
+ tmp = new_objtree_VA("INTERNALLENGTH=VARIABLE", 0);
+ }
+ else
+ {
+ tmp = new_objtree_VA("INTERNALLENGTH=%{typlen}n", 1,
+ "typlen", ObjTypeInteger, typForm->typlen);
+ }

56a.
The VA(args = 0) does not need to be a VA function.

~

56b.
The { } blocks are unnecessary

--

57. deparse_DefineStmt_TSConfig

+
+static ObjTree *
+deparse_DefineStmt_TSConfig(Oid objectId, DefineStmt *define,
+ ObjectAddress copied)

Missing function comment.

~

58.

+ stmt = new_objtree("CREATE");
+
+ append_object_object(stmt, "TEXT SEARCH CONFIGURATION %{identity}D",
+ new_objtree_for_qualname(tscForm->cfgnamespace,
+   NameStr(tscForm->cfgname)));

Why not combine these using VA() function?

~

59.

+ list = NIL;
+ /* COPY */

Just assign NIL when declared.

~

60.

+ if (copied.objectId != InvalidOid)

Use OidIsValid macro.

--

61. deparse_DefineStmt_TSParser

+
+static ObjTree *
+deparse_DefineStmt_TSParser(Oid objectId, DefineStmt *define)

Missing function comment.

~

62.

+ stmt = new_objtree("CREATE");
+
+ append_object_object(stmt, "TEXT SEARCH PARSER %{identity}D",
+ new_objtree_for_qualname(tspForm->prsnamespace,
+   NameStr(tspForm->prsname)));

Why not combine as a single VA() function call?

~

63.

+ list = NIL;

Just assign NIL when declared

~

64.

tmp = new_objtree_VA("START=", 1,
 "clause", ObjTypeString, "start");
append_object_object(tmp, "%{procedure}D",
 new_objtree_for_qualname_id(ProcedureRelationId,
 t

Re: Unit tests for SLRU

2022-11-10 Thread Michael Paquier
On Thu, Nov 10, 2022 at 06:40:44PM +0300, Aleksander Alekseev wrote:
> Fair enough. PFA the corrected patch v5.

Is there a reason why you need a TAP test here?  It is by design more
expensive than pg_regress, which also does not require --enable-tap-tests.
See for example what we do for snapshot_too_old, commit_ts,
worker_spi, etc., where each module uses a custom configuration file.

Hmm.  If I were to write that, I think that I would make the SLRU
directory configurable as a PGC_POSTMASTER GUC, at least for the purpose
of the exercise, and also split test_slru() into more user-callable
functions so that it would be possible to mix more cross-check scenarios
with the low-level C APIs if need be, with adapted input parameters.
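
For the GUC part, that could look roughly like the following in the test
module (a sketch only, with made-up names, not taken from the posted patch):

#include "postgres.h"

#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Sketch: expose the directory used by the test SLRU as a startup-time GUC. */
static char *test_slru_dir = NULL;

void
_PG_init(void)
{
	DefineCustomStringVariable("test_slru.dir",
							   "Directory used for the test SLRU segments.",
							   NULL,
							   &test_slru_dir,
							   "pg_test_slru",
							   PGC_POSTMASTER,
							   0,
							   NULL, NULL, NULL);
}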
--
Michael




Re: Typo about subxip in comments

2022-11-10 Thread Amit Kapila
On Fri, Nov 11, 2022 at 8:56 AM Japin Li  wrote:
>
> Recently, while reading XidInMVCCSnapshot(), I found some typos in the
> comments.
>
> diff --git a/src/backend/storage/ipc/procarray.c 
> b/src/backend/storage/ipc/procarray.c
> index 207c4b27fd..9e8b6756fe 100644
> --- a/src/backend/storage/ipc/procarray.c
> +++ b/src/backend/storage/ipc/procarray.c
> @@ -2409,7 +2409,7 @@ GetSnapshotData(Snapshot snapshot)
>  * We could try to store xids into xip[] first and then into 
> subxip[]
>  * if there are too many xids. That only works if the 
> snapshot doesn't
>  * overflow because we do not search subxip[] in that case. A 
> simpler
> -* way is to just store all xids in the subxact array because 
> this is
> +* way is to just store all xids in the subxip array because 
> this is
>  * by far the bigger array. We just leave the xip array empty.
>  *
>  * Either way we need to change the way XidInMVCCSnapshot() 
> works
> diff --git a/src/backend/utils/time/snapmgr.c 
> b/src/backend/utils/time/snapmgr.c
> index f1f2ddac17..2524b1c585 100644
> --- a/src/backend/utils/time/snapmgr.c
> +++ b/src/backend/utils/time/snapmgr.c
> @@ -2345,7 +2345,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
> else
> {
> /*
> -* In recovery we store all xids in the subxact array because 
> it is by
> +* In recovery we store all xids in the subxip array because 
> it is by
>  * far the bigger array, and we mostly don't know which xids 
> are
>  * top-level and which are subxacts. The xip array is empty.
>  *
>

LGTM.

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2022-11-10 Thread Peter Smith
On Fri, Nov 11, 2022 at 3:47 PM Peter Smith  wrote:
>
> Here are more review comments for the v32-0001 file ddl_deparse.c
>
> *** NOTE - my review post became too big, so I split it into smaller parts.

THIS IS PART 2 OF 4.

===

src/backend/commands/ddl_deparse.c

1. deparse_AlterExtensionStmt

+/*
+ * Deparse an AlterExtensionStmt (ALTER EXTENSION .. UPDATE TO VERSION)
+ *
+ * Given an extension  OID and a parse tree that modified it, return an ObjTree
+ * representing the alter type.
+ */
+static ObjTree *
+deparse_AlterExtensionStmt(Oid objectId, Node *parsetree)

Spurious blank space before "OID"

~

2.

+ ObjTree*stmt;
+ ObjTree*tmp;
+ List*list = NIL;
+ ListCell   *cell;

Variable 'tmp' can be declared only in the scope that it is used.

~

3.

+ foreach(cell, node->options)
+ {
+ DefElem*opt = (DefElem *) lfirst(cell);
+
+ if (strcmp(opt->defname, "new_version") == 0)
+ {
+ tmp = new_objtree_VA("TO %{version}L", 2,
+ "type", ObjTypeString, "version",
+ "version", ObjTypeString, defGetString(opt));
+ list = lappend(list, new_object_object(tmp));
+ }
+ else
+ elog(ERROR, "unsupported option %s", opt->defname);
+ }

This code seems strange to be adding new versions to a list. How can
there be multiple new versions? It does not seem compatible with the
command syntax [1]

--

4. deparse_CreateCastStmt

+ initStringInfo(&func);
+ appendStringInfo(&func, "%s(",
+ quote_qualified_identifier(get_namespace_name(funcForm->pronamespace),
+ NameStr(funcForm->proname)));
+ for (i = 0; i < funcForm->pronargs; i++)
+ appendStringInfoString(&func,
+format_type_be_qualified(funcForm->proargtypes.values[i]));
+ appendStringInfoChar(&func, ')');

Is this correct, or should there be some separators (e.g. commas)
between multiple arg-types?

--

5. deparse_AlterDefaultPrivilegesStmt

+
+static ObjTree *
+deparse_AlterDefaultPrivilegesStmt(CollectedCommand *cmd)

Missing function comment

~

6.

+ schemas = lappend(schemas,
+   new_string_object(strVal(val)));

Unnecessary wrapping.

~

7.

+ /* Add the IN SCHEMA clause, if any */
+ tmp = new_objtree("IN SCHEMA");
+ append_array_object(tmp, "%{schemas:, }I", schemas);
+ if (schemas == NIL)
+ append_bool_object(tmp, "present", false);
+ append_object_object(alterStmt, "%{in_schema}s", tmp);
+
+ /* Add the FOR ROLE clause, if any */
+ tmp = new_objtree("FOR ROLE");
+ append_array_object(tmp, "%{roles:, }R", roles);
+ if (roles == NIL)
+ append_bool_object(tmp, "present", false);
+ append_object_object(alterStmt, "%{for_roles}s", tmp);


I don't really understand why the logic prefers to add a whole new
empty tree with "present: false" versus just adding nothing at all
unless it is relevant.

~

8.

+ if (stmt->action->is_grant)
+ grant = new_objtree("GRANT");
+ else
+ grant = new_objtree("REVOKE");
+
+ /* add the GRANT OPTION clause for REVOKE subcommand */
+ if (!stmt->action->is_grant)
+ {
+ tmp = new_objtree_VA("GRANT OPTION FOR",
+1, "present", ObjTypeBool,
+stmt->action->grant_option);
+ append_object_object(grant, "%{grant_option}s", tmp);
+ }

That 2nd 'if' can just be combined with the 'else' logic of the prior if.

~

9.

+ Assert(priv->cols == NIL);
+ privs = lappend(privs,
+ new_string_object(priv->priv_name));

Unnecessary wrapping.

--

10. deparse_AlterTableStmt

Maybe this function name should be different because it is not only
for TABLEs but also serves for INDEX, VIEW, TYPE, etc

~

11.

AFAICT every case in the switch (subcmd->subtype) is doing subcmds =
lappend(subcmds, new_object_object(tmpobj));

Just doing this in common code at the end might be an easy way to
remove ~50 lines of duplicate code.

--

12. deparse_ColumnDef

+ * NOT NULL constraints in the column definition are emitted directly in the
+ * column definition by this routine; other constraints must be emitted
+ * elsewhere (the info in the parse node is incomplete anyway.).
+ */
+static ObjTree *
+deparse_ColumnDef(Relation relation, List *dpcontext, bool composite,
+   ColumnDef *coldef, bool is_alter, List **exprs)

"anyway.)." -> "anyway)."

~

13.

+ /* USING clause */
+ tmpobj = new_objtree("COMPRESSION");
+ if (coldef->compression)
+ append_string_object(tmpobj, "%{compression_method}I", coldef->compression);
+ else
+ {
+ append_null_object(tmpobj, "%{compression_method}I");
+ append_bool_object(tmpobj, "present", false);
+ }

Why is it necessary to specify a NULL compression method if the entire
"COMPRESSION" is anyway flagged as present=false?

~

14.

+ foreach(cell, coldef->constraints)
+ {
+ Constraint *constr = (Constraint *) lfirst(cell);
+
+ if (constr->contype == CONSTR_NOTNULL)
+ saw_notnull = true;
+ }

Why not break immediately from this loop the first time you find
'saw_notnull' true?

~~~

15.

+ tmpobj = new_objtree("DEFAULT");
+ if (attrForm->atthasdef)
+ {
+ char*defstr;
+
+ defstr = RelationGetColumnDefault(relation, attrForm->attnum,
+   dpcontext, exprs);
+
+ append_string_object(tmpobj, "%{default}s", defstr);

Re: Avoid overhead open-close indexes (catalog updates)

2022-11-10 Thread Michael Paquier
On Thu, Nov 10, 2022 at 08:56:25AM -0300, Ranier Vilela wrote:
> For CopyStatistics() have performance checks.

You are not giving all the details of your tests, though, so I had a
look with some of my stuff using the attached set of SQL functions
(create_function.sql) to create a bunch of indexes with a maximum
number of expressions, as of:
select create_table_cols('tab', 32);
select create_index_multi_exprs('ind', 400, 'tab', 32);
insert into tab values (1);
analyze tab; -- 12.8k~ pg_statistic records

On HEAD, a REINDEX CONCURRENTLY for the table 'tab' takes 1550ms on my
laptop, averaged over 10 runs.  The patch impacts the runtime with a
single session, bringing the execution down to 1480ms, as an effect of
the maximum number of attributes on an index being 32.  There may be
some noise, but there is a trend, and some perf profiles confirm the
same with CopyStatistics().  My case is a bit extreme, of course, but
still that's something.

Anyway, while reviewing this code, it occurred to me that we could do
even better than this proposal once we switch to
CatalogTuplesMultiInsertWithInfo() for the data insertion.

This would reduce the operation overhead further by switching to
multi-INSERTs rather than one INSERT for each index attribute, with
tuples stored in a set of TupleTableSlots, meaning one WAL record
rather than N records.  The approach would be similar to what you do
for dependencies; see for example recordMultipleDependencies() when it
comes to the number of slots used, etc.
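
The shape could be roughly as follows (a sketch only, untested; the function
name, the batch size, and the tuple-array interface are illustrative, not
what CopyStatistics() currently looks like):

#define STATS_SLOT_COUNT	32		/* illustrative batch size */

static void
copy_statistics_batched(Relation statrel, HeapTuple *tuples, int ntuples)
{
	CatalogIndexState indstate = CatalogOpenIndexes(statrel);
	TupleTableSlot *slot[STATS_SLOT_COUNT];
	int			nslots = 0;

	for (int i = 0; i < ntuples; i++)
	{
		slot[nslots] = MakeSingleTupleTableSlot(RelationGetDescr(statrel),
												&TTSOpsHeapTuple);
		ExecStoreHeapTuple(tuples[i], slot[nslots], false);
		nslots++;

		/* Flush a full (or final) batch as one multi-insert. */
		if (nslots == STATS_SLOT_COUNT || i == ntuples - 1)
		{
			CatalogTuplesMultiInsertWithInfo(statrel, slot, nslots, indstate);
			while (nslots > 0)
				ExecDropSingleTupleTableSlot(slot[--nslots]);
		}
	}

	CatalogCloseIndexes(indstate);
}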
--
Michael


create_function.sql
Description: application/sql




Re: Support logical replication of DDLs

2022-11-10 Thread Peter Smith
Here are more review comments for the v32-0001 file ddl_deparse.c

This completes my first review pass over this overly large file.

This review has taken a long time, so for any of my review comments
(in this and previous posts) that get rejected, please reply citing
the rejected reference numbers, because I hope to avoid spending
multiple days (in a future review) trying to reconcile what was
addressed vs what was not addressed. TIA.

*** NOTE - my review post became too big, so I split it into smaller parts.

THIS IS PART 1 OF 4.

==

src/backend/commands/ddl_deparse.c

G.1. GENERAL _VA args wrapping

+ tmp = new_objtree_VA("WITH GRANT OPTION",
+ 1, "present", ObjTypeBool,
+ stmt->action->grant_option);

In general, I think all these _VA() style function calls are easier to
read if you can arrange to put each of the argument names on a new
line instead of just wrapping them randomly.

So the above would be better as:

tmp = new_objtree_VA("WITH GRANT OPTION", 1,
"present", ObjTypeBool, stmt->action->grant_option);

Please search/modify all cases of new_objtree_VA like this.

~~~

G.2. GENERAL - "type" object

There are many functions that insert a "type" object for some purpose:

e.g.
+ tmpobj = new_objtree_VA("DETACH PARTITION %{partition_identity}D FINALIZE", 2,
+ "type", ObjTypeString, "detach partition finalize",
+ "partition_identity", ObjTypeObject,
+ new_objtree_for_qualname_id(RelationRelationId,
+ sub->address.objectId));

e.g.
+ tmpobj = new_objtree_VA(fmtstr, 2,
+ "type", ObjTypeString, "add column",
+ "definition", ObjTypeObject, tree);

I'm not sure yet what these "type" objects are used for, but I felt
that these unsubstituted values should look slightly more like enum
values, and slightly less like real SQL syntax.

For example - maybe do like this (for the above):

"detach partition finalize" -> "type_detach_partition_finalize"
"add column" -> "type_add_column"
etc.

~~~

G.3. GENERAL - JSON deparsed structures should be documented

AFAICT there are mixtures of different JSON structure styles at play
in this module. Sometimes there are trees with names and sometimes
not, sometimes there are "present" objects and sometimes not.
Sometimes entire trees seemed unnecessary to me. It feels quite
arbitrary in places but it's quite hard to compare them because
everything is spread across 9000+ lines.

IMO all these deparse structures ought to be documented. Then I think
it will become apparent that lots of them are inconsistent with the
others. Even if such documentation is ultimately not needed by
end-users, I think it would be a very valuable internal design
accompaniment to this module, and it would help a lot for
reviews/maintenance/bug prevention etc. Better late than never.

~~~

G.4 GENERAL - Underuse of _VA() function.

(Probably I've mentioned this before in previous review comments, but
I keep encountering this many times).

The json is sort of built up part by part and objects are appended ...
it was probably easier to think about each part during coding but OTOH
I think this style is often unnecessary. IMO most times the function
can be far simpler just by gathering together all the necessary values
and then using a single big new_objtree_VA() call to deparse the
complete format in one call. I think it could also shave 100s of lines
of code from the module.

~~~

G.5 GENERAL - Inconsistent function comment wording.

The function comments are worded in different ways...

"Given a XXX OID and the parse tree that created it, return an ObjTree
representing the creation command."

versus

"Given a XXX OID and the parse tree that created it, return the JSON
blob representing the creation command."

Please use consistent wording throughout.

~~~

G.6 GENERAL - present=false

There are many calls that do like:
append_bool_object(tmpobj, "present", false);

I was thinking the code would be cleaner if there was a wrapper function like:

static void
append_not_present(ObjTree objTree)
{
append_bool_object(objTree, "present", false);
}

~~~

G.7 GENERAL - psprintf format strings

There are quite a few places where the format string is
pre-constructed using psprintf.

e.g.
+ fmt = psprintf("ALTER %s %%{identity}s OWNER TO %%{newowner}I",
+stringify_objtype(node->objectType));
+
+ ownerStmt = new_objtree_VA(fmt, 2,
+"identity", ObjTypeString,
+getObjectIdentity(&address, false),
+"newowner", ObjTypeString,
+get_rolespec_name(node->newowner));

It's not entirely clear to me why this kind of distinction is even
made, or even what are the rules governing the choice. AFAICT this
same result could be achieved by using another string substitution
marker. So why not do it that way instead of mixing different styles?

IMO many/most of the psprintf can be removed.

e.g. I mean something like this (for the above example):

fmt = "ALTER %{obj_type}s %{identity}s OWNER TO %{newowner}I";

ownerStmt = new_objtree_VA(fmt, 3,
"obj_type", ObjTypeString, stringify_objtype(node->objectType),

Re: User functions for building SCRAM secrets

2022-11-10 Thread Jonathan S. Katz

On 10/31/22 8:56 PM, Michael Paquier wrote:

On Mon, Oct 31, 2022 at 04:27:08PM -0400, Jonathan S. Katz wrote:

1. password only -- this defers to the PG defaults for SCRAM
2. password + salt -- this is useful for the password history / dictionary
case to allow for a predictable way to check a password.


Well, one could pass a salt based on something generated by random()
to emulate what we currently do in the default case, as well.  The
salt length is an extra possibility, letting it be randomly generated
by pg_strong_random().


Sure, this is a good point. From a SQL level we can get that from 
pgcrypto "gen_random_bytes".


Per this and ilmari's feedback, I updated the 2nd argument to be a 
bytea. See the corresponding tests that then show using decode(..., 
'base64') to handle this.


When I write the docs, I'll include that in the examples.


1. Location of the functions. I put them in
src/backend/utils/adt/cryptohashfuncs.c as I wasn't sure where it would make
sense to have them (and they could easily go into their own file).


As of adt/authfuncs.c?  cryptohashfuncs.c does not strike me as a good
fit.


I went with your suggested name.


Please let me know if you have any questions. I'll add a CF entry for this.


+{ oid => '8555', descr => 'Build a SCRAM secret',
+  proname => 'scram_build_secret_sha256', proleakproof => 't', prorettype => 'text',
+  proargtypes => 'text', prosrc => 'scram_build_secret_sha256_from_password' },
+{ oid => '8556', descr => 'Build a SCRAM secret',
+  proname => 'scram_build_secret_sha256', proleakproof => 't',
+  provolatile => 'i', prorettype => 'text',
+  proargtypes => 'text text', prosrc => 'scram_build_secret_sha256_from_password_and_salt' },
+{ oid => '8557', descr => 'Build a SCRAM secret',
+  proname => 'scram_build_secret_sha256', proleakproof => 't',
+  provolatile => 'i', prorettype => 'text',
+  proargtypes => 'text text int4', prosrc => 'scram_build_secret_sha256_from_password_and_salt_and_iterations' },

Keeping this approach as-is, I don't think that you should consume 3
OIDs, but 1 (with scram_build_secret_sha256_from_password_and_..
as prosrc) that has defaults for the second argument (salt, default
NULL) and third argument (iteration count, default 0), with the
defaults set up in system_functions.sql via a redefinition.


Thanks for the suggestion. I went with this as well.


Note that you cannot pass down an expression for the password of
CREATE/ALTER USER, meaning that this would need a \gset at least if
done by a psql client for example, and passing down a password string
is not an encouraged practice, either.  Another approach is to also
provide a role OID as input and store the newly-computed password in
pg_authid (as of [1]), so that one can store it easily.


...unless you dynamically generate the CREATE/ALTER ROLE command ;) (and 
yes, lots of discussion on that).



Did you look at extending \password?  Being able to extend
PQencryptPasswordConn() with custom parameters has been discussed in
the past, but this has gone nowhere.  That's rather unrelated to what
you are looking for, just mentioning as we are playing with options to
have control the iteration number and the salt.


Not yet, but happy to do that as a follow-up patch.

Please see version 2. If folks are generally happy with this, I'll 
address any additional feedback and write up docs.


Thanks,

Jonathan
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 30a048f6b0..4aa76b81d9 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -594,6 +594,12 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+-- defaults for building a "scram-sha-256" secret
+CREATE OR REPLACE FUNCTION
+  scram_build_secret_sha256(text, bytea DEFAULT NULL, int DEFAULT NULL)
+RETURNS text
+LANGUAGE INTERNAL
+VOLATILE AS 'scram_build_secret_sha256';
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 0de0bbb1b8..7ddb186f96 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -22,6 +22,7 @@ OBJS = \
arraysubs.o \
arrayutils.o \
ascii.o \
+   authfuncs.o \
bool.o \
cash.o \
char.o \
diff --git a/src/backend/utils/adt/authfuncs.c b/src/backend/utils/adt/authfuncs.c
new file mode 100644
index 00..4b4bbd7b7f
--- /dev/null
+++ b/src/backend/utils/adt/authfuncs.c
@@ -0,0 +1,120 @@
+/*-
+ *
+ * authfuncs.c
+ *   Functions that assist with authentication management
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *   src/backend/utils/adt/authfuncs.c
+ *
+ *---

Re: New strategies for freezing, advancing relfrozenxid early

2022-11-10 Thread Justin Pryzby
Note that this fails under -fsanitize=align

Subject: [PATCH v5 2/6] Teach VACUUM to use visibility map snapshot.

performing post-bootstrap initialization ...
../src/backend/access/heap/visibilitymap.c:482:38: runtime error: load of 
misaligned address 0x5559e1352424 for type 'uint64', which requires 8 byte 
alignment

> *all_visible += pg_popcount64(umap[i] & VISIBLE_MASK64);
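
(One common way to avoid that kind of misaligned load -- shown only as a
sketch here, not something the patch does -- is to copy the word into an
aligned local first:)

	uint64		word;

	memcpy(&word, &umap[i], sizeof(uint64));
	*all_visible += pg_popcount64(word & VISIBLE_MASK64);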




Typo about subxip in comments

2022-11-10 Thread Japin Li


Hi, hackers

Recently, while reading XidInMVCCSnapshot(), I found some typos in the
comments.

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 207c4b27fd..9e8b6756fe 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2409,7 +2409,7 @@ GetSnapshotData(Snapshot snapshot)
 * We could try to store xids into xip[] first and then into subxip[]
 * if there are too many xids. That only works if the snapshot doesn't
 * overflow because we do not search subxip[] in that case. A simpler
-* way is to just store all xids in the subxact array because this is
+* way is to just store all xids in the subxip array because this is
 * by far the bigger array. We just leave the xip array empty.
 *
 * Either way we need to change the way XidInMVCCSnapshot() works
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index f1f2ddac17..2524b1c585 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -2345,7 +2345,7 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
	else
	{
		/*
-		 * In recovery we store all xids in the subxact array because it is by
+		 * In recovery we store all xids in the subxip array because it is by
		 * far the bigger array, and we mostly don't know which xids are
		 * top-level and which are subxacts. The xip array is empty.
		 *

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.





Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-11-10 Thread Justin Pryzby
On Thu, Nov 10, 2022 at 04:48:17PM -0800, Peter Geoghegan wrote:
> On Tue, Sep 20, 2022 at 3:12 PM Peter Geoghegan  wrote:
> > Attached is v2, which I'm just posting to keep CFTester happy. No real
> > changes here.
> 
> Attached is v3. I'd like to move forward with commit soon. I'll do so
> in the next few days, barring objections.

Note that this comment is dangling in your patch:

+{
+   Pagepage = BufferGetPage(buffer);
+
+   /* nor when there are no tuples to freeze */
...
-   /* Caller should not call me on a non-WAL-logged relation */
-   Assert(RelationNeedsWAL(reln));
-   /* nor when there are no tuples to freeze */
-   Assert(ntuples > 0);




Re: Printing backtrace of postgres processes

2022-11-10 Thread Kyotaro Horiguchi
At Thu, 10 Nov 2022 15:56:35 +0530, Bharath Rupireddy 
 wrote in 
> On Mon, Apr 18, 2022 at 9:10 AM vignesh C  wrote:
> >
> > The attached v21 patch has the changes for the same.
> 
> Thanks for the patch. Here are some comments:
> 
> 1. I think there's a fundamental problem with this patch, that is, it
> prints the backtrace when the interrupt is processed but not when
> interrupt is received. This way, the backends/aux processes will

Yeah, but the obstacle was backtrace(3) itself. Andres pointed out [1]
that it may be doable with some care (and I agree with that opinion).
AFAICS no discussion followed and things have stayed in the current
shape since then.


[1] 
https://www.postgresql.org/message-id/20201201032649.aekv5b5dicvmovf4%40alap3.anarazel.de
| > Surely this is *utterly* unsafe.  You can't do that sort of stuff in
| > a signal handler.
| 
| That's of course true for the current implementation - but I don't think
| it's a fundamental constraint. With a bit of care backtrace() and
| backtrace_symbols() itself can be signal safe:

man 3 backtrace
>  *  backtrace() and backtrace_symbols_fd() don't call malloc() explicitly,
> but they are part of libgcc, which gets loaded dynamically when first
> used.  Dynamic loading usually triggers a call to malloc(3).  If you need
> certain calls to these two functions to not allocate memory (in signal
> handlers, for example), you need to make sure libgcc is loaded beforehand.
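
(To illustrate that last note: preloading libgcc can be as simple as making
one backtrace() call during startup, outside any signal handler -- a sketch,
not from any patch on this thread:)

#include <execinfo.h>

/* Call backtrace() once at startup so that loading libgcc (and any malloc()
 * it triggers) happens here rather than inside a signal handler. */
static void
preload_backtrace_support(void)
{
	void	   *frames[2];

	(void) backtrace(frames, 2);
}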

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-10 Thread houzj.f...@fujitsu.com
On Monday, November 7, 2022 6:18 PM Masahiko Sawada  
wrote:
> 
> On Thu, Nov 3, 2022 at 10:06 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Wednesday, November 2, 2022 10:50 AM Masahiko Sawada
>  wrote:
> > >
> > > On Mon, Oct 24, 2022 at 8:42 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Wed, Oct 12, 2022 at 3:04 PM Amit Kapila
> 
> > > wrote:
> > > > >
> > > > > On Tue, Oct 11, 2022 at 5:52 AM Masahiko Sawada
> > >  wrote:
> > > > > >
> > > > > > On Fri, Oct 7, 2022 at 2:00 PM Amit Kapila
> 
> > > wrote:
> > > > > > >
> > > > > > > About your point that having different partition structures for
> > > > > > > publisher and subscriber, I don't know how common it will be once
> we
> > > > > > > have DDL replication. Also, the default value of
> > > > > > > publish_via_partition_root is false which doesn't seem to indicate
> > > > > > > that this is a quite common case.
> > > > > >
> > > > > > So how can we consider these concurrent issues that could happen
> only
> > > > > > when streaming = 'parallel'? Can we restrict some use cases to avoid
> > > > > > the problem or can we have a safeguard against these conflicts?
> > > > > >
> > > > >
> > > > > Yeah, right now the strategy is to disallow parallel apply for such
> > > > > cases as you can see in *0003* patch.
> > > >
> > > > Tightening the restrictions could work in some cases but there might
> > > > still be corner cases and it could reduce the usability. I'm not really
> > > > sure that we can ensure such a deadlock won't happen with the current
> > > > restrictions. I think we need something safeguard just in case. For
> > > > example, if the leader apply worker is waiting for a lock acquired by
> > > > its parallel worker, it cancels the parallel worker's transaction,
> > > > commits its transaction, and restarts logical replication. Or the
> > > > leader can log the deadlock to let the user know.
> > > >
> > >
> > > As another direction, we could make the parallel apply feature robust
> > > if we can detect deadlocks that happen among the leader worker and
> > > parallel workers. I'd like to summarize the idea discussed off-list
> > > (with Amit, Hou-San, and Kuroda-San) for discussion. The basic idea is
> > > that when the leader worker or parallel worker needs to wait for
> > > something (eg. transaction completion, messages) we use lmgr
> > > functionality so that we can create wait-for edges and detect
> > > deadlocks in lmgr.
> > >
> > > For example, a scenario where a deadlock occurs is the following:
> > >
> > > [Publisher]
> > > create table tab1(a int);
> > > create publication pub for table tab1;
> > >
> > > [Subscriber]
> > > create table tab1(a int primary key);
> > > create subscription sub connection 'port=1 dbname=postgres'
> > > publication pub with (streaming = parallel);
> > >
> > > TX1:
> > > BEGIN;
> > > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); -- streamed
> > > Tx2:
> > > BEGIN;
> > > INSERT INTO tab1 SELECT i FROM generate_series(1, 5000) s(i); --
> streamed
> > > COMMIT;
> > > COMMIT;
> > >
> > > Suppose a parallel apply worker (PA-1) is executing TX-1 and the
> > > leader apply worker (LA) is executing TX-2 concurrently on the
> > > subscriber. Now, LA is waiting for PA-1 because of the unique key of
> > > tab1 while PA-1 is waiting for LA to send further messages. There is a
> > > deadlock between PA-1 and LA but lmgr cannot detect it.
> > >
> > > One idea to resolve this issue is that we have LA acquire a session
> > > lock on a shared object (by LockSharedObjectForSession()) and have
> > > PA-1 wait on the lock before trying to receive messages. IOW,  LA
> > > acquires the lock before sending STREAM_STOP and releases it if
> > > already acquired before sending STREAM_START, STREAM_PREPARE and
> > > STREAM_COMMIT. For PA-1, it always needs to acquire the lock after
> > > processing STREAM_STOP and then release immediately after acquiring
> > > it. That way, when PA-1 is waiting for LA, we can have a wait-edge
> > > from PA-1 to LA in lmgr, which will make a deadlock in lmgr like:
> > >
> > > LA (waiting to acquire lock) -> PA-1 (waiting to acquire the shared
> > > object) -> LA
> > >
> > > We would need the shared objects per parallel apply worker.
> > >
> > > After detecting a deadlock, we can restart logical replication with
> > > temporarily disabling the parallel apply, which is done by 0005 patch.
> > >
> > > Another scenario is similar to the previous case but TX-1 and TX-2 are
> > > executed by two parallel apply workers (PA-1 and PA-2 respectively).
> > > In this scenario, PA-2 is waiting for PA-1 to complete its transaction
> > > while PA-1 is waiting for subsequent input from LA. Also, LA is
> > > waiting for PA-2 to complete its transaction in order to preserve the
> > > commit order. There is a deadlock among three processes but it cannot
> > > be detected in lmgr because the fact that LA is waiting for PA-2 to
> > > complete its transaction doesn't appear in lmgr (s

RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-10 Thread houzj.f...@fujitsu.com
On Tuesday, November 8, 2022 7:50 PM Amit Kapila  
wrote:
> On Mon, Nov 7, 2022 at 6:49 PM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Friday, November 4, 2022 7:45 PM Amit Kapila
>  wrote:
> > > 3.
> > > apply_handle_stream_start(StringInfo s) { ...
> > > + if (!first_segment)
> > > + {
> > > + /*
> > > + * Unlock the shared object lock so that parallel apply worker
> > > + * can continue to receive and apply changes.
> > > + */
> > > + parallel_apply_unlock(winfo->shared->stream_lock_id);
> > > ...
> > > }
> > >
> > > Can we have an assert before this unlock call that the lock must be
> > > held? Similarly, if there are other places then we can have assert
> > > there as well.
> >
> > It seems we don't have a standard API can be used without a transaction.
> > Maybe we can use the list ParallelApplyLockids to check that ?
> >
> 
> Yeah, that occurred to me as well but I am not sure if it is a good
> idea to maintain this list just for assertion but if it turns out that
> we need to maintain it for a different purpose then we can probably
> use it for assert as well.
> 
> Few other comments/questions:
> =
> 1.
> apply_handle_stream_start(StringInfo s)
> {
> ...
> 
> + case TRANS_PARALLEL_APPLY:
> ...
> ...
> + /*
> + * Unlock the shared object lock so that the leader apply worker
> + * can continue to send changes.
> + */
> + parallel_apply_unlock(MyParallelShared->stream_lock_id,
> AccessShareLock);
> 
> As per the design in the email [1], this lock needs to be released by
> the leader worker during stream start which means it should be
> released under the state TRANS_LEADER_SEND_TO_PARALLEL. From the
> comments as well, it is not clear to me why at this time leader is
> supposed to be blocked. Is there a reason for doing differently than
> what is proposed in the original design?
> 2. Similar to above, it is not clear why the parallel worker needs to
> release the stream_lock_id lock at stream_commit and stream_prepare?

Sorry, these were my mistakes. Changed.

> 3. Am, I understanding correctly that you need to lock/unlock in
> apply_handle_stream_abort() for the parallel worker because after
> rollback to savepoint, there could be another set of stream or
> transaction end commands for which you want to wait? If so, maybe an
> additional comment would serve the purpose.

I think you are right. I will think about this more in case I missed something,
and add some comments in the next version.

> 4.
> The leader may have sent multiple streaming blocks in the queue
> + * When the child is processing a streaming block. So only try to
> + * lock if there is no message left in the queue.
> 
> Let's slightly reword this to: "By the time child is processing the
> changes in the current streaming block, the leader may have sent
> multiple streaming blocks. So, try to lock only if there is no message
> left in the queue."

Changed.

> 5.
> +parallel_apply_unlock(uint16 lockid, LOCKMODE lockmode)
> +{
> + if (!list_member_int(ParallelApplyLockids, lockid))
> + return;
> +
> + UnlockSharedObjectForSession(SubscriptionRelationId,
> MySubscription->oid,
> + lockid, am_leader_apply_worker() ?
> + AccessExclusiveLock:
> + AccessShareLock);
> 
> This function should use lockmode argument passed rather than deciding
> based on am_leader_apply_worker. I think this is anyway going to
> change if we start using a different locktag as discussed in one of
> the above emails.

Changed.

> 6.
> +
>  /*
>   * Common spoolfile processing.
>   */
> -static void
> -apply_spooled_messages(TransactionId xid, XLogRecPtr lsn)
> +void
> +apply_spooled_messages(FileSet *stream_fileset, TransactionId xid,
> 
> Seems like a spurious line addition.

Removed.

> Fair point. I think if the user wants, she can join with pg_stat_subscription
> based on PID and find the corresponding subscription. However, if we want to
> identify everything via pg_locks then I think we should also mention classid
> or database id as field1. So, it would look like: field1: (pg_subscription's
> oid or current db id); field2: OID of subscription in pg_subscription;
> field3: local or remote xid; field4: 0/1 to differentiate between remote and
> local xid.

I tried to use the local xid to lock the transaction, but we currently can only
get the local xid after applying the first change. And it's possible that the
first change in a parallel apply worker is blocked by another parallel apply
worker, which means the parallel apply worker might not have a chance to share
the local xid with the leader.

To resolve this, I tried to use the remote xid for both the stream lock and the
transaction lock, and use field4 (0/1) to differentiate between them. Like:

field1: (current db id); field2: OID of subscription in pg_subscription;
field3: remote xid; field4: 0/1 to differentiate between stream_lock and
transaction_lock.
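
A rough sketch of how such a locktag could be filled in, assuming we add a
dedicated locktag type (all names below are invented for illustration):

/*
 * Sketch only: field4 distinguishes the two lock kinds,
 * 0 = stream lock, 1 = transaction lock.
 */
#define SET_LOCKTAG_APPLY_TRANSACTION(tag, dboid, suboid, remote_xid, kind) \
	((tag).locktag_field1 = (dboid), \
	 (tag).locktag_field2 = (suboid), \
	 (tag).locktag_field3 = (remote_xid), \
	 (tag).locktag_field4 = (kind), \
	 (tag).locktag_type = LOCKTAG_APPLY_TRANSACTION, \
	 (tag).locktag_lockmethodid = DEFAULT_LOCKMETHOD)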


> IIUC, this handling is required for the case when we are not able to send a
> message to parallel apply worker and switch to serialize mo

Re: Lack of PageSetLSN in heap_xlog_visible

2022-11-10 Thread Jeff Davis
On Thu, 2022-10-13 at 12:49 -0700, Jeff Davis wrote:

> It may violate our torn page protections for checksums, as well...

I could not reproduce a problem here, but I believe one exists when
checksums are enabled, because it bypasses the protections of
UpdateMinRecoveryPoint(). By not updating the page's LSN, it would
allow the page to be flushed (and torn) without updating the minimum
recovery point. That would allow the administrator to subsequently do a
PITR or standby promotion while there's a torn page on disk.

I'm considering this a theoretical risk because for a page tear to be
consequential in this case, it would need to happen between the
checksum and the flags; and I doubt that's a real possibility. But
until we formalize that declaration, this problem should be fixed.

Patch attached. I also fixed:

  * The comment in heap_xlog_visible() says that not updating the page
LSN avoids full page writes; but the causation is backwards: avoiding
full page images requires that we don't update the page's LSN.
  * Also in heap_xlog_visible(), there are comments and a branch
leftover from before commit f8f4227976 which don't seem to be necessary
any more.
  * In visibilitymap_set(), I clarified that we *must* not set the page
LSN of the heap page if no full page image was taken.


> > But it still not clear for me that not bumping LSN in this place is
> > correct if wal_log_hints is set.
> > In this case we will have VM page with larger LSN than heap page, 
> > because visibilitymap_set
> > bumps LSN of VM page. It means that in theory after recovery we may
> > have 
> > page marked as all-visible in VM,
> > but not having PD_ALL_VISIBLE  in page header. And it violates VM 
> > constraint:
> 
> I'm not quite following this scenario. If the heap page has a lower
> LSN
> than the VM page, how could we recover to a point where the VM bit is
> set but the heap flag isn't?

I still don't understand this problem scenario.


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS


From aa3890a7364ee8f67f04a59c7a8a9cb65086b084 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Thu, 10 Nov 2022 14:46:30 -0800
Subject: [PATCH v1] Fix theoretical torn page hazard.

The original report was concerned with a possible inconsistency
between the heap and the visibility map, which I was unable to
confirm.

However, there does seem to be a torn page hazard when checksums are
enabled. It may be impossible in practice, because it would require a
page tear between the checksum and the flags, so I am considering this
a theoretical risk.

Also fix related comments, which are misleading.

Reported-by: Konstantin Knizhnik
Discussion: https://postgr.es/m/fed17dac-8cb8-4f5b-d462-1bb4908c0...@garret.ru
Backpatch-through: 11
---
 src/backend/access/heap/heapam.c| 28 ++---
 src/backend/access/heap/visibilitymap.c | 13 ++--
 2 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 12be87efed..5c8cdfb9b2 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8823,21 +8823,17 @@ heap_xlog_visible(XLogReaderState *record)
 		/*
 		 * We don't bump the LSN of the heap page when setting the visibility
 		 * map bit (unless checksums or wal_hint_bits is enabled, in which
-		 * case we must), because that would generate an unworkable volume of
-		 * full-page writes.  This exposes us to torn page hazards, but since
+		 * case we must). This exposes us to torn page hazards, but since
 		 * we're not inspecting the existing page contents in any way, we
 		 * don't care.
-		 *
-		 * However, all operations that clear the visibility map bit *do* bump
-		 * the LSN, and those operations will only be replayed if the XLOG LSN
-		 * follows the page LSN.  Thus, if the page LSN has advanced past our
-		 * XLOG record's LSN, we mustn't mark the page all-visible, because
-		 * the subsequent update won't be replayed to clear the flag.
 		 */
 		page = BufferGetPage(buffer);
 
 		PageSetAllVisible(page);
 
+		if (XLogHintBitIsNeeded())
+			PageSetLSN(page, lsn);
+
 		MarkBufferDirty(buffer);
 	}
 	else if (action == BLK_RESTORED)
@@ -8901,20 +8897,8 @@ heap_xlog_visible(XLogReaderState *record)
 		reln = CreateFakeRelcacheEntry(rlocator);
 		visibilitymap_pin(reln, blkno, &vmbuffer);
 
-		/*
-		 * Don't set the bit if replay has already passed this point.
-		 *
-		 * It might be safe to do this unconditionally; if replay has passed
-		 * this point, we'll replay at least as far this time as we did
-		 * before, and if this bit needs to be cleared, the record responsible
-		 * for doing so should be again replayed, and clear it.  For right
-		 * now, out of an abundance of conservatism, we use the same test here
-		 * we did for the heap page.  If this results in a dropped bit, no
-		 * real harm is done; and the next VACUUM will fix it.
-		 */
-		if (lsn > PageGetLSN(vmpage))
-			visibilitymap_set(reln, blkno

Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans

2022-11-10 Thread Peter Geoghegan
On Tue, Sep 20, 2022 at 3:12 PM Peter Geoghegan  wrote:
> Attached is v2, which I'm just posting to keep CFTester happy. No real
> changes here.

Attached is v3. I'd like to move forward with commit soon. I'll do so
in the next few days, barring objections.

v3 has vacuumlazy.c pass NewRelfrozenXid instead of FreezeLimit for
the purposes of generating recovery conflicts during subsequent REDO
of the resulting xl_heap_freeze_page WAL record. This more general
approach is preparation for my patch to add page-level freezing [1].
It might theoretically lead to more recovery conflicts, but in
practice the impact should be negligible. For one thing VACUUM must
freeze *something* before any recovery conflict can happen during
subsequent REDO on a replica in hot standby. It's far more likely that
any disruptive recovery conflicts come from pruning.

It also makes the cutoff_xid field from the xl_heap_freeze_page WAL
record into a "standard latestRemovedXid format" field. In other words
it backs up an XID passed by vacuumlazy.c caller during original
execution (not in the REDO routine, as on HEAD). To make things
clearer, the patch also renames the nearby xl_heap_visible.cutoff_xid
field to xl_heap_visible.latestRemovedXid. Now there are no WAL
records with a field called "cutoff_xid" (they're all called
"latestRemovedXid" now). This matches PRUNE records, and B-Tree DELETE
records.

The overall picture is that all REDO routines (for both heapam and
index AMs) now advertise that they have a field that they use to
generate recovery conflicts that follows a standard format.  All
latestRemovedXid XIDs are applied in a standard way during REDO: by
passing them to ResolveRecoveryConflictWithSnapshot(). Users can grep
the output of tools like pg_waldump to find latestRemovedXid fields,
without necessarily needing to give any thought to which kind of WAL
records are involved, or even the rmgr. Presenting this information
precisely and uniformly seems useful to me. (Perhaps we should have a
truly generic name, which latestRemovedXid isn't, but that can be
handled separately.)
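
For instance, something like this should pick the field up uniformly across
rmgrs (illustrative invocation only; the path and LSNs are made up):

$ pg_waldump --path=$PGDATA/pg_wal --start=0/1000000 --end=0/2000000 | grep latestRemovedXid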

[1] https://commitfest.postgresql.org/39/3843/
--
Peter Geoghegan


v3-0001-Shrink-freeze-WAL-records-via-deduplication.patch
Description: Binary data


Re: [PATCH] ALTER TABLE ... SET STORAGE default

2022-11-10 Thread Tom Lane
I wrote:
> I've not read the patch in any detail, but I don't see a problem
> with the design.

Hearing no push-back on that position, I reviewed and pushed the
patch.  You'd missed that it also affects CREATE TABLE, but
otherwise it was in pretty good shape.

regards, tom lane




Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-11-10 Thread Peter Geoghegan
On Thu, Nov 10, 2022 at 10:20 AM Imseih (AWS), Sami  wrote:
> Consistency is the key point here. It is odd that a serial
> vacuum may skip the remainder of the indexes if failsafe
> kicks-in, but in the parallel case it will go through the entire index
> cycle.

Yeah, it's a little inconsistent.

> >My gut instinct is that the most important thing is to at least call
> >lazy_check_wraparound_failsafe() once per index scan.
>
> I agree. And this should happen in the serial and parallel case.

I meant that there should definitely be a check between each round of
index scans (one index scan here affects each and every index). Doing
more than a single index scan is presumably rare overall, but likely
common among VACUUM operations that take an unusually long time --
which is where the failsafe is relevant.

I'm just highlighting that multiple index scans (rather than just 0 or
1 index scans) is by far the primary risk factor that leads to a
VACUUM that takes way longer than is typical. (The other notable risk
comes from aggressive VACUUMs that freeze a great deal of heap pages
all at once, which I'm currently addressing by getting rid of the
whole concept of discrete aggressive mode VACUUM operations.)

> >This code effectively treats pages skipped using the visibility map as
> >equivalent to pages physically scanned (scanned_pages), even though
> >skipping requires essentially no work at all. That just isn't logical,
> >and feels like something worth fixing. The fundamental unit of work in
> >lazy_scan_heap() is a *scanned* heap page.
>
> It makes perfect sense to use the scanned_pages instead.

Want to have a go at writing a patch for that?

-- 
Peter Geoghegan




Re: User functions for building SCRAM secrets

2022-11-10 Thread Jacob Champion
On Tue, Nov 8, 2022 at 9:28 PM Michael Paquier  wrote:
> On Tue, Nov 08, 2022 at 04:57:09PM -0800, Jacob Champion wrote:
> > But I guess that wouldn't really help with ALTER ROLE ... PASSWORD,
> > because you can't parameterize it. Hm...
>
> Yeah, and I'd like to think that this is never something we should
> allow, either, as that could be easily a footgun for users (?).

What would make it unsafe? I don't know a lot about the tradeoffs for
parameterizing queries.

--Jacob




JSONPath Child Operator?

2022-11-10 Thread David E. Wheeler
Greetings! Long time no see, I know. How are you, Hackers?

I notice that in the docs for the Postgres JSONPath type, brackets are described as:

> • Square brackets ([]) are used for array access.


  https://www.postgresql.org/docs/current/datatype-json.html#DATATYPE-JSONPATH

Notably they are not used for object field path specifications, in contrast to 
the original JSON Path design, which says:

> JSONPath expressions can use the dot–notation
> 
> $.store.book[0].title
> 
> or the bracket–notation
> 
> $['store']['book'][0]['title']

  https://goessner.net/articles/JsonPath/index.html#e2

Similarly, the current IETF RFC Draft says:

> JSONPath expressions use the _bracket notation_, for example:
> 
>$['store']['book'][0]['title']
> 
>or the more compact _dot notation_, for example:
> 
>$.store.book[0].title

  https://datatracker.ietf.org/doc/draft-ietf-jsonpath-base/


My question: Are there plans to support square bracket syntax for JSON object 
field name strings like this? Or to update to follow the standard as it’s 
finalized?
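
For concreteness, here is the kind of thing I mean (a sketch; per the docs
above I would expect only the first form to be accepted today):

-- dot notation, per the current documentation
SELECT jsonb_path_query(
  '{"store": {"book": [{"title": "Moby Dick"}]}}',
  '$.store.book[0].title');

-- bracket notation for object members, per the JSONPath draft
SELECT jsonb_path_query(
  '{"store": {"book": [{"title": "Moby Dick"}]}}',
  '$["store"]["book"][0]["title"]');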

Thanks,

David





Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-11-10 Thread Tom Lane
Yugo NAGATA  writes:
> Tom Lane  wrote:
>> Hmm.  Maybe the right way to think about this is "if we have completed an
>> EXECUTE, and not yet received a following SYNC, then report that we are in
>> a transaction block"?  But I'm not sure if that breaks any other cases.

> Or, in that case, regarding it as an implicit transaction if multiple commands
> are executed in a pipeline as proposed in [1] could be another solution, 
> although I have once withdrawn this for not breaking backward compatibility.

I didn't like that patch then and I still don't.  In particular, it's
mighty weird to be issuing BeginImplicitTransactionBlock after we've
already completed one command of the pipeline.  If that works without
obvious warts, it's only accidental.

Attached is a draft patch along the lines I speculated about above.
It breaks backwards compatibility in that PreventInTransactionBlock
commands will now be rejected if they're a non-first command in a
pipeline.  I think that's okay, and arguably desirable, for HEAD
but I'm pretty uncomfortable about back-patching it.

I thought of a variant idea that I think would significantly reduce
the risk of breaking working applications, which is to restrict things
only in the case of pipelines with previous data-modifying commands.
I tried to implement that by having PreventInTransactionBlock test

if (GetTopTransactionIdIfAny() != InvalidTransactionId)

but it blew up, because we have various (mostly partitioning-related)
DDL commands that run PreventInTransactionBlock only after they've
acquired an exclusive lock on something, and LogAccessExclusiveLock
gets an XID.  (That was always a horrid POLA-violating kluge that
would bite us on the rear someday, and now it has.  But I can't see
trying to change that in back branches.)

Something could still be salvaged of the idea, perhaps: we could
adjust this patch so that the tests are like

if ((MyXactFlags & XACT_FLAGS_PIPELINING) &&
GetTopTransactionIdIfAny() != InvalidTransactionId)

Maybe that makes it a small enough hazard to be back-patchable.

Another objection that could be raised is the same one I made
already, that !IsInTransactionBlock() doesn't provide the same
guarantee as PreventInTransactionBlock.  I'm not too happy
about that either, but given that we know of no other uses of
IsInTransactionBlock besides ANALYZE, maybe it's okay.  I'm
not sure it's worth trying to avoid it anyway --- we'd just
end up with a probably-dead backwards compatibility stub.

regards, tom lane

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8086b857b9..b7c7fd9f00 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -3488,6 +3488,16 @@ PreventInTransactionBlock(bool isTopLevel, const char *stmtType)
  errmsg("%s cannot run inside a subtransaction",
 		stmtType)));
 
+	/*
+	 * inside a pipeline that has started an implicit transaction?
+	 */
+	if (MyXactFlags & XACT_FLAGS_PIPELINING)
+		ereport(ERROR,
+(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
+		/* translator: %s represents an SQL statement name */
+ errmsg("%s cannot be executed within a pipeline",
+		stmtType)));
+
 	/*
 	 * inside a function call?
 	 */
@@ -3577,9 +3587,11 @@ CheckTransactionBlock(bool isTopLevel, bool throwError, const char *stmtType)
  *	a transaction block than when running as single commands.  ANALYZE is
  *	currently the only example.
  *
- *	If this routine returns "false", then the calling statement is
- *	guaranteed that if it completes without error, its results will be
- *	committed immediately.
+ *	If this routine returns "false", then the calling statement is allowed
+ *	to perform internal transaction-commit-and-start cycles; there is not a
+ *	risk of messing up any transaction already in progress.  (Note that this
+ *	is not the identical guarantee provided by PreventInTransactionBlock,
+ *	since we will not force a post-statement commit.)
  *
  *	isTopLevel: passed down from ProcessUtility to determine whether we are
  *	inside a function.
@@ -3597,6 +3609,9 @@ IsInTransactionBlock(bool isTopLevel)
 	if (IsSubTransaction())
 		return true;
 
+	if (MyXactFlags & XACT_FLAGS_PIPELINING)
+		return true;
+
 	if (!isTopLevel)
 		return true;
 
@@ -3604,13 +3619,6 @@ IsInTransactionBlock(bool isTopLevel)
 		CurrentTransactionState->blockState != TBLOCK_STARTED)
 		return true;
 
-	/*
-	 * If we tell the caller we're not in a transaction block, then inform
-	 * postgres.c that it had better commit when the statement is done.
-	 * Otherwise our report could be a lie.
-	 */
-	MyXactFlags |= XACT_FLAGS_NEEDIMMEDIATECOMMIT;
-
 	return false;
 }
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3082093d1e..f084d9d43b 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2228,6 +2228,12 @@ exec_execute_message(const char *portal_name, long max_rows)
 			 */
 			Comm

RE: Ability to reference other extensions by schema in extension scripts

2022-11-10 Thread Regina Obe
> "Regina Obe"  writes:
> >> I have a distinct sense of deja vu here.  I think this idea, or
> >> something isomorphic to it, was previously discussed with some other
> syntax details.
> 
> > I found the old discussion I recalled having and Stephen had suggested
> > using @extschema{'postgis'}@ On this thread --
> > https://www.postgresql.org/message-
> id/20160425232251.GR10850@tamriel.s
> > nowman.net
> > Is that the one you remember?
> 
> Hmmm ... no, ISTM it was considerably more recent than that.
> [ ...digs... ]  Here we go, it was in the discussion around converting
contrib SQL
> functions to new-style:
> 
> https://www.postgresql.org/message-
> id/flat/3395418.1618352794%40sss.pgh.pa.us
> 
> There are a few different ideas bandied around in there.
> Personally I still like the @extschema:extensionname@ option the best,
> though.
> 
>   regards, tom lane

I had initially thought of a syntax that could always be used, even outside
of extension installation, as some mentioned, like the PG_EXTENSION_SCHEMA(cube)
example. The main benefit I see with that is that even if an extension is moved,
all the dependent extensions that reference it would still work fine.

I had dismissed that because it seemed too invasive.  Seems like it would
require changes to the parser and possibly add query performance overhead to
resolve the schema. Not to mention the added testing required to do no harm.

The other reason I dismissed it is that, at least for PostGIS, it would be
harder to conditionally replace. The issue with PG_EXTENSION_SCHEMA(cube) is
that we can't support it in lower PG versions, so we'd need to strip it out
for those versions, and that would introduce the possibility of missing
PG_EXTENSION_SCHEMA(cube) vs. PG_EXTENSION_SCHEMA( cube ). Not a huge deal,
but not quite as easy and precise as just stripping
@extschema:extensionname@ references.

The @extschema:extensionname@ approach doesn't solve all problems, but it
covers the key ones we care about, like breakage of functions used in indexes
and materialized views, adds security, and is a little easier to strip out.
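
To make that concrete, here is roughly what I have in mind for a script line
(illustrative only; the function name is made up, with cube as the referenced
extension):

-- new-style, resolved to cube's schema at install/upgrade time
CREATE OR REPLACE FUNCTION my_cube_size(c @extschema:cube@.cube)
RETURNS float8
AS 'MODULE_PATHNAME', 'my_cube_size'
LANGUAGE C IMMUTABLE STRICT;

-- for lower PG versions the build strips the qualification, leaving a
-- plain, search_path-dependent reference to cube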

I'll work on producing a patch.

Thanks,
Regina





Re: Call lazy_check_wraparound_failsafe earlier for parallel vacuum

2022-11-10 Thread Imseih (AWS), Sami
>It makes sense to prefer consistency here, I suppose. The reason why
>we're not consistent is because it was easier not to be, which isn't
>exactly the best reason (nor the worst).

Consistency is the key point here. It is odd that a serial
vacuum may skip the remainder of the indexes if failsafe
kicks-in, but in the parallel case it will go through the entire index
cycle.

>My gut instinct is that the most important thing is to at least call
>lazy_check_wraparound_failsafe() once per index scan. 

I agree. And this should happen in the serial and parallel case.

> That said, one thing that does bother me in this area occurs to me: we
> call lazy_check_wraparound_failsafe() from lazy_scan_heap() (before we
> get to the index scans that you're talking about) at an interval that
> is based on how many heap pages we've either processed (and recorded
> as a scanned_pages page) *or* have skipped over using the visibility
> map. In other words we use blkno here, when we should really be using
> scanned_pages instead:

>if (blkno - next_failsafe_block >= FAILSAFE_EVERY_PAGES)
>{
>lazy_check_wraparound_failsafe(vacrel);
>next_failsafe_block = blkno;
>}

>This code effectively treats pages skipped using the visibility map as
>equivalent to pages physically scanned (scanned_pages), even though
>skipping requires essentially no work at all. That just isn't logical,
>and feels like something worth fixing. The fundamental unit of work in
>lazy_scan_heap() is a *scanned* heap page.

It makes perfect sense to use the scanned_pages instead.
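
Something along these lines is what I would try first (an untested sketch;
only the tracking variable is renamed):

    if (vacrel->scanned_pages - next_failsafe_scanned_pages >= FAILSAFE_EVERY_PAGES)
    {
        lazy_check_wraparound_failsafe(vacrel);
        next_failsafe_scanned_pages = vacrel->scanned_pages;
    }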

Regards,

Sami imseih
Amazon Web Services (AWS)



Re: Document parameter count limit

2022-11-10 Thread David G. Johnston
On Thu, Nov 10, 2022 at 10:58 AM Corey Huinker 
wrote:

>
>> +if you are reading this prepatorily, please redesign your
>> query to use temporary tables or arrays
>>
>
> I agree with the documentation of this parameter.
> I agree with dissuading anyone from attempting to change it
> The wording is bordering on snark (however well deserved) and I think the
> voice is slightly off.
>
> Alternate suggestion:
>
> Queries approaching this limit usually can be refactored to use arrays or
> temporary tables, thus reducing parameter overhead.
>
>
> The bit about parameter overhead appeals to the reader's desire for
> performance, rather than just focusing on "you shouldn't want this".
>

Yeah, the wording is a bit tongue-in-cheek.  I figured that, assuming a
committer wants this at all, we'd come up with better wording.  I like your
suggestion.

David J.


Re: Lock on ShmemVariableCache fields?

2022-11-10 Thread Alvaro Herrera
On 2022-Nov-01, Michael Paquier wrote:

> On Mon, Oct 31, 2022 at 03:14:54PM +0800, Japin Li wrote:
> > For example, since SetCommitTsLimit() is only used in BootStrapXLog() and
> > StartupXLOG(), we can safely remove the code of acquiring/releasing lock?
> 
> Logically yes, I guess that you could go without the LWLock acquired
> in this routine at this early stage of the game.  Now, perhaps that's
> not worth worrying, but removing these locks could impact any external
> code relying on SetCommitTsLimit() to actually hold them.

My 0.02€:  From an API point of view it makes no sense to remove
acquisition of the lock in SetCommitTsLimit, particularly given that
that function is not at all performance critical.  I think the first
question I would ask, when somebody proposes to make a change like that,
is *why* they want to make that change.  Is it just because we *can*?
That doesn't sound, to me, sufficient motivation.  You actually
introduce *more* complexity, because after such a change any future
hacker would have to worry about whether their changes are still valid
considering that these struct members are modified unlocked someplace.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Syntax error: function hell() needs an argument.
Please choose what hell you want to involve.




Re: Document parameter count limit

2022-11-10 Thread Corey Huinker
>
>
> +if you are reading this prepatorily, please redesign your
> query to use temporary tables or arrays
>

I agree with the documentation of this parameter.
I agree with dissuading anyone from attempting to change it.
The wording is bordering on snark (however well deserved) and I think the
voice is slightly off.

Alternate suggestion:

Queries approaching this limit usually can be refactored to use arrays or
temporary tables, thus reducing parameter overhead.


The bit about parameter overhead appeals to the reader's desire for
performance, rather than just focusing on "you shouldn't want this".
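
For example, the refactoring being suggested is essentially this (sketch only):

-- instead of one parameter per value:
--   SELECT * FROM t WHERE id IN ($1, $2, ..., $n)
-- pass a single array parameter:
SELECT * FROM t WHERE id = ANY($1);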


Re: Ability to reference other extensions by schema in extension scripts

2022-11-10 Thread Tom Lane
"Regina Obe"  writes:
>> I have a distinct sense of deja vu here.  I think this idea, or something
>> isomorphic to it, was previously discussed with some other syntax details.

> I found the old discussion I recalled having and Stephen had suggested using
> @extschema{'postgis'}@
> On this thread --
> https://www.postgresql.org/message-id/20160425232251.gr10...@tamriel.snowman.net
> Is that the one you remember?

Hmmm ... no, ISTM it was considerably more recent than that.
[ ...digs... ]  Here we go, it was in the discussion around
converting contrib SQL functions to new-style:

https://www.postgresql.org/message-id/flat/3395418.1618352794%40sss.pgh.pa.us

There are a few different ideas bandied around in there.
Personally I still like the @extschema:extensionname@
option the best, though.

regards, tom lane




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-10 Thread David Christensen
On Thu, Nov 10, 2022 at 2:33 AM Bharath Rupireddy
 wrote:
>
> On Thu, Nov 10, 2022 at 1:31 AM David Christensen
>  wrote:
> >
>
> Thanks for providing the v7 patch, please see my comments and responses below.

Hi Bharath, Thanks for the feedback.

> > > 2. I'm unable to understand the use-case for --fixup-fpi option.
> > > pg_waldump is supposed to be just WAL reader, and must not return any
> > > modified information, with --fixup-fpi option, the patch violates this
> > > principle i.e. it sets page LSN and returns. Without actually
> > > replaying the WAL record on the page, how is it correct to just set
> > > the LSN? How will it be useful? ISTM, we must ignore this option
> > > unless there's a strong use-case.
> >
> > How I was envisioning this was for cases like extreme surgery for
> > corrupted pages, where you extract the page from WAL but it has lsn
> > and checksum set so you could do something like `dd if=fixup-block
> > of=relation ...`, so it *simulates* the replay of said fullpage blocks
> > in cases where for some reason you can't play the intermediate
> > records; since this is always a fullpage block, it's capturing what
> > would be the snapshot so you could manually insert somewhere as needed
> > without needing to replay (say if dealing with an incomplete or
> > corrupted WAL stream).
>
> Recovery sets the page LSN after it replayed the WAL record on the
> page right? Recovery does this - base_page/FPI +
> apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
> new_version_of_page. Essentially, in your patch, you are just setting
> the WAL record LSN with the page contents being the base page's. I'm
> still not sure what's the real use-case here. We don't have an
> independent function in postgres, given a base page and a WAL record
> that just replays the WAL record and output's the new version of the
> page, so I think what you do in the patch with fixup option seems
> wrong to me.

Well, if it's not the same output then I guess you're right and there's
no use for the `--fixup` mode.  By the same token, I'd say
calculating/setting the checksum also wouldn't need to be done; we
should just include the page as it appears in the WAL stream.

> > > 5.
> > > +if (!RestoreBlockImage(record, block_id, page))
> > > +continue;
> > > +
> > > +/* we have our extracted FPI, let's save it now */
> > > After extracting the page from the WAL record, do we need to perform a
> > > checksum on it?
>
> I think you just need to do the following, this will ensure the
> authenticity of the page that pg_waldump returns.
> if ((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
> {
> pg_fatal("page checksum failed");
> }

The WAL already has a checksum, so I'm not certain this makes sense on its
own. Also I'm inclined to make it a warning if it doesn't match rather
than a fatal. (I'd also have to verify that the checksum is properly
set on the page prior to copying the FPI into WAL, which I'm pretty
sure it is but not certain.)

> > > case 'W':
> > > config.save_fpw_path = pg_strdup(optarg);
> > > case 'X':
> > >config.fixup_fpw = true;
> > >config.save_fpw_path = pg_strdup(optarg);
> >
> > Like separate opt processing with their own `break` statement?
> > Probably a bit more readable/conventional.
>
> Yes.

Moot with the removal of the --fixup mode.

> Some more comments:
>
> 1.
> +PGAlignedBlockzerobuf;
> Essentially, it's not a zero buffer, please rename the variable to
> something like 'buf' or 'page_buf' or someother?

Sure.

> 2.
> +if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
> Replace pg_pwrite with fwrite() and avoid fileno() system calls that
> should suffice here, AFICS, we don't need pg_pwrite.

Sure.

> 3.
> +if (config.save_fpw_path != NULL)
> +{
> +/* Create the dir if it doesn't exist */
> +if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> I think  you still need pg_check_dir() here, how about something like below?

I was assuming pg_mkdir_p() acted just like mkdir -p, i.e., it's an
idempotent action, so an existing dir is treated the same. What's the
benefit here? I would assume that if a non-dir file existed at that path,
or other permission issues arose, we'd just get an error from
pg_mkdir_p().  (Will review the code there and confirm.)

> 4.
> +/* Create the dir if it doesn't exist */
> +if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
> +{
> +pg_log_error("could not create output directory \"%s\": %m",
> + config.save_fpw_path);
> +goto bad_argument;
> Why is the directory creation error a bad_argument? I think you need
> just pg_fatal() here.

Sure. Was just following the other patterns I'd seen for argument handling.

> 5.
> +fsync(fileno(OPF));
> +fclose(OPF);
> I think just the fsync() isn't enough, you still need fsync_fname()
> and/or fsync_parent_pa

Re: Unit tests for SLRU

2022-11-10 Thread Aleksander Alekseev
Hi Michael,

Thanks for the review and also for committing 5ca3645.

> This patch redesigns SimpleLruInit() [...]
> I am not really convinced that this is something that a
> patch aimed at extending testing coverage should redesign, especially
> with a routine as old as that.
> [...] you'd better put this facility in src/test/modules/

Fair enough. PFA the corrected patch v5.

-- 
Best regards,
Aleksander Alekseev


v5-0001-Add-unit-tests-for-SLRU.patch
Description: Binary data


Re: Database "contrib_regression" does not exist during testing

2022-11-10 Thread Tom Lane
Jingtang Zhang  writes:
> Recently when I was running regression tests, I got 'Database 
> "contrib_regression" does not exist' error. After I reproduce the 
> problem, I found it is an auto-vacuum worker process who complains about 
> this error.

That's perfectly normal.  I don't see anything to change here.

regards, tom lane




Re: [PATCH] Support % wildcard in extension upgrade filenames

2022-11-10 Thread Regina Obe
Apologies.  I made a mistake on my downgrade test.
So point 3 should be:

3) It is possible to downgrade with the wildcard.  For example I had
paths
wildtest--%--2.1.sql 

and confirmed it used the downgrade path when going from 2.2 to 2.1

RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-10 Thread houzj.f...@fujitsu.com
On Monday, November 7, 2022 4:17 PM Peter Smith 
> 
> Here are my review comments for v42-0001

Thanks for the comments.
> ==
> 
> 28. handle_streamed_transaction
> 
>  static bool
>  handle_streamed_transaction(LogicalRepMsgType action, StringInfo s)  {
> - TransactionId xid;
> + TransactionId current_xid;
> + ParallelApplyWorkerInfo *winfo;
> + TransApplyAction apply_action;
> + StringInfoData origin_msg;
> +
> + apply_action = get_transaction_apply_action(stream_xid, &winfo);
> 
>   /* not in streaming mode */
> - if (!in_streamed_transaction)
> + if (apply_action == TRANS_LEADER_APPLY)
>   return false;
> 
> - Assert(stream_fd != NULL);
>   Assert(TransactionIdIsValid(stream_xid));
> 
> + origin_msg = *s;
> 
> ~
> 
> 28b.
> Why not assign it at the declaration, the same as apply_handle_stream_prepare
> does?

The assignment is unnecessary for non-streaming transactions, so I delayed it.
> ~
> 
> 44b.
> If this is always written to a file, then wouldn't a better function name be
> something including the word "serialize" - e.g.
> serialize_message()?

I feel it would be better to be consistent with the existing style 
stream_xxx_xx().

I think I have addressed all the comments, but since quite a bit of the logic
has changed in the new version, I might have missed something. Some code
wrapping also needs to be adjusted; I plan to run pgindent for the next version.

Best regards,
Hou zj



RE: Perform streaming logical transactions by background workers and parallel apply

2022-11-10 Thread houzj.f...@fujitsu.com
On Monday, November 7, 2022 7:43 PM Kuroda, Hayato/黒田 隼人 
 wrote:
> 
> Dear Hou,
> 
> The followings are my comments. I want to consider the patch more, but I sent
> it once.

Thanks for the comments.

> 
> ===
> worker.c
> 
> 01. typedef enum TransApplyAction
> 
> ```
> /*
>  * What action to take for the transaction.
>  *
>  * TRANS_LEADER_APPLY means that we are in the leader apply worker and
> changes
>  * of the transaction are applied directly in the worker.
>  *
>  * TRANS_LEADER_SERIALIZE means that we are in the leader apply worker or
> table
>  * sync worker. Changes are written to temporary files and then applied when
>  * the final commit arrives.
>  *
>  * TRANS_LEADER_SEND_TO_PARALLEL means that we are in the leader apply
> worker
>  * and need to send the changes to the parallel apply worker.
>  *
>  * TRANS_PARALLEL_APPLY means that we are in the parallel apply worker and
>  * changes of the transaction are applied directly in the worker.
>  */
> ```
> 
> TRANS_LEADER_PARTIAL_SERIALIZE should be listed in.
> 

Added.

> 02. handle_streamed_transaction()
> 
> ```
> +   StringInfoData  origin_msg;
> ...
> +   origin_msg = *s;
> ...
> +   /* Write the change to the current file */
> +   stream_write_change(action,
> +
> apply_action == TRANS_LEADER_SERIALIZE ?
> +
> + s : &origin_msg);
> ```
> 
> I'm not sure why origin_msg is needed. Can we remove the conditional
> operator?

Currently, the parallel apply worker needs the transaction xid of this change
to define a savepoint. So, we need to write the original message to the file.

> 
> 03. apply_handle_stream_start()
> 
> ```
> + * XXX We can avoid sending pairs of the START/STOP messages to the
> + parallel
> + * worker because unlike apply worker it will process only one
> + transaction at a
> + * time. However, it is not clear whether any optimization is
> + worthwhile
> + * because these messages are sent only when the
> + logical_decoding_work_mem
> + * threshold is exceeded.
> ```
> 
> This comment should be modified because PA must acquire and release locks at
> that time.
> 
> 
> 04. apply_handle_stream_prepare()
> 
> ```
> +   /*
> +* After sending the data to the parallel apply 
> worker,
> wait for
> +* that worker to finish. This is necessary to 
> maintain
> commit
> +* order which avoids failures due to transaction
> dependencies and
> +* deadlocks.
> +*/
> +
> + parallel_apply_wait_for_xact_finish(winfo->shared);
> ```
> 
> Here seems not to be correct. LA may not send data but spill changes to file.

Changed.

> 05. apply_handle_stream_commit()
> 
> ```
> +   if (apply_action ==
> TRANS_LEADER_PARTIAL_SERIALIZE)
> +
> + stream_cleanup_files(MyLogicalRepWorker->subid, xid);
> ```
> 
> I'm not sure whether the stream files should be removed by LA or PAs. Could
> you tell me the reason why you choose LA?

I think it is more natural that only the LA can write/delete/create the file
and the PA only needs to read from it.

> ===
> applyparallelworker.c
> 
> 05. parallel_apply_can_start()
> 
> ```
> +   if (switching_to_serialize)
> +   return false;
> ```
> 
> Could you add a comment like:
> Don't start a new parallel apply worker if the leader apply worker has been
> spilling changes to the disk temporarily.

These codes have been removed.

> 06. parallel_apply_start_worker()
> 
> ```
> +   /*
> +* Set the xact_state flag in the leader instead of the
> +* parallel apply worker to avoid the race condition where the leader
> has
> +* already started waiting for the parallel apply worker to finish
> +* processing the transaction while the child process has not yet
> +* processed the first STREAM_START and has not set the
> +* xact_state to true.
> +*/
> ```
> 
> I thinkg the word "flag" should be used for boolean, so the comment should be
> modified.
> (There are so many such code-comments, all of them should be modified.)

Changed.

> 
> 07. parallel_apply_get_unique_id()
> 
> ```
> +/*
> + * Returns the unique id among all parallel apply workers in the subscriber.
> + */
> +static uint16
> +parallel_apply_get_unique_id()
> ```
> 
> I think this function is inefficient: the computational complexity will be 
> increased
> linearly when the number of PAs is increased. I think the Bitmapset data
> structure may be used.

This function is removed.

> 08. parallel_apply_send_data()
> 
> ```
> #define CHANGES_THRESHOLD 1000
> #define SHM_SEND_TIMEOUT_MS   1
> ```
> 
> I think the timeout may be too long. Could you tell me the background about 
> it?

Serializing data to a file would affect performance, so I tried to make that
unlikely to happen unless the PA is really blocked by another PA or BA.

> 09. parallel_apply_send_data()
> 
> `

Database "contrib_regression" does not exist during testing

2022-11-10 Thread Jingtang Zhang

Hello everyone.

Recently, when I was running regression tests, I got a 'Database 
"contrib_regression" does not exist' error. After reproducing the 
problem, I found that it is an autovacuum worker process that complains 
about this error.


Then I tried to analyze the code. When this autovacuum worker process 
is forked from the postmaster and gets into `InitPostgres` in postinit.c, it 
performs the following steps:


1. Use the OID of the current database to search for its tuple in the 
catalog and get the database name. During this time, it takes an 
AccessShareLock on the catalog and releases it after the scan;

2. Call LockSharedObject to take a RowExclusiveLock on the catalog;
3. Use the database name to search the catalog again, to make sure the tuple 
of the current database still exists.


During the interval between steps 1 and 2, the catalog is not protected 
by any lock, so another backend process can drop the database 
successfully, causing the current process to complain in step 3 that the 
database does not exist.


This issue can happen not only between an autovacuum worker process and a 
backend process, but also between two backend processes, given a particular 
interleaving of the processes. We can use psql to connect to the database, 
make the backend process stop in the interval between steps 1 and 2, and let 
another backend process drop this database; the first backend process will 
then complain about this error.
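
For illustration, the interleaving is roughly the following (I paused backend 1
with a debugger between steps 1 and 2; the session labels are only for the
example):

-- session 1: connect to the target database, which runs InitPostgres
--   $ psql contrib_regression
--   (paused between step 1 and step 2)

-- session 2: connected to another database, e.g. postgres
DROP DATABASE contrib_regression;

-- session 1 then resumes and fails at step 3 with:
--   database "contrib_regression" does not exist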


I am not sure whether this error should happen in regression 
testing. Would it be possible to lock the catalog at step 1 and hold the lock, 
so that another process does not have the chance to drop the database, given 
that dropdb needs to lock the catalog with AccessExclusiveLock? And what is 
the rationale behind the design of these 3 steps?


Hoping to get some feedback from core hackers, thanks~


--
Best Regards,

Jingtang

——

Jingtang Zhang

E-Mail: mrdrivingd...@gmail.com
GitHub: @mrdrivingduck

Sent from Microsoft Surface Book 2.





What’s the usage of SO_TYPE_ANALYZE

2022-11-10 Thread Zhang Mingli
Hi, hackers

I found that we set the SO_TYPE_ANALYZE option in table_beginscan_analyze()

static inline TableScanDesc
table_beginscan_analyze(Relation rel)
{
uint32 flags = SO_TYPE_ANALYZE;

return rel->rd_tableam->scan_begin(rel, NULL, 0, NULL, NULL, flags);
}

But I didn't find a place that handles that option.
Are we missing something, e.g. forgetting to handle it in table_endscan()?
Otherwise, it's not used at all.

The first commit that introduced this option is c3b23ae457.
Other commits touching this option just changed the enum order.

Regards,
Zhang Mingli


Re: Documentation for building with meson

2022-11-10 Thread Nazir Bilal Yavuz

Hi,

I did some tests on windows. I used 'ninja' as a backend.

On 11/8/2022 9:23 PM, samay sharma wrote:

Hi,

On Sat, Nov 5, 2022 at 2:39 PM Andres Freund  wrote:

Hi,

On 2022-10-30 20:51:59 -0700, samay sharma wrote:
> +# setup and enter build directory (done only first time)
> +meson setup build src --prefix=$PWD/install

This command won't work on windows, I think.



Yes, $PWD isn't recognized on Windows; %CD% could be an alternative.
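
i.e., something like this on Windows (not verified here):

meson setup build src --prefix=%CD%\install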



> +      
> +
 
--sysconfdir=DIRECTORY
> +       
> +        
> +         Sets the directory for various configuration files,
> +  PREFIX/etc by
default.
> +        
> +       
> +      

Need to check what windows does here.



It is the same on Windows: 'PREFIX/etc'.

I also checked the other dirs (bindir, sysconfdir, libdir, includedir, 
datadir, localedir, mandir); the default path is correct for all of them.


Regards,
Nazir Bilal Yavuz




Re: Crash in BRIN minmax-multi indexes

2022-11-10 Thread Tomas Vondra
On 11/9/22 00:13, Justin Pryzby wrote:
> On Mon, Oct 03, 2022 at 10:29:38PM +0200, Tomas Vondra wrote:
>> I'll get this fixed in a couple days. Considering the benign nature of
>> this issue (unnecessary assert) I'm not going to rush.
> 
> Is this still an outstanding issue ?
> 

Yes. I'll get it fixed, but it's harmless in practice (without asserts),
and I've been focusing on the other issue with broken NULL-handling in
BRIN indexes.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Fix comments atop ReorderBufferAddInvalidations

2022-11-10 Thread Amit Kapila
On Fri, Nov 4, 2022 at 11:14 AM Masahiko Sawada  wrote:
>
> On Thu, Nov 3, 2022 at 7:53 PM Amit Kapila  wrote:
> >
> > The comments atop seem to indicate that we always accumulate
> > invalidation messages in top-level transactions which is neither
> > required nor match with the code. This is introduced in the commit
> > c55040ccd0 and I have observed it while working on a fix for commit
> > 16b1fe0037.
>
> Thank you for the patch. It looks good to me.
>
> I think we can backpatch it to avoid confusion in future.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: ExecRTCheckPerms() and many prunable partitions

2022-11-10 Thread Alvaro Herrera
On 2022-Oct-06, Amit Langote wrote:

> Actually, List of Bitmapsets turned out to be something that doesn't
> just-work with our Node infrastructure, which I found out thanks to
> -DWRITE_READ_PARSE_PLAN_TREES.  So, I had to go ahead and add
> first-class support for copy/equal/write/read support for Bitmapsets,
> such that writeNode() can write appropriately labeled versions of them
> and nodeRead() can read them as Bitmapsets.  That's done in 0003.  I
> didn't actually go ahead and make *all* Bitmapsets in the plan trees
> to be Nodes, but maybe 0003 can be expanded to do that.  We won't need
> to make gen_node_support.pl emit *_BITMAPSET_FIELD() blurbs then; can
> just use *_NODE_FIELD().

Hmm, is this related to what Tom posted as part of his 0004 in
https://postgr.es/m/2901865.1667685...@sss.pgh.pa.us

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Allow pageinspect's bt_page_stats function to return a set of rows instead of a single row

2022-11-10 Thread Hamid Akhtar
On Wed, 12 Oct 2022 at 10:51, Michael Paquier  wrote:

> On Mon, Sep 12, 2022 at 03:18:59PM -0400, Tom Lane wrote:
> > I spun up a 32-bit VM, since that had been on my to-do list anyway,
> > and it looks like I was right:
>
> This feedback has not been addressed and the thread is idle four
> weeks, so I have marked this CF entry as RwF.  Please feel free to
> resubmit once a new version of the patch is available.
>

Attaching version 5 of the patch, which addresses all 3 points raised by
Tom Lane earlier in the thread.
(1) Documentation is added.
(2) Passing "-1" for the number of blocks required now returns all the
remaining index pages after the starting block.
(3) The newly added test cases work for both 32-bit and 64-bit systems.



> --
> Michael
>


pageinspect_btree_multipagestats-v5.patch
Description: Binary data


Re: ExecRTCheckPerms() and many prunable partitions

2022-11-10 Thread Alvaro Herrera
Hello

I've been trying to understand what 0001 does.  One thing that strikes
me is that it seems like it'd be easy to have bugs because of modifying
the perminfo list inadequately.  I couldn't find any cross-check that
some perminfo element that we obtain for an RTE does actually match the
relation we wanted to check.  Maybe we could add a test in some central
place that perminfo->relid equals rte->relid?
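
Something as simple as this, in whatever central lookup routine the patch ends
up using (placement and variable names are hypothetical):

    perminfo = list_nth_node(RTEPermissionInfo, rtepermlist,
                             rte->perminfoindex - 1);
    Assert(perminfo->relid == rte->relid);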

A related point is that concatenating lists doesn't seem to worry about
not processing one element multiple times and ending up with bogus offsets.
(I suppose you still have to let an element be processed multiple times
in case you have nested subqueries?  I wonder how good the test coverage is
for such scenarios.)

Why do callers of add_rte_to_flat_rtable() have to modify the rte's
perminfoindex themselves, instead of having the function do it for them?
That looks strange.  But also it's odd that flatten_unplanned_rtes
concatenates the two lists after all the perminfoindexes have been
modified, rather than doing both things (adding each RTE's perminfo to
the global list and offsetting the index) as we walk the list, in
flatten_rtes_walker.  It looks like these two actions are disconnected
from one another, but unless I misunderstand, in reality the opposite is
true.

I think the API of ConcatRTEPermissionInfoLists is a bit weird.  Why not
have the function return the resulting list instead, just like
list_append?  It is more verbose, but it seems easier to grok.

Two trivial changes attached.  (Maybe 0002 is not correct, if you're
also trying to reference finalrtepermlist; but in that case I think the
original may have been misleading as well.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
>From 47afb0d3e1cf6fdf77b5ace643ece15e9c618006 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 10 Nov 2022 12:34:01 +0100
Subject: [PATCH 2/2] Simplify comment a little bit

---
 src/include/nodes/parsenodes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 39e9e82bfc..abcf8ee229 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1026,7 +1026,7 @@ typedef struct RangeTblEntry
 * avoid getting an additional, lesser lock.
 *
 * perminfoindex is 1-based index of the RTEPermissionInfo belonging to
-* this RTE in the query's list of RTEPermissionInfos; 0 if permissions
+* this RTE in the containing query's rtepermlist; 0 if permissions
 * need not be checked for the RTE.
 */
Oid relid;  /* OID of the relation 
*/
-- 
2.30.2

>From 682313f8cd4402b4049792e001b4c24ff691a8f1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 10 Nov 2022 12:33:44 +0100
Subject: [PATCH 1/2] fix typo

---
 src/include/nodes/parsenodes.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 2d648a8fdd..39e9e82bfc 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -1187,7 +1187,7 @@ typedef struct RangeTblEntry
  * reference is represented by setting the bit for InvalidAttrNumber.
  *
  * updatedCols is also used in some other places, for example, to determine
- * which triggers to fire and in FDWs to know which changed columns the need
+ * which triggers to fire and in FDWs to know which changed columns they need
  * to ship off.
  *
  * Generated columns that are caused to be updated by an update to a base
-- 
2.30.2



Re: Avoid overhead open-close indexes (catalog updates)

2022-11-10 Thread Ranier Vilela
Em qui., 10 de nov. de 2022 às 05:16, Michael Paquier 
escreveu:

> On Thu, Sep 01, 2022 at 08:42:15AM -0300, Ranier Vilela wrote:
> > Let's wait for the patch to be accepted and committed, so we can try to
> > change it.
>
> FWIW, I think that this switch is a good idea for cases where we
> potentially update a bunch of tuples, especially based on what
> CatalogTupleInsert() tells in its top comment.

That's the idea.


> Each code path updated
> here needs a performance check to see if that's noticeable enough, but
> I can get behind the one of CopyStatistics(), at least.
>
For CopyStatistics() we have performance checks.


>
> EnumValuesCreate() would matter less as this would require a large set
> of values in an enum, but perhaps ORMs would care and that should be
> measurable.

There is a list_length call for the number of vals.
For 2 or more vals, it is already worth it, since otherwise
CatalogOpenIndexes/CatalogCloseIndexes would be called for each val.



> update_attstats() should lead to a measurable difference
> with a relation that has a bunch of attributes with few tuples.
>
Same here.
For 2 or more attributes, it is already worth it, since otherwise
CatalogOpenIndexes/CatalogCloseIndexes would be called for each attribute.

DefineTSConfiguration() is less of an issue, still fine to change.
>
Ok.

AddRoleMems() should be equally measurable with a large DDL.  As a
> whole, this looks pretty sane to me and a good idea to move on with.
>
One filter, only.

For all these functions, the only case where the change would have no effect
is when a single tuple is modified, in which case there would be only one
CatalogOpenIndexes/CatalogCloseIndexes call on either path.
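
The pattern in question is basically the following, hoisting the index
open/close out of the per-tuple loop (a sketch only, not the actual patch):

    CatalogIndexState indstate;
    ListCell   *lc;

    indstate = CatalogOpenIndexes(rel);
    foreach(lc, vals)
    {
        HeapTuple   tup;

        /* build the tuple for this val, as the existing code already does */
        tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);

        CatalogTupleInsertWithInfo(rel, tup, indstate);
        heap_freetuple(tup);
    }
    CatalogCloseIndexes(indstate);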


> I still need to check properly the code paths changed here, of
> course..
>
At least, the patch still applies properly.

regards,
Ranier Vilela


Re: New docs chapter on Transaction Management and related changes

2022-11-10 Thread Simon Riggs
On Mon, 7 Nov 2022 at 22:04, Laurenz Albe  wrote:
>
> On Sat, 2022-11-05 at 10:08 +, Simon Riggs wrote:
> > Agreed; new compilation patch attached, including mine and then
> > Robert's suggested rewordings.
>
> Thanks.  There is clearly a lot of usefule information in this.
>
> Some comments:
>
> > --- a/doc/src/sgml/func.sgml
> > +++ b/doc/src/sgml/func.sgml
> > @@ -24673,7 +24673,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
> > 
> >  Returns the current transaction's ID.  It will assign a new one if 
> > the
> >  current transaction does not have one already (because it has not
> > -performed any database updates).
> > +performed any database updates);  see  > +linkend="transaction-id"/> for details.  If executed in a
> > +subtransaction this will return the top-level xid;  see  > +linkend="subxacts"/> for details.
> > 
> >
>
> I would use a comma after "subtransaction", and I think it would be better to 
> write
> "transaction ID" instead of "xid".
>
> > @@ -24690,6 +24693,7 @@ SELECT collation for ('foo' COLLATE "de_DE");
> >  ID is assigned yet.  (It's best to use this variant if the 
> > transaction
> >  might otherwise be read-only, to avoid unnecessary consumption of 
> > an
> >  XID.)
> > +If executed in a subtransaction this will return the top-level xid.
> > 
> >
>
> Same as above.
>
> > @@ -24733,6 +24737,8 @@ SELECT collation for ('foo' COLLATE "de_DE");
> > 
> >  Returns a current snapshot, a data structure
> >  showing which transaction IDs are now in-progress.
> > +Only top-level xids are included in the snapshot; subxids are not
> > +shown;  see  for details.
> > 
> >
>
> Again, I would avoid "xid" and "subxid", or at least use "transaction ID 
> (xid)"
> and similar.
>
> > --- a/doc/src/sgml/ref/release_savepoint.sgml
> > +++ b/doc/src/sgml/ref/release_savepoint.sgml
> > @@ -34,23 +34,16 @@ RELEASE [ SAVEPOINT ] 
> > savepoint_name
> >Description
> >
> >
> > -   RELEASE SAVEPOINT destroys a savepoint previously 
> > defined
> > -   in the current transaction.
> > +   RELEASE SAVEPOINT will subcommit the subtransaction
> > +   established by the named savepoint, if one exists. This will release
> > +   any resources held by the subtransaction. If there were any
> > +   subtransactions of the named savepoint, these will also be subcommitted.
> >
> >
> >
>
> "Subtransactions of the named savepoint" is somewhat confusing; how about
> "subtransactions of the subtransaction established by the named savepoint"?
>
> If that is too long and explicit, perhaps "subtransactions of that 
> subtransaction".
>
> > @@ -78,7 +71,7 @@ RELEASE [ SAVEPOINT ] 
> > savepoint_name
> >
> >
> > It is not possible to release a savepoint when the transaction is in
> > -   an aborted state.
> > +   an aborted state, to do that use .
> >
> >
> >
>
> I think the following is more English:
> "It is not possible ... state; to do that, use ."
>
> > --- a/doc/src/sgml/ref/rollback.sgml
> > +++ b/doc/src/sgml/ref/rollback.sgml
> > @@ -56,11 +56,14 @@ ROLLBACK [ WORK | TRANSACTION ] [ AND [ NO ] CHAIN ]
> >  AND CHAIN
> >  
> >   
> > -  If AND CHAIN is specified, a new transaction is
> > +  If AND CHAIN is specified, a new unaborted 
> > transaction is
> >immediately started with the same transaction characteristics (see 
> >  >linkend="sql-set-transaction"/>) as the just finished one.  
> > Otherwise,
> >no new transaction is started.
>
> I don't think that is an improvement.  "Unaborted" is an un-word.  A new 
> transaction
> is always "unaborted", isn't it?
>
> > --- a/doc/src/sgml/wal.sgml
> > +++ b/doc/src/sgml/wal.sgml
> > @@ -909,4 +910,36 @@
> > seem to be a problem in practice.
> >
> >   
> > +
> > + 
> > +
> > +  Two-Phase Transactions
> > +
> > +  
> > +   PostgreSQL supports a two-phase commit (2PC)
> [...]
> > +   pg_twophase directory. Currently-prepared
> > +   transactions can be inspected using  > +   
> > linkend="view-pg-prepared-xacts">pg_prepared_xacts.
> > +  
> > + 
> > +
> >  
>
> I don't like "currently-prepared".  How about:
> "Transaction that are currently prepared can be inspected..."
>
> This is clearly interesting information, but I don't think the WAL chapter is 
> the right
> place for this.  "pg_twophase" is already mentioned in "storage.sgml", and 
> details about
> when exactly a prepared transaction is persisted may exceed the details level 
> needed by
> the end user.
>
> I'd look for that information in the reference page for PREPARE TRANSACTION; 
> perhaps
> that would be a better place.  Or, even better, the new "xact.sgml" chapter.
>
> > --- /dev/null
> > +++ b/doc/src/sgml/xact.sgml
>
> +  Transaction Management
>
> +   The word transaction is often abbreviated as "xact".
>
> Should use  here.
>
> > +   Tra

Re: New docs chapter on Transaction Management and related changes

2022-11-10 Thread Alvaro Herrera
On 2022-Nov-10, Laurenz Albe wrote:

> On Wed, 2022-11-09 at 09:16 -0500, Robert Treat wrote:

> > > > -  If AND CHAIN is specified, a new transaction 
> > > > is
> > > > +  If AND CHAIN is specified, a new unaborted 
> > > > transaction is
> > > >    immediately started with the same transaction characteristics 
> > > > (see  > > >    linkend="sql-set-transaction"/>) as the just finished one.  
> > > > Otherwise,
> > > >    no new transaction is started.
> > > 
> > > I don't think that is an improvement.  "Unaborted" is an un-word.  A new 
> > > transaction
> > > is always "unaborted", isn't it?
> > 
> > I thought about this as well when reviewing it, but I do think
> > something is needed for the case where you have a transaction which
> > has suffered an error and then you issue "rollback and chain"; if you
> > just say "a new transaction is immediately started with the same
> > transaction characteristics" it might imply to some the new
> > transaction has some kind of carry over of the previous broken
> > transaction... the use of the word unaborted makes it clear that the
> > new transaction is 100% functional.
> 
> A new transaction is never aborted in my understanding.  Being aborted
> is not a characteristic of a transaction, but a state.

I agree, but maybe it's good to make the point explicit, because it
doesn't seem obvious.  Perhaps something like

"If X is specified, a new transaction (never in aborted state) is
immediately started with the same transaction characteristics (see X) as
the just finished one.  Otherwise ..."

Getting the wording of that parenthetical comment right is tricky, though.
What I propose above is not great, but I don't know how to make it
better.  Other ideas that seem slightly worse but may inspire someone:

  ... a new transaction (which is never in aborted state) is ...
  ... a new transaction (not in aborted state) is ...
  ... a new transaction (never aborted, even if the previous is) is ...
  ... a new (not-aborted) transaction is ...
  ... a new (never aborted) transaction is ...
  ... a new (never aborted, even if the previous is) transaction is ...
  ... a new (never aborted, regardless of the status of the previous one) 
transaction is ...


Maybe there's a way to reword the entire phrase that leads to a better
formulation of the idea.
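
For illustration only (a minimal session, not taken from the thread), the
behavioral point being discussed is that the chained transaction is fully
usable even though its predecessor failed:

    BEGIN;
    SELECT 1/0;          -- error: the current transaction is now aborted
    ROLLBACK AND CHAIN;  -- ends the failed transaction and immediately
                         -- starts a new one with the same characteristics
    SELECT 1;            -- succeeds: the new transaction is not aborted
    COMMIT;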

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: TAP output format in pg_regress

2022-11-10 Thread Nikolay Shaplov
Hi! Does this patch still need a reviewer, or has the reviewer field in the
commitfest been left empty by mistake?

I am a fan of TAP tests, so I am quite happy that the pg_regress output is
being changed to TAP...

I've checked whether the new output conforms to the TAP specification, and
verified that prove consumes the new pg_regress output well.

I did not find a quick way to include the prove TAP harness right into the
Makefile, so I checked the dumped output instead, but that is not really
important for now, I guess.

As for the code, I gave it a quick read-through... The main issue I stumbled
on was here:

--
 indent = 36 - (parallel ? 8 : 0) - floor(log10(testnumber) + 1);

status("ok %-5i %s%-*s %8.0f ms",
   testnumber, (parallel ? "" : ""), indent,
   testname,
   runtime);
--

This piece of code has non-obvious logic (I cannot work it out without taking
time for deep thinking), and almost identical code is repeated three times.
This is hard to maintain and an error-prone solution.

I would suggest adding a single _print_test_status function that also accepts
the status string, so that these complex calculations and the formatting are
done in only one place ("# SKIP (ignored)" should also be passed in as a %s,
so that there is only one print with the complex format string).

Then you will have to read, check and fix it only once.

And maybe it would be good to make these calculations more human-readable...
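
As a rough sketch of what such a helper might look like (the name, the
parallel-mode padding and the exact field widths are illustrative, not taken
from the patch; it assumes pg_regress's existing status() function):

    static void
    print_test_status(const char *status_str, int testnumber, bool parallel,
                      const char *testname, double runtime)
    {
        /* keep the runtime column aligned regardless of test number width */
        int     indent = 36 - (parallel ? 8 : 0)
                         - (int) floor(log10(testnumber) + 1);

        status("%s %-5i %s%-*s %8.0f ms",
               status_str, testnumber,
               parallel ? "        " : "", indent, testname, runtime);
    }

The three call sites would then only differ in the strings they pass in.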


If this patch really needs a reviewer and my way of thinking is acceptable,
please let me know; I will set myself as a reviewer and dig further into
the code.

-- 
Nikolay Shaplov aka Nataraj
Fuzzing Engineer at Postgres Professional
Matrix IM: @dhyan:nataraj.su




Re: Reviving lost replication slots

2022-11-10 Thread sirisha chamarthi
On Wed, Nov 9, 2022 at 12:32 AM Kyotaro Horiguchi 
wrote:

> I don't think walsenders fetching segment from archive is totally
> stupid. With that feature, we can use fast and expensive but small
> storage for pg_wal, while avoiding replication from dying even in
> emergency.
>

Thanks! If there is general agreement on this in this forum, I would like
to start working on this patch.


>
> At Tue, 8 Nov 2022 19:39:58 -0800, sirisha chamarthi <
> sirichamarth...@gmail.com> wrote in
> > > If it's a streaming replication slot, the standby will anyway jump to
> > > archive mode ignoring the replication slot and the slot will never be
> > > usable again unless somebody creates a new replication slot and
> > > provides it to the standby for reuse.
> > > If it's a logical replication slot, the subscriber will start to
> > > diverge from the publisher and the slot will have to be revived
> > > manually i.e. created again.
> > >
> >
> > Physical slots can be revived with standby downloading the WAL from the
> > archive directly. This patch is helpful for the logical slots.
>
> However, supposing that WalSndSegmentOpen() fetches segments from
> archive as the fallback and that succeeds, the slot can survive
> missing WAL in pg_wal in the first place. So this patch doesn't seem
> to be needed for the purpose.
>

Agree on this. If we add the proposed support, we don't need this patch.


>
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: Reviving lost replication slots

2022-11-10 Thread sirisha chamarthi
On Wed, Nov 9, 2022 at 2:37 AM Amit Kapila  wrote:

> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi
>  wrote:
> >
>  Is the intent of setting restart_lsn to InvalidXLogRecPtr was to
> disallow reviving the slot?
> >
>
> I think the intent is to compute the correct value for
> replicationSlotMinLSN as we use restart_lsn for it and using the
> invalidated slot's restart_lsn value for it doesn't make sense.
>

Correct. If a slot is invalidated (lost), shouldn't we ignore that slot when
computing the catalog_xmin?  I don't see it being set to InvalidTransactionId
in ReplicationSlotsComputeRequiredXmin(). Attached is a small patch to address
this; the output after the patch is shown below.
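
Presumably the check being described is along these lines (a hypothetical
sketch only, not the actual attached patch; it assumes an invalidated slot is
marked by an invalid restart_lsn, as discussed earlier in this thread):

    /* in ReplicationSlotsComputeRequiredXmin()'s loop over slots */
    if (!s->in_use)
        continue;

    /* skip slots that have been invalidated (lost) */
    if (XLogRecPtrIsInvalid(s->data.restart_lsn))
        continue;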

postgres=# select * from pg_replication_slots;
 slot_name |plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
---+---+---++--+---+++--+--+-+-++---+---
 s2| test_decoding | logical   |  5 | postgres | f | f
 ||  |  771 | 0/30466368  | 0/304663A0
 | reserved   |  28903824 | f
(1 row)

postgres=# create table t2(c int, c1 char(100));
CREATE TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname =
'pg_class';
 n_dead_tup

  2
(1 row)

postgres=# select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr |
client_hostname | client_port | backend_start | backend_xmin | state |
sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag |
replay_lag | sync_pri
ority | sync_state | reply_time
-+--+-+--+-+-+-+---+--+---+--+---+---++---+---++-
--++
(0 rows)

postgres=# insert into t1 select * from t1;
INSERT 0 2097152
postgres=# checkpoint;
CHECKPOINT
postgres=# select * from pg_replication_slots;
 slot_name |plugin | slot_type | datoid | database | temporary |
active | active_pid | xmin | catalog_xmin | restart_lsn |
confirmed_flush_lsn | wal_status | safe_wal_size | two_phase
---+---+---++--+---+++--+--+-+-++---+---
 s2| test_decoding | logical   |  5 | postgres | f | f
 ||  |  771 | | 0/304663A0
 | lost   |   | f
(1 row)

postgres=# vacuum pg_class;
VACUUM
postgres=# select n_dead_tup from pg_stat_all_tables where relname =
'pg_class';
 n_dead_tup

  0
(1 row)


>
> --
> With Regards,
> Amit Kapila.
>


Assertion failure in SnapBuildInitialSnapshot()

2022-11-10 Thread Amit Kapila
Hi,

Thomas has reported this failure in an email [1] and shared the
following links offlist with me:
https://cirrus-ci.com/task/5311549010083840
https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/testrun/build/testrun/subscription/100_bugs/log/100_bugs_twoways.log
https://api.cirrus-ci.com/v1/artifact/task/5311549010083840/crashlog/crashlog-postgres.exe_1c40_2022-11-08_00-20-28-110.txt

The call stack is as follows:
0063`4edff670 7ff7`1922fcdf postgres!ExceptionalCondition(
char * conditionName = 0x7ff7`198f8050
"TransactionIdPrecedesOrEquals(safeXid, snap->xmin)",
char * fileName = 0x7ff7`198f8020
"../src/backend/replication/logical/snapbuild.c",
int lineNumber = 0n600)+0x78 [c:\cirrus\src\backend\utils\error\assert.c @ 67]
0063`4edff6b0 7ff7`192106df postgres!SnapBuildInitialSnapshot(
struct SnapBuild * builder = 0x0251`5b95bce8)+0x20f
[c:\cirrus\src\backend\replication\logical\snapbuild.c @ 600]
0063`4edff730 7ff7`1920d9f6 postgres!CreateReplicationSlot(
struct CreateReplicationSlotCmd * cmd = 0x0251`5b94d828)+0x40f
[c:\cirrus\src\backend\replication\walsender.c @ 1152]
0063`4edff870 7ff7`192bc9c4 postgres!exec_replication_command(
char * cmd_string = 0x0251`5b94ac68 "CREATE_REPLICATION_SLOT
"pg_16400_sync_16392_7163433409941550636" LOGICAL pgoutput (SNAPSHOT
'use')")+0x4a6 [c:\cirrus\src\backend\replication\walsender.c @ 1804]


As per my investigation based on the above logs, the failed test is
due to the following command in 100_bugs.pl:
$node_twoways->safe_psql('d2',
"CREATE SUBSCRIPTION testsub CONNECTION \$\$"
  . $node_twoways->connstr('d1')
  . "\$\$ PUBLICATION testpub WITH (create_slot=false, "
  . "slot_name='testslot')");

It failed while creating the table sync slot.

The failure happens because the xmin computed by the snap builder is
less than what is computed by GetOldestSafeDecodingTransactionId()
during initial snapshot creation for the tablesync slot by
SnapBuildInitialSnapshot().
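
For reference, the failing check in SnapBuildInitialSnapshot() is roughly the
following (paraphrased from snapbuild.c; the surrounding code may differ
slightly):

    TransactionId safeXid;

    LWLockAcquire(ProcArrayLock, LW_SHARED);
    safeXid = GetOldestSafeDecodingTransactionId(false);
    LWLockRelease(ProcArrayLock);

    /* this is the assertion that fires in the reported crash */
    Assert(TransactionIdPrecedesOrEquals(safeXid, snap->xmin));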

To investigate, I tried to study how the values of "safeXid" and
"snap->xmin" are computed in SnapBuildInitialSnapshot(). There appear
to be four places in the code where we assign value to xmin
(builder-xmin) during the snapshot building process and then we assign
the same to snap->xmin. Those places are: (a) Two places in
SnapBuildFindSnapshot(), (b) One place in SnapBuildRestore(), and (c)
One place in SnapBuildProcessRunningXacts()

Looking at the logs, it appears to me that we find a consistent point from
the code below in SnapBuildFindSnapshot(), and the following line
assigns builder->xmin:

...
if (running->oldestRunningXid == running->nextXid)
{
...
builder->xmin = running->nextXid;

The reason is that we only see "logical decoding found consistent point at
..." in the logs. If SnapBuildFindSnapshot() had to go through the entire
machinery of snapshot building, we should have seen "logical decoding found
initial starting point at ..." and similar other log lines.
builder->xmin can't be assigned from either of the other places, (b) or (c).
It can't be assigned from (b) because we are building a full snapshot here,
which won't restore any serialized snapshot. It can't be assigned from (c)
because while creating a slot we stop as soon as we find the consistent
point; see DecodingContextFindStartpoint()->DecodingContextReady().

In the above code snippet, running->nextXid has been assigned from
ShmemVariableCache->nextXid in GetRunningTransactionData().

The safeXid computed from GetOldestSafeDecodingTransactionId() uses
ShmemVariableCache->nextXid as the starting point and keeps the slot's
xmin as the safe Xid limit.

It seems to me that, due to the SnapBuilder's initial_xmin_horizon, we won't
set the SnapBuilder's xmin to anything lower than the slot's effective_xmin
computed in CreateInitDecodingContext(); see SnapBuildFindSnapshot(). So,
ideally, SnapBuildInitialSnapshot() should never compute a safeXid (which is
based on the minimum of all slots' effective_xmin) that is greater than the
SnapBuilder's xmin (or the snapshot's xmin). In short, the code as written
seems correct to me.

I have also tried to analyze if any recent commit (7f13ac8123) could
cause this problem but can't think of any reason because the changes
are related to the restart of decoding and the failed test is related
to creating a slot the very first time.

I don't have any good ideas on how to proceed with this. Any thoughts
would be helpful.

Note: I have briefly discussed this issue with Sawada-San and
Kuroda-San, so kept them in Cc.

[1] - 
https://www.postgresql.org/message-id/CA%2BhUKG%2BA_LyW%3DFAOi2ebA9Vr-1%3Desu%2BeBSm0dsVhU%3DEgfpipkg%40mail.gmail.com

--
With Regards,
Amit Kapila.




Re: Printing backtrace of postgres processes

2022-11-10 Thread Bharath Rupireddy
On Mon, Apr 18, 2022 at 9:10 AM vignesh C  wrote:
>
> The attached v21 patch has the changes for the same.

Thanks for the patch. Here are some comments:

1. I think there's a fundamental problem with this patch, namely that it
prints the backtrace when the interrupt is processed, not when the
interrupt is received. This way, the backends/aux processes will
always show backtraces in their main loops or some processing loops
wherever CHECK_FOR_INTERRUPTS() is called [1]. What we need is
something more useful. I think the better way is to capture the
backtrace, i.e. call set_backtrace(), when the interrupt arrives, and
then, if we don't want to print it in the interrupt handler, perhaps
save it and print it in the next CFI() cycle. See a more realistic and
useful backtrace [2] from when I emitted the backtrace in the interrupt
handler.
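
Roughly, the suggested flow would look like this (a sketch only; every name
below is made up for illustration and is not from the patch):

    static volatile sig_atomic_t backtrace_pending = false;

    /* signal handler side: capture immediately, at the interrupted location */
    static void
    handle_log_backtrace_signal(void)
    {
        save_current_backtrace();      /* hypothetical helper */
        backtrace_pending = true;
    }

    /* CHECK_FOR_INTERRUPTS() side: only the printing is deferred */
    static void
    process_log_backtrace_interrupt(void)
    {
        if (backtrace_pending)
        {
            backtrace_pending = false;
            emit_saved_backtrace();    /* hypothetical helper */
        }
    }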

2.
+specified process ID.  This function can send the request to
+backends and auxiliary processes except the logger and statistics
+collector.  The backtraces will be logged at LOG
There's no statistics collector process any more. Please remove it.
Also remove 'the' before 'logger' to be in sync with
pg_log_backend_memory_contexts docs.

3.
+configuration set (See  for
Lowercase 'see' to be in sync with pg_log_backend_memory_contexts docs.

4.
+You can obtain the file name and line number from the logged
details by using
How about 'One can obtain'?

5.
+postgres=# select pg_log_backtrace(pg_backend_pid());
+For example:
+
+2021-01-27 11:33:50.247 IST [111735] LOG:  current backtrace:
A more realistic example would be better here; say, a walwriter's or
checkpointer's or some other process's backtrace would be good, instead of a
backend logging its own pg_log_backtrace() call stack.

6.
Don't we need the backtrace of walreceiver? I think it'll be good to
have it in ProcessWalRcvInterrupts().

7.
+errmsg_internal("logging backtrace of PID %d", MyProcPid));
+elog(LOG_SERVER_ONLY, "current backtrace:%s", errtrace);
Can we get rid of "current backtrace:%s" and have something similar to
ProcessLogMemoryContextInterrupt() like below?

errmsg("logging backtrace of PID %d", MyProcPid)));
errmsg("%s", errtrace);

[1]
2022-11-10 09:55:44.691 UTC [1346443] LOG:  logging backtrace of PID 1346443
2022-11-10 09:55:44.691 UTC [1346443] LOG:  current backtrace:
postgres: checkpointer (set_backtrace+0x46) [0x5640df9849c6]
postgres: checkpointer (ProcessLogBacktraceInterrupt+0x16)
[0x5640df7fd326]
postgres: checkpointer (CheckpointerMain+0x1a3) [0x5640df77f553]
postgres: checkpointer (AuxiliaryProcessMain+0xc9) [0x5640df77d5e9]
postgres: checkpointer (+0x436a9a) [0x5640df783a9a]
postgres: checkpointer (PostmasterMain+0xd57) [0x5640df7877e7]
postgres: checkpointer (main+0x20f) [0x5640df4a6f1f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f4b9fe9dd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f4b9fe9de40]
postgres: checkpointer (_start+0x25) [0x5640df4a7265]
2022-11-10 09:56:05.032 UTC [1346448] LOG:  logging backtrace of PID 1346448
2022-11-10 09:56:05.032 UTC [1346448] LOG:  current backtrace:
postgres: logical replication launcher (set_backtrace+0x46)
[0x5640df9849c6]
postgres: logical replication launcher
(ProcessLogBacktraceInterrupt+0x16) [0x5640df7fd326]
postgres: logical replication launcher
(ApplyLauncherMain+0x11b) [0x5640df7a253b]
postgres: logical replication launcher
(StartBackgroundWorker+0x220) [0x5640df77e270]
postgres: logical replication launcher (+0x4369f4) [0x5640df7839f4]
postgres: logical replication launcher (+0x43771d) [0x5640df78471d]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f4b9feb6520]
/lib/x86_64-linux-gnu/libc.so.6(__select+0xbd) [0x7f4b9ff8f74d]
postgres: logical replication launcher (+0x43894d) [0x5640df78594d]
postgres: logical replication launcher (PostmasterMain+0xcb5)
[0x5640df787745]
postgres: logical replication launcher (main+0x20f) [0x5640df4a6f1f]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f4b9fe9dd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f4b9fe9de40]
postgres: logical replication launcher (_start+0x25) [0x5640df4a7265]

[2]
2022-11-10 10:25:20.021 UTC [1351953] LOG:  current backtrace:
postgres: ubuntu postgres [local] INSERT(set_backtrace+0x46)
[0x55c60ae069b6]
postgres: ubuntu postgres [local]
INSERT(procsignal_sigusr1_handler+0x1d8) [0x55c60ac7f4f8]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f75a395c520]
/lib/x86_64-linux-gnu/libc.so.6(+0x91104) [0x7f75a39ab104]
/lib/x86_64-linux-gnu/libc.so.6(+0x9ccf8) [0x7f75a39b6cf8]
postgres: ubuntu postgres [local] INSERT(PGSemaphoreLock+0x22)
[0x55c60abfb1f2]
postgres: ubuntu postgres [local] INSERT(LWLockAcquire+0x174)
[0x55c60ac8f384]
postgres: ubuntu postgres [local] INSERT(+0x1fb77

Re: Making Vars outer-join aware

2022-11-10 Thread Richard Guo
On Thu, Aug 25, 2022 at 6:27 PM Richard Guo  wrote:

> I'm not sure if I understand it correctly. If we are given the first
> order from the parser, the SpecialJoinInfo for the B/C join would have
> min_lefthand as containing both B and the A/B join. And this
> SpecialJoinInfo would make the B/C join be invalid, which is not what we
> want.
>

Now I see how this works from the v6 patch.  Once we notice that identity 3
applies, we will remove the lower OJ's ojrelid from the min_lefthand or
min_righthand so that the commutation is allowed.  So in this case, the
A/B join would be removed from the B/C join's min_lefthand when we build
the SpecialJoinInfo for the B/C join, and that makes the B/C join legal.

BTW, inner_join_rels can contain base Relids and OJ Relids.  Maybe we
can revise the comments a bit for it atop deconstruct_recurse and
make_outerjoininfo.  The same for the comments of qualscope, ojscope and
outerjoin_nonnullable atop distribute_qual_to_rels.

The README mentions restriction_is_computable_at(); I think it should be
clause_is_computable_at().

Thanks
Richard


Re: WIN32 pg_import_system_collations

2022-11-10 Thread Juan José Santamaría Flecha
On Wed, Nov 9, 2022 at 12:02 AM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

> On Mon, Nov 7, 2022 at 4:08 PM Peter Eisentraut <
> peter.eisentr...@enterprisedb.com> wrote:
>
>>
>> This looks pretty good to me.  The refactoring of the non-Windows parts
>> makes sense.  The Windows parts look reasonable on manual inspection,
>> but again, I don't have access to Windows here, so someone else should
>> also look it over.
>>
>> I was going to say that at least it is getting tested on the CI, but I
> have found out that meson changes version(). That is fixed in this version.
>

It is currently failing due to [1], so maybe we can leave this patch on
hold until that's addressed.

[1]
https://www.postgresql.org/message-id/CAC%2BAXB1wJEqfKCuVcNpoH%3Dgxd61N%3D7c2fR3Ew6YRPpSfEUA%3DyQ%40mail.gmail.com


Regards,

Juan José Santamaría Flecha


Meson doesn't define HAVE_LOCALE_T for mscv

2022-11-10 Thread Juan José Santamaría Flecha
Hello all,

Meson doesn't see the redefinition of locale_t done
in src/include/port/win32_port.h, so it is not defining
HAVE_LOCALE_T, HAVE_WCSTOMBS_L or HAVE_MBSTOWCS_L as the
current src/tools/msvc/build.pl script does.

Please find attached a patch for this.

Regards,

Juan José Santamaría Flecha


0001-meson-define-HAVE_LOCALE_T-for-mscv.patch
Description: Binary data


Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-10 Thread Bharath Rupireddy
On Thu, Nov 10, 2022 at 1:31 AM David Christensen
 wrote:
>

Thanks for providing the v7 patch, please see my comments and responses below.

> > 2. I'm unable to understand the use-case for --fixup-fpi option.
> > pg_waldump is supposed to be just WAL reader, and must not return any
> > modified information, with --fixup-fpi option, the patch violates this
> > principle i.e. it sets page LSN and returns. Without actually
> > replaying the WAL record on the page, how is it correct to just set
> > the LSN? How will it be useful? ISTM, we must ignore this option
> > unless there's a strong use-case.
>
> How I was envisioning this was for cases like extreme surgery for
> corrupted pages, where you extract the page from WAL but it has lsn
> and checksum set so you could do something like `dd if=fixup-block
> of=relation ...`, so it *simulates* the replay of said fullpage blocks
> in cases where for some reason you can't play the intermediate
> records; since this is always a fullpage block, it's capturing what
> would be the snapshot so you could manually insert somewhere as needed
> without needing to replay (say if dealing with an incomplete or
> corrupted WAL stream).

Recovery sets the page LSN after it has replayed the WAL record on the
page, right? Recovery does this: base_page/FPI +
apply_WAL_record_and_then_set_applied_WAL_record's_LSN =
new_version_of_page. Essentially, in your patch, you are just setting
the WAL record's LSN while the page contents remain the base page's. I'm
still not sure what the real use-case is here. We don't have an
independent function in postgres that, given a base page and a WAL
record, just replays the WAL record and outputs the new version of the
page, so I think what you do in the patch with the fixup option seems
wrong to me.

> > 5.
> > +if (!RestoreBlockImage(record, block_id, page))
> > +continue;
> > +
> > +/* we have our extracted FPI, let's save it now */
> > After extracting the page from the WAL record, do we need to perform a
> > checksum on it?

I think you just need to do the following; this will ensure the
authenticity of the page that pg_waldump returns:

if (((PageHeader) page)->pd_checksum != pg_checksum_page((char *) page, blk))
{
    pg_fatal("page checksum failed");
}

> > case 'W':
> > config.save_fpw_path = pg_strdup(optarg);
> > case 'X':
> >config.fixup_fpw = true;
> >config.save_fpw_path = pg_strdup(optarg);
>
> Like separate opt processing with their own `break` statement?
> Probably a bit more readable/conventional.

Yes.

Some more comments:

1.
+PGAlignedBlock zerobuf;
Essentially, it's not a zero buffer; please rename the variable to
something like 'buf' or 'page_buf' or some other name.

2.
+if (pg_pwrite(fileno(OPF), page, BLCKSZ, 0) != BLCKSZ)
Replace pg_pwrite() with fwrite() and avoid the fileno() calls; that should
suffice here. AFAICS, we don't need pg_pwrite().
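
That is, something along these lines (illustrative only; error handling
simplified):

    if (fwrite(page, BLCKSZ, 1, OPF) != 1)
        pg_fatal("could not write full page image file \"%s\": %m", filename);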

3.
+if (config.save_fpw_path != NULL)
+{
+/* Create the dir if it doesn't exist */
+if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
I think  you still need pg_check_dir() here, how about something like below?

if (pg_check_dir(config.save_fpw_path) == 0)
{
    if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
    {
        /* error */
    }
}

4.
+/* Create the dir if it doesn't exist */
+if (pg_mkdir_p(config.save_fpw_path, pg_dir_create_mode) < 0)
+{
+pg_log_error("could not create output directory \"%s\": %m",
+ config.save_fpw_path);
+goto bad_argument;
Why is the directory creation error a bad_argument? I think you need
just pg_fatal() here.

5.
+fsync(fileno(OPF));
+fclose(OPF);
I think just the fsync() isn't enough; you still need fsync_fname() and/or
fsync_parent_path(), perhaps after the for (block_id = 0; block_id <=
XLogRecMaxBlockId(record); block_id++) loop, e.g. as sketched below.
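
A rough sketch (assuming the frontend fsync_fname()/fsync_parent_path()
helpers declared in src/common/file_utils.h; error handling omitted here):

    /* flush the FPI file itself, then its containing directory entry */
    (void) fsync_fname(filename, false);
    (void) fsync_parent_path(filename);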

6. Speaking of which, do we need to make the fsync()s optional? If we
were to write many such FPI files, aren't there going to be many more
fsync() calls? Imagine this feature being used on a production server:
a lot of fsync() calls would definitely make the running server's
fsync() operations slower.  I think we need a new option controlling
whether pg_waldump ever does fsync() or not, something similar to the
--no-sync option of pg_receivewal/pg_upgrade/pg_dump/initdb/pg_checksums
etc. I would like it if pg_waldump's --no-sync were coded as 0001 so
that 0002 can make use of it.

7.
+pg_fatal("couldn't write out complete fullpage image to
file: %s", filename);
We typically use "full page image" in the output strings, please correct.

8.
+
+if (((PageHeader) page)->pd_checksum)
+((PageHeader) page)->pd_checksum =
pg_checksum_page((char *) page, blk);
Why do you need to set the page's checksum yourself? I don't think
this is the right way; pg_waldump should just return what it sees in
the WAL record, of course, after verifying a few checks (like chec

Re: Unit tests for SLRU

2022-11-10 Thread Michael Paquier
On Thu, Nov 10, 2022 at 04:01:02PM +0900, Michael Paquier wrote:
> I have looked at what you have here..

The comment at the top of SimpleLruInit() for sync_handler has been
fixed as of 5ca3645, and backpatched.  Nice catch.
--
Michael




Re: Avoid overhead open-close indexes (catalog updates)

2022-11-10 Thread Michael Paquier
On Thu, Sep 01, 2022 at 08:42:15AM -0300, Ranier Vilela wrote:
> Let's wait for the patch to be accepted and committed, so we can try to
> change it.

FWIW, I think that this switch is a good idea for cases where we
potentially update a bunch of tuples, especially based on what
CatalogTupleInsert() tells in its top comment.  Each code path updated
here needs a performance check to see if that's noticeable enough, but
I can get behind the one of CopyStatistics(), at least.

EnumValuesCreate() would matter less as this would require a large set
of values in an enum, but perhaps ORMs would care and that should be
measurable.  update_attstats() should lead to a measurable difference
with a relation that has a bunch of attributes with few tuples.
DefineTSConfiguration() is less of an issue, still fine to change.
AddRoleMems() should be equally measurable with a large DDL.  As a
whole, this looks pretty sane to me and a good idea to move on with.

I still need to properly check the code paths changed here, of
course...
-- 
Michael

