Re: Add proper planner support for ORDER BY / DISTINCT aggregates

2022-12-12 Thread David Rowley
On Wed, 9 Nov 2022 at 14:58, David Rowley  wrote:
> v2 attached.

I've been looking at this again and this time around I understand why
the * 1.5 pessimism factor was included in the incremental sort code.

If we create a table with a very large skew in the number of rows
across what will be our pre-sorted groups:

create table ab (a int not null, b int not null);
insert into ab select 0,x from generate_Series(1,999000)x union all
select x%1000+1,0 from generate_Series(999001,1000000)x;

Here the 0 group has close to 1 million rows, but the remaining groups
1-1000 have just 1 row each. The planner only knows there are about
1001 distinct values in "a" and assumes an even distribution of rows
between those values.

With:
explain (analyze, timing off) select * from ab order by a,b;

In master, the plan is:

                                QUERY PLAN
--------------------------------------------------------------------------
 Sort  (cost=122490.27..124990.27 rows=1000000 width=8) (actual
rows=1000000 loops=1)
   Sort Key: a, b
   Sort Method: quicksort  Memory: 55827kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42
rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.069 ms
 Execution Time: 155.469 ms

With the v2 patch it's:
                                QUERY PLAN
--------------------------------------------------------------------------
 Incremental Sort  (cost=2767.38..109344.55 rows=1000000 width=8)
(actual rows=1000000 loops=1)
   Sort Key: a, b
   Presorted Key: a
   Full-sort Groups: 33  Sort Method: quicksort  Average Memory: 27kB
Peak Memory: 27kB
   Pre-sorted Groups: 1  Sort Method: quicksort  Average Memory:
55795kB  Peak Memory: 55795kB
   ->  Index Scan using ab_a_idx on ab  (cost=0.42..22832.42
rows=1000000 width=8) (actual rows=1000000 loops=1)
 Planning Time: 0.072 ms
 Execution Time: 163.614 ms

So there is a performance regression.

Sometimes teaching the planner new tricks means that it might use
those tricks at a bad time.  Normally we put in an off switch for
these situations to allow users an escape hatch. We have
enable_incremental_sort for this.  It seems like incremental sort has
tried to avoid this problem by always considering the same "Sort"
paths that we did prior to incremental sort, and also considers
incremental sort for pre-sorted paths with the 1.5 pessimism factor.
The v2 patch takes away that safety net.

I think what we need to do is:  Do our best to give incremental sort
the most realistic costs we can and accept that it might choose a
worse plan in some cases. Users can turn it off if they really have no
other means to convince the planner it's wrong.

Additionally, I think we also need to add a GUC such as
enable_presorted_aggregate.  People can use that when their Index Scan
-> Incremental Sort -> Aggregate plan is worse than their previous Seq
Scan -> Sort -> Aggregate plan that they were getting in < 16.
Turning off enable_incremental_sort alone won't give them the same
aggregate plan that they had in pg15 as we always set the
query_pathkeys to request a sort order that will suit the order by /
distinct aggregates.

I'll draft up a patch for the enable_presorted_aggregate.
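
As a purely illustrative sketch (not the patch itself), such a setting
would presumably end up as a ConfigureNamesBool[] entry in guc_tables.c
along roughly these lines, with the name, description and grouping still
open for debate:

/* Hypothetical sketch only: backing variable plus the boolean GUC table
 * entry, following the shape of existing entries such as
 * enable_incremental_sort.  This fragment belongs inside
 * ConfigureNamesBool[] in src/backend/utils/misc/guc_tables.c. */
bool		enable_presorted_aggregate = true;

	{
		{"enable_presorted_aggregate", PGC_USERSET, QUERY_TUNING_METHOD,
			gettext_noop("Enables the planner's ability to produce paths that "
						 "provide presorted input for ORDER BY / DISTINCT "
						 "aggregate functions."),
			NULL,
			GUC_EXPLAIN
		},
		&enable_presorted_aggregate,
		true,
		NULL, NULL, NULL
	},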

David




Use get_call_result_type() more widely

2022-12-12 Thread Bharath Rupireddy
Hi,

A review comment in another thread [1] by Michael Paquier about using
get_call_result_type() instead of explicitly building the TupleDesc made
me think about using it more widely. get_call_result_type() looks at the
function's definition to figure out the column names and build the
required TupleDesc, which avoids duplicating the column names between
pg_proc.dat/function definitions and the source code. It also saves a
good number of LOC, ~415 [2], and the size of all the object files put
together is reduced by about 4kB, which means the postgres binary
becomes leaner by about 4kB [3]. I'm attaching a patch for these
changes.

While on this, I observed that BlessTupleDesc() is called in many
(~12) places right after get_call_result_type(), which already does
the job of BlessTupleDesc() before returning the TupleDesc. I think we
can get rid of those redundant BlessTupleDesc() calls. I'm attaching a
patch for these changes too.
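
To make the pattern concrete, here is a minimal sketch (a made-up
function, not one taken from the patch) of the style being moved
towards: get_call_result_type() derives the TupleDesc from the
function's catalog definition, and since it returns an already-blessed
descriptor, no separate BlessTupleDesc() call is needed:

#include "postgres.h"
#include "access/htup_details.h"
#include "funcapi.h"
#include "utils/builtins.h"

PG_FUNCTION_INFO_V1(demo_two_columns);

Datum
demo_two_columns(PG_FUNCTION_ARGS)
{
	TupleDesc	tupdesc;
	Datum		values[2];
	bool		nulls[2] = {false, false};
	HeapTuple	tuple;

	/* Build the result TupleDesc from the function's pg_proc definition. */
	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	/* No BlessTupleDesc(tupdesc) here: get_call_result_type() blessed it. */
	values[0] = Int32GetDatum(42);
	values[1] = CStringGetTextDatum("forty-two");

	tuple = heap_form_tuple(tupdesc, values, nulls);
	PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}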

The cirrus-ci runs are happy with these patches; please see
https://github.com/BRupireddy/postgres/tree/use_get_call_result_type()_more_widely_v1.

Thoughts?

[1] https://www.postgresql.org/message-id/Y41De5NnF2sxmJPI%40paquier.xyz

[2] 21 files changed, 97 insertions(+), 514 deletions(-)

[3] Source code is built with CFLAGS = -O3.
PATCHED:
   text    data     bss       dec     hex  filename
   1043       0       0      1043     413  contrib/old_snapshot/time_mapping.o
   7192       0       0      7192    1c18  contrib/pg_visibility/pg_visibility.o
   7144       0     120      7264    1c60  src/backend/access/transam/commit_ts.o
  19681      24     248     19953    4df1  src/backend/access/transam/multixact.o
  20595       0      88     20683    50cb  src/backend/access/transam/twophase.o
   6162       0      24      6186    182a  src/backend/access/transam/xlogfuncs.o
  45540    2736       8     48284    bc9c  src/backend/catalog/objectaddress.o
   9943       0       0      9943    26d7  src/backend/catalog/pg_publication.o
  18239       0      16     18255    474f  src/backend/commands/sequence.o
   6429       0       0      6429    191d  src/backend/tsearch/wparser.o
  47049    1840      52     48941    bf2d  src/backend/utils/adt/acl.o
  43066     168     784     44018    abf2  src/backend/utils/adt/datetime.o
   6843       0       0      6843    1abb  src/backend/utils/adt/genfile.o
   6904     120       0      7024    1b70  src/backend/utils/adt/lockfuncs.o
  10512    7008       0     17520    4470  src/backend/utils/adt/misc.o
   1569       0       0      1569     621  src/backend/utils/adt/partitionfuncs.o
  16266       0       0     16266    3f8a  src/backend/utils/adt/pgstatfuncs.o
  40985       0       0     40985    a019  src/backend/utils/adt/tsvector_op.o
   8322       0       0      8322    2082  src/backend/utils/misc/guc_funcs.o
   2109       0       0      2109     83d  src/backend/utils/misc/pg_controldata.o
   2354       0       0      2354     932  src/test/modules/test_predtest/test_predtest.o
9586047  226936  205536  10018519  98ded7  src/backend/postgres

HEAD:
   text    data     bss       dec     hex  filename
   1019       0       0      1019     3fb  contrib/old_snapshot/time_mapping.o
   7159       0       0      7159    1bf7  contrib/pg_visibility/pg_visibility.o
   6655       0     120      6775    1a77  src/backend/access/transam/commit_ts.o
  19636      24     248     19908    4dc4  src/backend/access/transam/multixact.o
  20663       0      88     20751    510f  src/backend/access/transam/twophase.o
   6206       0      24      6230    1856  src/backend/access/transam/xlogfuncs.o
  45700    2736       8     48444    bd3c  src/backend/catalog/objectaddress.o
   9952       0       0      9952    26e0  src/backend/catalog/pg_publication.o
  18487       0      16     18503    4847  src/backend/commands/sequence.o
   6143       0       0      6143    17ff  src/backend/tsearch/wparser.o
  47123    1840      52     49015    bf77  src/backend/utils/adt/acl.o
  43099     168     784     44051    ac13  src/backend/utils/adt/datetime.o
   7016       0       0      7016    1b68  src/backend/utils/adt/genfile.o
   7413     120       0      7533    1d6d  src/backend/utils/adt/lockfuncs.o
  10698    7008       0     17706    452a  src/backend/utils/adt/misc.o
   1593       0       0      1593     639  src/backend/utils/adt/partitionfuncs.o
  17194       0       0     17194    432a  src/backend/utils/adt/pgstatfuncs.o
  40798       0       0     40798    9f5e  src/backend/utils/adt/tsvector_op.o
   8871       0       0      8871    22a7  src/backend/utils/misc/guc_funcs.o
   3918       0       0      3918     f4e  src/backend/utils/misc/pg_controldata.o
   2636       0       0      2636     a4c  src/test/modules/test_predtest/test_predtest.o
9589943  226936  205536  10022415  98ee0f  src/backend/postgres

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From e1df78cc86e3d38a2814d6cd89f6b86a8de4a284 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Tue, 13 Dec 2022 07:04:22 +
Subject: 

Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Julien Rouhaud
On Tue, Dec 13, 2022 at 5:09 AM Dilip Kumar  wrote:
>
> On Mon, Dec 12, 2022 at 11:21 PM Robert Haas  wrote:
> >
> > On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby  wrote:
> > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > > index 4efa1d5fca0..ac15e2ce789 100644
> > > --- a/doc/src/sgml/monitoring.sgml
> > > +++ b/doc/src/sgml/monitoring.sgml
> > > @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
> > >  record
> > > 
> > > 
> > > -Returns a record of information about the backend's 
> > > subtransactions.
> > > -The fields returned are subxact_count 
> > > identifies
> > > -number of active subtransaction and subxact_overflow
> > > - shows whether the backend's subtransaction cache is
> > > -overflowed or not.
> > > -   
> > > +Returns a record of information about the subtransactions of the 
> > > backend
> > > +with the specified ID.
> > > +The fields returned are subxact_count, 
> > > which
> > > +identifies the number of active subtransaction and
> > > +subxact_overflow, which shows whether the
> > > +backend's subtransaction cache is overflowed or not.
> > > 
> > >
> >
> > Makes sense.
>
> +1

+1




Re: refactor ExecGrant_*() functions

2022-12-12 Thread Peter Eisentraut

On 12.12.22 10:44, Antonin Houska wrote:

Peter Eisentraut  wrote:


On 06.12.22 09:41, Antonin Houska wrote:

Attached are my proposals for improvements. One is to avoid memory leak, the
other tries to improve readability a little bit.


I added the readability improvement to my v2 patch.  The pfree() calls aren't
necessary AFAICT.


It's something to consider, but since this is a refactoring patch and 
the old code didn't do it either, I think it's out of scope.



I see that memory contexts exist and that the amount of memory freed is not
huge, but my style is to free the memory explicitly if it's allocated in a
loop.

v2 looks good to me.


Committed, thanks.





Re: Force streaming every change in logical decoding

2022-12-12 Thread Peter Smith
On Tue, Dec 13, 2022 at 2:33 PM Peter Smith  wrote:
>
> On Tue, Dec 6, 2022 at 5:23 PM shiy.f...@fujitsu.com
>  wrote:
> >
> > Hi hackers,
> >
> > In logical decoding, when logical_decoding_work_mem is exceeded, the 
> > changes are
> > sent to output plugin in streaming mode. But there is a restriction that the
> > minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC to
> > allow sending every change to output plugin without waiting until
> > logical_decoding_work_mem is exceeded.
> >
> > This helps to test streaming mode. For example, to test "Avoid streaming the
> > transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> > messages. With the new option, it can be tested with fewer changes and in 
> > less
> > time. Also, this new option helps to test more scenarios for "Perform 
> > streaming
> > logical transactions by background workers" [2].
> >
> > [1] 
> > https://www.postgresql.org/message-id/CAFiTN-tHK=7lzfrps8fbt2ksrojgqbzywcgxst2bm9-rjja...@mail.gmail.com
> > [2] 
> > https://www.postgresql.org/message-id/flat/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com
> >
>
> Hi, I've been doing some testing that makes use of this new developer
> GUC `force_stream_mode`.
>
> IIUC this GUC is used by the walsender during the logic of the
> ReorderBufferCheckMemoryLimit(). Also, AFAIK the only way that the
> walsender is going to know this GUC value is by inheritance from the
> parent publisher at the time the walsender process gets launched.
>
> I may be overthinking this. Isn't there potential for this to become
> quite confusing depending on the timing of when this GUC is modified?
>
> E.g.1 When the walsender is launched, it will use whatever is the
> current value of this GUC.
> E.g.2 But if the GUC is changed after the walsender is already
> launched, then that will have no effect on the already running
> walsender.
>
> Is that understanding correct?
>

I think I was mistaken above. It looks like even the already-launched
walsender gets the updated GUC value via a SIGHUP on the parent
publisher.

2022-12-13 16:31:33.453 AEDT [1902] LOG:  received SIGHUP, reloading
configuration files
2022-12-13 16:31:33.455 AEDT [1902] LOG:  parameter
"force_stream_mode" changed to "true"

Sorry for the noise.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: [PoC] Federated Authn/z with OAUTHBEARER

2022-12-12 Thread Andrey Chudnovsky
> The server hook is the right place to check the scopes, yes, but I think
> the DBA should be able to specify what those scopes are to begin with.
> The provider of the extension shouldn't be expected by the architecture
> to hardcode those decisions, even if Azure AD chooses to short-circuit
> that choice and provide magic instead.

Hardcoding is definitely not expected, but identity-provider-specific
customization, I think, should be allowed.
I can provide a couple of advanced use cases which happen in the cloud
deployments world, and require per-role management:
- Multi-tenant deployments, when root provider URL would be different
for different roles, based on which tenant they come from.
- Federation to multiple providers. Solutions like Amazon Cognito
which offer a layer of abstraction with several providers
transparently supported.

If your concern is the extension not honoring the DBA-configured values:
would server-side logic that prefers the HBA value over the
extension-provided one resolve this concern?
We are definitely biased towards the cloud deployment scenarios, where
direct access to .hba files is usually not offered at all.
Let's find the middle ground here.

A separate reason for creating this pre-authentication hook is further
extensibility to support more metadata.
Specifically when we add support for OAUTH flows to libpq, server-side
extensions can help bridge the gap between the identity provider
implementation and OAUTH/OIDC specs.
For example, that could allow the Github extension to provide an OIDC
discovery document.

I definitely see identity providers as institutional actors here which
can be given some power through the extension hooks to customize the
behavior within the framework.

> I maintain that the hook doesn't need to hand back artifacts to the
> client for a second PQconnect call. It can just use those artifacts to
> obtain the access token and hand that right back to libpq. (I think any
> requirement that clients be rewritten to call PQconnect twice will
> probably be a sticking point for adoption of an OAuth patch.)

Obtaining a token is an asynchronous process with a human in the loop.
Not sure if expecting a hook function to return a token synchronously
is the best option here.
Can that be an optional return value of the hook in cases when a token
can be obtained synchronously?
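
As a purely hypothetical illustration of that question (none of these
names exist in libpq today), the hook's contract could allow either
outcome:

#include "libpq-fe.h"

/*
 * Hypothetical sketch only: a client-side token hook that may either hand
 * back an access token synchronously, or report that a human-in-the-loop
 * flow has been started and the token will arrive later.
 */
typedef enum PGAuthHookResult
{
	PGAUTH_TOKEN_READY,			/* *access_token was filled in synchronously */
	PGAUTH_FLOW_IN_PROGRESS		/* asynchronous flow started; token comes later */
} PGAuthHookResult;

typedef PGAuthHookResult (*PGAuthTokenHook) (PGconn *conn,
											 const char *oidc_discovery_doc,
											 char **access_token);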

On Thu, Dec 8, 2022 at 4:41 PM Jacob Champion  wrote:
>
> On Wed, Dec 7, 2022 at 3:22 PM Andrey Chudnovsky
>  wrote:
> >
> >> I think it's okay to have the extension and HBA collaborate to
> >> provide discovery information. Your proposal goes further than
> >> that, though, and makes the server aware of the chosen client flow.
> >> That appears to be an architectural violation: why does an OAuth
> >> resource server need to know the client flow at all?
> >
> > Ok. It may have left there from intermediate iterations. We did
> > consider making extension drive the flow for specific grant_type,
> > but decided against that idea. For the same reason you point to. Is
> > it correct that your main concern about use of grant_type was that
> > it's propagated to the server? Then yes, we will remove sending it
> > to the server.
>
> Okay. Yes, that was my primary concern.
>
> >> Ideally, yes, but that only works if all identity providers
> >> implement the same flows in compatible ways. We're already seeing
> >> instances where that's not the case and we'll necessarily have to
> >> deal with that up front.
> >
> > Yes, based on our analysis OIDC spec is detailed enough, that
> > providers implementing that one, can be supported with generic code
> > in libpq / client. Github specifically won't fit there though.
> > Microsoft Azure AD, Google, Okta (including Auth0) will.
> > Theoretically discovery documents can be returned from the extension
> > (server-side) which is provider specific. Though we didn't plan to
> > prioritize that.
>
> As another example, Google's device authorization grant is incompatible
> with the spec (which they co-authored). I want to say I had problems
> with Azure AD not following that spec either, but I don't remember
> exactly what they were. I wouldn't be surprised to find more tiny
> departures once we get deeper into implementation.
>
> >> That seems to be restating the goal of OAuth and OIDC. Can you
> >> explain how the incompatible change allows you to accomplish this
> >> better than standard implementations?
> >
> > Do you refer to passing grant_type to the server? Which we will get
> > rid of in the next iteration. Or other incompatible changes as well?
>
> Just the grant type, yeah.
>
> >> Why? I claim that standard OAUTHBEARER can handle all of that.
> >> What does your proposed architecture (the third diagram) enable
> >> that my proposed hook (the second diagram) doesn't?
> >
> > The hook proposed on the 2nd diagram effectively delegates all Oauth
> > flows implementations to the client. We propose libpq takes care of
> > pulling OpenId discovery and coordination. Which 

Remove SHA256_HMAC_B from scram-common.h

2022-12-12 Thread Michael Paquier
Hi all,

While doing some hackery on SCRAM, I have noticed $subject giving the
attached.  I guess that this is not going to cause any objections, but
feel free to comment just in case.

Thanks,
--
Michael
diff --git a/src/include/common/scram-common.h b/src/include/common/scram-common.h
index e1f5e786e0..4acf2a78ad 100644
--- a/src/include/common/scram-common.h
+++ b/src/include/common/scram-common.h
@@ -23,9 +23,6 @@
 /* Length of SCRAM keys (client and server) */
 #define SCRAM_KEY_LEN				PG_SHA256_DIGEST_LENGTH
 
-/* length of HMAC */
-#define SHA256_HMAC_B				PG_SHA256_BLOCK_LENGTH
-
 /*
  * Size of random nonce generated in the authentication exchange.  This
  * is in "raw" number of bytes, the actual nonces sent over the wire are




RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Takamichi Osumi (Fujitsu)
On Tuesday, December 13, 2022 1:27 PM Kyotaro Horiguchi 
 wrote:
> At Tue, 13 Dec 2022 02:28:49 +, "Takamichi Osumi (Fujitsu)"
>  wrote in
> > On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi
>  wrote:
> >
> > We couldn't reproduce this failure and find the same type of failure
> > on the cfbot from the past failures.
> > It seems no subtests run in your environment.
> 
> Very sorry for that. The test script is found to be a left-over file in a 
> git-reset'ed
> working tree. Please forget about it.
> 
> FWIW, the latest patch passed make-world for me on Rocky8/x86_64.
Hi,


No problem at all.
Also, thank you for your testing and confirming the latest one!


Best Regards,
Takamichi Osumi





Re: Progress report of CREATE INDEX for nested partitioned tables

2022-12-12 Thread Justin Pryzby
On Mon, Dec 12, 2022 at 11:39:23PM +0400, Ilya Gladyshev wrote:
> 
> > Could you check what I've written as a counter-proposal ?
> 
> I think that this might be a good solution to start with, it gives us the 
> opportunity to improve the granularity later without any surprising changes 
> for the end user. We could use this patch for previous versions and make more 
> granular output in the latest. What do you think?

Somehow, it hadn't occurred to me that my patch "lost granularity" by
incrementing the progress bar by more than one...  Shoot.

> I actually think that the progress view would be better off without the total 
> number of partitions, 

Just curious - why ?

> With this in mind, I think your proposal to exclude catalog-only indexes 
> sounds reasonable to me, but I feel like the docs are off in this case, 
> because the attached indexes are not created, but we pretend like they are in 
> this metric, so we should fix one or the other.

I agree that the docs should indicate whether we're counting "all
partitions", "direct partitions", and whether or not that includes
partitioned partitions, or just leaf partitions.

I have another proposal: since the original patch 3.5 years ago didn't
consider or account for sub-partitions, let's not start counting them
now.  It was never defined whether they were included or not (and I
guess that they're not common) so we can take this opportunity to
clarify the definition.

Alternately, if it's okay to add nparts_done to the IndexStmt, then
that's easy.

-- 
Justin
>From 2e93bd37ca3c8add1f8e3e44a9f3906c332b83f2 Mon Sep 17 00:00:00 2001
From: Ilya Gladyshev 
Date: Fri, 9 Dec 2022 23:17:29 +0400
Subject: [PATCH] report top parent progress for CREATE INDEX

! add asserts, avoid global var
! do not count intermediate/nested/sub-partitions
---
 src/backend/commands/indexcmds.c  | 30 ++-
 src/backend/utils/activity/backend_progress.c | 79 +++
 src/include/nodes/parsenodes.h|  2 +
 3 files changed, 107 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b5b860c3abf..6e6bba9d3a9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1218,8 +1218,28 @@ DefineIndex(Oid relationId,
 			Relation	parentIndex;
 			TupleDesc	parentDesc;
 
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
-		 nparts);
+			if (!OidIsValid(parentIndexId))
+			{
+				int			nleaves = 0;
+				List	   *childs;
+				ListCell   *lc;
+
+				/*
+				 * Count the number of physical children, excluding foreign
+				 * tables, intermediate partitioned tables, as well as the
+				 * partitioned index itself.
+				 */
+				childs = find_all_inheritors(relationId, NoLock, NULL);
+				foreach(lc, childs)
+				{
+					Oid			partrelid = lfirst_oid(lc);
+					if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid)))
+						nleaves++;
+				}
+
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
+											 nleaves);
+			}
 
 			/* Make a local copy of partdesc->oids[], just for safety */
 			memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
@@ -1431,8 +1451,10 @@ DefineIndex(Oid relationId,
 		   child_save_sec_context);
 }
 
-pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
-			 i + 1);
+if (RELKIND_HAS_STORAGE(get_rel_relkind(childRelid)))
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_DONE,
+ ++stmt->nparts_done);
+
 free_attrmap(attmap);
 			}
 
diff --git a/src/backend/utils/activity/backend_progress.c b/src/backend/utils/activity/backend_progress.c
index f29199725b7..c661ad94782 100644
--- a/src/backend/utils/activity/backend_progress.c
+++ b/src/backend/utils/activity/backend_progress.c
@@ -10,6 +10,7 @@
  */
 #include "postgres.h"
 
+#include "commands/progress.h"
 #include "port/atomics.h"		/* for memory barriers */
 #include "utils/backend_progress.h"
 #include "utils/backend_status.h"
@@ -37,6 +38,82 @@ pgstat_progress_start_command(ProgressCommandType cmdtype, Oid relid)
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
+static void
+pgstat_progress_asserts(void)
+{
+	volatile PgBackendStatus *beentry = MyBEEntry;
+	volatile int64		*a = beentry->st_progress_param;
+
+	/*
+	 * If the command fails due to interrupt or error, the values may be
+	 * less than rather than equal to expected, final value.
+	 */
+
+	switch (beentry->st_progress_command)
+	{
+	case PROGRESS_COMMAND_VACUUM:
+		Assert(a[PROGRESS_VACUUM_HEAP_BLKS_SCANNED] <=
+a[PROGRESS_VACUUM_TOTAL_HEAP_BLKS]);
+		Assert(a[PROGRESS_VACUUM_HEAP_BLKS_VACUUMED] <=
+a[PROGRESS_VACUUM_TOTAL_HEAP_BLKS]);
+		Assert(a[PROGRESS_VACUUM_NUM_DEAD_TUPLES] <=
+a[PROGRESS_VACUUM_MAX_DEAD_TUPLES]);
+		break;
+
+	case PROGRESS_COMMAND_ANALYZE:
+		Assert(a[PROGRESS_ANALYZE_BLOCKS_DONE] <=
+a[PROGRESS_ANALYZE_BLOCKS_TOTAL]);
+		/* Extended stats can be skipped */
+		

Re: Add index scan progress to pg_stat_progress_vacuum

2022-12-12 Thread Imseih (AWS), Sami
Thanks for the feedback. I agree with the feedback, except
for 

>need to have ParallelVacuumProgress. I see
>parallel_vacuum_update_progress() uses this value but I think it's
>better to pass ParallelVacuumState to via IndexVacuumInfo.

I was trying to avoid passing a pointer to
ParallelVacuumState in IndexVacuumInfo.

ParallelVacuumProgress is implemented in the same
way as VacuumSharedCostBalance and 
VacuumActiveNWorkers. See vacuum.h

These values are set at the start of a parallel vacuum cycle
and reset at the end of an index vacuum cycle.

This seems like a better approach and less invasive.
What would be a reason not to go with this approach?
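
For reference, the pattern looks roughly like the sketch below
(ParallelVacuumProgress mirrors how VacuumSharedCostBalance and
VacuumActiveNWorkers are exposed via vacuum.h; the two helper function
names are invented purely for illustration):

#include "postgres.h"
#include "commands/vacuum.h"

/* Exposed via vacuum.h, so progress-reporting code can reach it without
 * threading a ParallelVacuumState pointer through IndexVacuumInfo. */
ParallelVacuumState *ParallelVacuumProgress = NULL;

/* Invented name: set the pointer when a parallel index vacuum cycle begins. */
static void
begin_parallel_index_vacuum_cycle(ParallelVacuumState *pvs)
{
	ParallelVacuumProgress = pvs;
}

/* Invented name: clear it again once the index vacuum cycle completes. */
static void
end_parallel_index_vacuum_cycle(void)
{
	ParallelVacuumProgress = NULL;
}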


> parallel_vacuum_update_progress() is typically called every 1GB so I
> think we don't need to worry about unnecessary update. Also, I think
> this code doesn't work when pgstat_track_activities is false. Instead,
> I think that in parallel_wait_for_workers_to_finish(), we can check
> the value of pvs->nindexes_completed and update the progress if there
> is an update or it's first time.

I agree that we don't need to worry about unnecessary updates
in parallel_vacuum_update_progress since it is only called
every 1GB. I also don't think we should do anything additional
in parallel_wait_for_workers_to_finish since there we only
update every 1 second.

Thanks,

Sami Imseih
Amazon Web Services



Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
At Tue, 13 Dec 2022 02:28:49 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi 
>  wrote:
>
> We couldn't reproduce this failure and
> find the same type of failure on the cfbot from the past failures.
> It seems no subtests run in your environment.

Very sorry for that. The test script is found to be a left-over file
in a git-reset'ed working tree. Please forget about it.

FWIW, the latest patch passed make-world for me on Rocky8/x86_64.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Dilip Kumar
On Mon, Dec 12, 2022 at 11:21 PM Robert Haas  wrote:
>
> On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby  wrote:
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 4efa1d5fca0..ac15e2ce789 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
> >  record
> > 
> > 
> > -Returns a record of information about the backend's 
> > subtransactions.
> > -The fields returned are subxact_count 
> > identifies
> > -number of active subtransaction and subxact_overflow
> > - shows whether the backend's subtransaction cache is
> > -overflowed or not.
> > -   
> > +Returns a record of information about the subtransactions of the 
> > backend
> > +with the specified ID.
> > +The fields returned are subxact_count, which
> > +identifies the number of active subtransaction and
> > +subxact_overflow, which shows whether the
> > +backend's subtransaction cache is overflowed or not.
> > 
> >
>
> Makes sense.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Force streaming every change in logical decoding

2022-12-12 Thread Peter Smith
On Tue, Dec 6, 2022 at 5:23 PM shiy.f...@fujitsu.com
 wrote:
>
> Hi hackers,
>
> In logical decoding, when logical_decoding_work_mem is exceeded, the changes 
> are
> sent to output plugin in streaming mode. But there is a restriction that the
> minimum value of logical_decoding_work_mem is 64kB. I tried to add a GUC to
> allow sending every change to output plugin without waiting until
> logical_decoding_work_mem is exceeded.
>
> This helps to test streaming mode. For example, to test "Avoid streaming the
> transaction which are skipped" [1], it needs many XLOG_XACT_INVALIDATIONS
> messages. With the new option, it can be tested with fewer changes and in less
> time. Also, this new option helps to test more scenarios for "Perform 
> streaming
> logical transactions by background workers" [2].
>
> [1] 
> https://www.postgresql.org/message-id/CAFiTN-tHK=7lzfrps8fbt2ksrojgqbzywcgxst2bm9-rjja...@mail.gmail.com
> [2] 
> https://www.postgresql.org/message-id/flat/CAA4eK1%2BwyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw%40mail.gmail.com
>

Hi, I've been doing some testing that makes use of this new developer
GUC `force_stream_mode`.

IIUC this GUC is used by the walsender during the logic of the
ReorderBufferCheckMemoryLimit(). Also, AFAIK the only way that the
walsender is going to know this GUC value is by inheritance from the
parent publisher at the time the walsender process gets launched.

I may be overthinking this. Isn't there potential for this to become
quite confusing depending on the timing of when this GUC is modified?

E.g.1 When the walsender is launched, it will use whatever is the
current value of this GUC.
E.g.2 But if the GUC is changed after the walsender is already
launched, then that will have no effect on the already running
walsender.

Is that understanding correct?

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Support logical replication of DDLs

2022-12-12 Thread li jie
I noticed that the issue of ownership seems to have not been considered.
For example, if a user 'a' on the publishing side creates a table t1,
the owner of t1 is not user 'a' after it is replicated to the subscribing side.
This is a situation that did not arise in previous DML replication.

I think the ownership relationship should not be lost, and we should
perhaps add it back, as pg_dump does with "ALTER TABLE public.t1 OWNER TO a;",
even though we do not currently support replication of users.


Thoughts?

li jie




RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Takamichi Osumi (Fujitsu)
On Wednesday, December 7, 2022 12:00 PM Kyotaro Horiguchi 
 wrote:
> At Tue, 6 Dec 2022 11:08:43 -0800, Andres Freund  wrote
> in
> > Hi,
> >
> > The tests fail on cfbot:
> > https://cirrus-ci.com/task/4533866329800704
> >
> > They only seem to fail on 32bit linux.
> >
> > https://api.cirrus-ci.com/v1/artifact/task/4533866329800704/testrun/bu
> > ild-32/testrun/subscription/032_apply_delay/log/regress_log_032_apply_
> > delay
> > [06:27:10.628](0.138s) ok 2 - check if the new rows were applied to
> > subscriber timed out waiting for match: (?^:logical replication apply 
> > delay) at
> /tmp/cirrus-ci-build/src/test/subscription/t/032_apply_delay.pl line 124.
> 
> It fails for me on 64bit Linux.. (Rocky 8.7)
> 
> > t/032_apply_delay.pl ... Dubious, test returned 29 (wstat
> > 7424, 0x1d00) No subtests run
> ..
> > t/032_apply_delay.pl (Wstat: 7424 Tests: 0 Failed: 0)
> >   Non-zero exit status: 29
> >   Parse errors: No plan found in TAP output
Hi, Horiguchi-san


Sorry for being late.

We couldn't reproduce this failure and
find the same type of failure on the cfbot from the past failures.
It seems no subtests run in your environment.

Could you please share the log files, if you have
or when you can reproduce this ?

FYI, the latest patch is attached in [1].


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com



Best Regards,
Takamichi Osumi





Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
At Mon, 12 Dec 2022 18:10:00 +0530, Amit Kapila  wrote 
in 
> On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu)
>  wrote:
> > once and apply later. Our basic design is as follows:
> >
> > * All transactions are serialized to files if min_apply_delay is set to 
> > non-zero.
> > * After receiving the commit message and spending time, workers reads and
> >   applies spooled messages
> >
> 
> I think this may be more work than required because in some cases
> doing I/O just to delay xacts will later lead to more work. Can't we
> send some ping to walsender to communicate that walreceiver is alive?
> We already seem to be sending a ping in LogicalRepApplyLoop if we
> haven't heard anything from the server for more than
> wal_receiver_timeout / 2. Now, it is possible that the walsender is
> terminated due to some other reason and we need to see if we can
> detect that or if it will only be detected once the walreceiver's
> delay time is over.

FWIW, I thought the same thing with Amit.

What we should do here is have the logrep worker notify the walsender
that it is alive and that the communication in between is fine, and
maybe report the worker's status. Spontaneous send_feedback() calls
while delaying would be sufficient for this purpose. We might need to
suppress extra forced feedbacks instead. In contrast, the worker doesn't
need to bother knowing whether the peer is alive until it receives the
next data. But we might need to adjust the wait_time in
LogicalRepApplyLoop().
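
For illustration (not code from any patch), the shape of that could be
something like the sketch below; delay_until and last_received are
invented names, send_feedback() and WaitLatch() are the existing
worker/latch routines, and the wait-event name is just borrowed here:

/*
 * Illustrative sketch only: while the apply worker intentionally delays a
 * transaction, sleep in short slices and keep sending feedback so the
 * walsender doesn't conclude the worker is dead.
 */
for (;;)
{
	long		remaining_ms;

	remaining_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
												   delay_until);
	if (remaining_ms <= 0)
		break;

	(void) WaitLatch(MyLatch,
					 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					 Min(remaining_ms, 1000L),
					 WAIT_EVENT_RECOVERY_APPLY_DELAY);
	ResetLatch(MyLatch);

	/* spontaneous feedback, not forced, no reply requested */
	send_feedback(last_received, false, false);
}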

But, I'm not sure what will happen when walsender is blocked by
buffer-full for a long time.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Kyotaro Horiguchi
Hello.

At Mon, 12 Dec 2022 07:42:30 +, "Takamichi Osumi (Fujitsu)" 
 wrote in 
> On Monday, December 12, 2022 2:54 PM Kyotaro Horiguchi 
>  wrote:
> > I asked about unexpected walsender termination caused by this feature but I
> > think I didn't receive an answer for it and the behavior still exists.
..
> Thank you so much for your report!
> Yes. Currently, how to deal with the timeout issue is under discussion.
> Some analysis about the root cause are also there.
> 
> Kindly have a look at [1].
> 
> 
> [1] - 
> https://www.postgresql.org/message-id/TYAPR01MB58669394A67F2340B82E42D1F5E29%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Oops. Thank you for the pointer. Will visit there.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Tree-walker callbacks vs -Wdeprecated-non-prototype

2022-12-12 Thread Thomas Munro
On Mon, Dec 12, 2022 at 4:43 PM Thomas Munro  wrote:
> On Mon, Dec 12, 2022 at 4:07 PM Tom Lane  wrote:
> > I'm for "turn the warning off".  Per previous discussion, adhering
> > strictly to that rule would make our code worse (less legible AND
> > less safe), not better.
>
> Alright, this seems to do the trick here.

That did fix that problem.  But... seawasp also just recompiled its
compiler and picked up new opaque pointer API changes.  So no green
today.  I have more work to do to fix that, which might take some time
to get back to.




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-12 Thread Peter Geoghegan
On Mon, Dec 12, 2022 at 3:47 PM Jeff Davis  wrote:
> Freezing is driven by a need to keep the age of the oldest
> transaction ID in a table to less than ~2B; and also the need to
> truncate the clog (and reduce lookups of really old xids). It's fine to
> give a brief explanation about why we can't track very old xids, but
> it's more of an internal detail and not the main point.

I agree that that's the conventional definition. What I am proposing
is that we revise that definition a little. We should start the
discussion of freezing in the user level docs by pointing out that
freezing also plays a role at the level of individual pages. An
all-frozen page is self-contained, now and forever (or until it gets
dirtied again, at least). Even on a standby we will reliably avoid
having to do clog lookups for a page that happens to have all of its
tuples frozen.

I don't want to push back too much here. I just don't think that it
makes terribly much sense for the docs to start the conversation about
freezing by talking about the worst consequences of not freezing for
an extended period of time. That's relevant, and it's probably going
to end up as the aspect of freezing that we spend most time on, but it
still doesn't seem like a useful starting point to me.

To me this seems related to the fallacy that relfrozenxid age is any
kind of indicator about how far behind we are on freezing. I think
that there is value in talking about freezing as a maintenance task
for physical heap pages, and only then talking about relfrozenxid and
the circular XID space. The 64-bit XID patch doesn't get rid of
freezing at all, because it is still needed to break the dependency of
tuples stored in heap pages on the pg_xact, and other SLRUs -- which
suggests that you can talk about freezing and advancing relfrozenxid
as different (though still closely related) concepts.

> * I'm still having a hard time with vacuum_freeze_strategy_threshold.
> Part of it is the name, which doesn't seem to convey the meaning.

I chose the name long ago, and never gave it terribly much thought.
I'm happy to go with whatever name you prefer.

> But the heuristic also seems off to me. What if you have lots of partitions
> in an append-only range-partitioned table? That would tend to use the
> lazy freezing strategy (because each partition is small), but that's
> not what you want. I understand heuristics aren't perfect, but it feels
> like we could do something better.

It is at least vastly superior to vacuum_freeze_min_age in cases like
this. Not that that's hard -- vacuum_freeze_min_age just doesn't ever
trigger freezing in any autovacuum given a table like pgbench_history
(barring during aggressive mode), due to how it interacts with the
visibility map. So we're practically guaranteed to do literally all
freezing for an append-only table in an aggressive mode VACUUM.

Worst of all, that happens on a timeline that has nothing to do with
the physical characteristics of the table itself (like the number of
unfrozen heap pages or something). In fact, it doesn't even have
anything to do with how many distinct XIDs modified that particular
table -- XID age works at the system level.

By working at the heap rel level (which means the partition level if
it's a partitioned table), and by being based on physical units (table
size), vacuum_freeze_strategy_threshold at least manages to limit the
accumulation of unfrozen heap pages in each individual relation. This
is the fundamental unit at which VACUUM operates. So even if you get
very unlucky and accumulate many unfrozen heap pages that happen to be
distributed across many different tables, you can at least vacuum each
table independently, and in parallel. The really big problems all seem
to involve a concentration of unfrozen pages in one particular table
(usually the events table, the largest table in the system by a couple
of orders of magnitude).

That said, I agree that the system-level picture of debt (the system
level view of the number of unfrozen heap pages) is relevant, and that
it isn't directly considered by the patch. I think that that can be
treated as work for a future release. In fact, I think that there is a
great deal that we could teach autovacuum.c about the system level
view of things -- this is only one.

> Also, another purpose of this seems
> to be to achieve v15 behavior (if v16 behavior causes a problem for
> some workload), which seems like a good idea, but perhaps we should
> have a more direct setting for that?

Why, though? I think that it happens to make sense to do both with one
setting. Not because it's better to have one setting than two (though it
is) -- just because it makes sense here, given these specifics.

> * The comment above lazy_scan_strategy() is phrased in terms of the
> "traditional approach". It would be more clear if you described the
> current strategies and how they're chosen. The pre-16 behavior was as
> lazy as possible, so that's easy enough to describe without 

Re: slab allocator performance issues

2022-12-12 Thread David Rowley
Thanks for testing the patch.

On Mon, 12 Dec 2022 at 20:14, John Naylor  wrote:
> v13-0001 to 0005:

>  2.60%  postgres  postgres [.] SlabFree

> + v4 slab:

>4.98%  postgres  postgres  [.] SlabFree
>
> While allocation is markedly improved, freeing looks worse here. The 
> proportion is surprising because only about 2% of nodes are freed during the 
> load, but doing that takes up 10-40% of the time compared to allocating.

I've tried to reproduce this with the v13 patches applied and I'm not
really getting the same as you are. To run the function 100 times I
used:

select x, a.* from generate_series(1,100) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;

(I had to add the * (1+x-x) to add a lateral dependency to stop the
function just being executed once)

v13-0001 - 0005 gives me:

  37.71%  postgres [.] rt_set
  19.24%  postgres [.] SlabAlloc
   8.73%  [kernel] [k] clear_page_rep
   5.21%  postgres [.] rt_node_insert_inner.isra.0
   2.63%  [kernel] [k] asm_exc_page_fault
   2.24%  postgres [.] SlabFree

and fairly consistently 122 ms runtime per call.

Applying v4 slab patch I get:

  41.06%  postgres [.] rt_set
  10.84%  postgres [.] SlabAlloc
   9.01%  [kernel] [k] clear_page_rep
   6.49%  postgres [.] rt_node_insert_inner.isra.0
   2.76%  postgres [.] SlabFree

and fairly consistently 112 ms per call.

I wonder if you can consistently get the same result on another
compiler or after patching something like master~50 or master~100.
Maybe it's just a code alignment thing.

Looking at the annotation of perf report for SlabFree with the patched
version I see:

  │
  │ /* push this chunk onto the head of the free list */
  │ *(MemoryChunk **) pointer = block->freehead;
 0.09 │   mov 0x10(%r8),%rax
  │ slab = block->slab;
59.15 │   mov (%r8),%rbp
  │ *(MemoryChunk **) pointer = block->freehead;
 9.43 │   mov %rax,(%rdi)
  │ block->freehead = chunk;
  │
  │ block->nfree++;

I think what that's telling me is that dereferencing the block's
memory is slow, likely due to that particular cache line not being
cached any longer. I tried running the test with 10,000 ints instead
of 500,000 so that there would be less CPU cache pressure. I see:

 29.76 │   mov (%r8),%rbp
   │ *(MemoryChunk **) pointer = block->freehead;
 12.72 │   mov %rax,(%rdi)
   │ block->freehead = chunk;
   │
   │ block->nfree++;
   │   mov 0x8(%r8),%eax
   │ block->freehead = chunk;
  4.27 │   mov %rdx,0x10(%r8)
   │ SlabBlocklistIndex():
   │ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
   │   mov $0x1,%edx
   │ SlabFree():
   │ block->nfree++;
   │   lea 0x1(%rax),%edi
   │   mov %edi,0x8(%r8)
   │ SlabBlocklistIndex():
   │ int32   blocklist_shift = slab->blocklist_shift;
   │   mov 0x70(%rbp),%ecx
   │ index = (nfree + (1 << blocklist_shift) - 1) >> blocklist_shift;
  8.46 │   shl %cl,%edx

various other instructions in SlabFree are proportionally taking
longer now. For example the bitshift at the end was insignificant
previously. That indicates to me that this is due to caching effects.
We must fetch the block in SlabFree() in both versions. It's possible
that something is going on in SlabAlloc() that is causing more useful
cachelines to be evicted, but (I think) one of the primary design goals
Andres was going for was to reduce that. For example not having to
write out the freelist for an entire block when the block is first
allocated means not having to load possibly all cache lines for the
entire block anymore.

I tried looking at perf stat during the run.

Without slab changes:

drowley@amd3990x:~$ sudo perf stat --pid=74922 sleep 2
 Performance counter stats for process id '74922':

  2,000.74 msec task-clock#1.000 CPUs utilized
 4  context-switches  #1.999 /sec
 0  cpu-migrations#0.000 /sec
   578,139  page-faults   #  288.963 K/sec
 8,614,687,392  cycles#4.306 GHz
   (83.21%)
   682,574,688  stalled-cycles-frontend   #7.92% frontend
cycles idle (83.33%)
 4,822,904,271  stalled-cycles-backend#   55.98% backend
cycles idle  (83.41%)
11,447,124,105  instructions  #1.33  insn per cycle
  #0.42  stalled
cycles per insn  (83.41%)
 1,947,647,575  branches  #  973.464 M/sec
   (83.41%)
13,914,897  branch-misses #0.71% of all
branches   

Re: Date-Time dangling unit fix

2022-12-12 Thread Joseph Koshakow
On Mon, Dec 12, 2022 at 10:55 AM Joseph Koshakow  wrote:
>
> I just found another class of this bug that the submitted patch does
> not fix. If the units are at the beginning of the string, then they are
> also ignored. For example, `date 'm d y2020m11d3'` is also valid. I
> think the fix here is to check and make sure that ptype is 0 before
> reassigning the value to a non-zero number. I'll send an updated patch
> with this tonight.

Attached is the described patch.

- Joe Koshakow
From af72736bb4149afa629281e27da2141635a93cac Mon Sep 17 00:00:00 2001
From: Joseph Koshakow 
Date: Sat, 10 Dec 2022 18:59:26 -0500
Subject: [PATCH] Handle dangling units in date-time input

DecodeDateTime and DecodeTimeOnly allowed dangling unit types at the
beginning and end of inputs without returning an error. For example,
`date '1995-08-06 m y d'` and
`timestamp 'y m s d y2001m02d04 h04mm17s34'` were considered a valid
date and the dangling units were ignored. This commit fixes this issue
so an error is returned instead.
---
 src/backend/utils/adt/datetime.c  | 21 -
 src/test/regress/expected/date.out| 10 ++
 src/test/regress/expected/time.out|  5 +
 src/test/regress/expected/timestamp.out   | 10 ++
 src/test/regress/expected/timestamptz.out | 10 ++
 src/test/regress/expected/timetz.out  |  5 +
 src/test/regress/sql/date.sql |  5 +
 src/test/regress/sql/time.sql |  3 +++
 src/test/regress/sql/timestamp.sql|  5 +
 src/test/regress/sql/timestamptz.sql  |  5 +
 src/test/regress/sql/timetz.sql   |  3 +++
 11 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index b5b117a8ca..ebd7caff08 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -1509,6 +1509,9 @@ DecodeDateTime(char **field, int *ftype, int nf,
 
 	case UNITS:
 		tmask = 0;
+		/* prevent consecutive unhandled units */
+		if (ptype != 0)
+			return DTERR_BAD_FORMAT;
 		ptype = val;
 		break;
 
@@ -1535,7 +1538,9 @@ DecodeDateTime(char **field, int *ftype, int nf,
 			 ftype[i + 1] != DTK_TIME &&
 			 ftype[i + 1] != DTK_DATE))
 			return DTERR_BAD_FORMAT;
-
+		/* prevent consecutive unhandled units */
+		if (ptype != 0)
+			return DTERR_BAD_FORMAT;
 		ptype = val;
 		break;
 
@@ -1566,6 +1571,10 @@ DecodeDateTime(char **field, int *ftype, int nf,
 		fmask |= tmask;
 	}			/* end loop over fields */
 
+	/* prefix type was dangling and never handled */
+	if (ptype != 0)
+		return DTERR_BAD_FORMAT;
+
 	/* do final checking/adjustment of Y/M/D fields */
 	dterr = ValidateDate(fmask, isjulian, is2digits, bc, tm);
 	if (dterr)
@@ -2367,6 +2376,9 @@ DecodeTimeOnly(char **field, int *ftype, int nf,
 
 	case UNITS:
 		tmask = 0;
+		/* prevent consecutive unhandled units */
+		if (ptype != 0)
+			return DTERR_BAD_FORMAT;
 		ptype = val;
 		break;
 
@@ -2385,6 +2397,9 @@ DecodeTimeOnly(char **field, int *ftype, int nf,
 			 ftype[i + 1] != DTK_DATE))
 			return DTERR_BAD_FORMAT;
 
+		/* prevent consecutive unhandled units */
+		if (ptype != 0)
+			return DTERR_BAD_FORMAT;
 		ptype = val;
 		break;
 
@@ -2415,6 +2430,10 @@ DecodeTimeOnly(char **field, int *ftype, int nf,
 		fmask |= tmask;
 	}			/* end loop over fields */
 
+	/* prefix type was dangling and never handled */
+	if (ptype != 0)
+		return DTERR_BAD_FORMAT;
+
 	/* do final checking/adjustment of Y/M/D fields */
 	dterr = ValidateDate(fmask, isjulian, is2digits, bc, tm);
 	if (dterr)
diff --git a/src/test/regress/expected/date.out b/src/test/regress/expected/date.out
index f8f83e40e9..f4239a5402 100644
--- a/src/test/regress/expected/date.out
+++ b/src/test/regress/expected/date.out
@@ -1526,3 +1526,13 @@ select make_time(10, 55, 100.1);
 ERROR:  time field value out of range: 10:55:100.1
 select make_time(24, 0, 2.1);
 ERROR:  time field value out of range: 24:00:2.1
+-- test error on dangling units
+SELECT date '1995-08-06 m';
+ERROR:  invalid input syntax for type date: "1995-08-06 m"
+LINE 1: SELECT date '1995-08-06 m';
+^
+SET datestyle = ISO;
+SELECT date 'y m s d y2001m02d04';
+ERROR:  invalid input syntax for type date: "y m s d y2001m02d04"
+LINE 1: SELECT date 'y m s d y2001m02d04';
+^
diff --git a/src/test/regress/expected/time.out b/src/test/regress/expected/time.out
index a44caededd..5f7058eca8 100644
--- a/src/test/regress/expected/time.out
+++ b/src/test/regress/expected/time.out
@@ -229,3 +229,8 @@ SELECT date_part('epoch',   TIME '2020-05-26 13:30:25.575401');
  48625.575401
 (1 row)
 
+-- test error on dangling units
+SELECT time '12:30:15 d';
+ERROR:  invalid input syntax for type time: "12:30:15 d"
+LINE 1: SELECT time '12:30:15 d';
+^
diff --git 

Re: [PATCH] Add native windows on arm64 support

2022-12-12 Thread Michael Paquier
On Mon, Dec 12, 2022 at 01:38:37PM +, Niyas Sait wrote:
> On 05/12/2022 18:14, Andres Freund wrote:
> I think the old build system specific part is really minimal in the patch. I
> can strip out those if that's preferred.

Removing all the changes from src/tools/msvc/ is an approach that
works for me.

>> Why does this need to be hardcoded? The compiler probe should just work for
>> msvc.
> 
> There are a couple of minor issues in the compiler probe with MSVC, such as
> arm_acle.h needing to be removed and intrin.h needing to be included
> explicitly. But even with those fixes, USE_ARMV8_CRC32C would be set and no
> runtime CRC extension check would be performed. Since the CRC extension is
> optional in ARMv8, it would be better to use the CRC variant with a runtime
> check. So I ended up following the x64 variant and hardcoded the flags for
> ARM64 with MSVC.

Hm..  Andres, do you have access to Windows hosts with ARM64
underneath to validate any of that?  No way to blame Cirrus for not
having this option, they already do a lot :)
--
Michael




Re: New strategies for freezing, advancing relfrozenxid early

2022-12-12 Thread Jeff Davis
On Sat, 2022-12-10 at 18:11 -0800, Peter Geoghegan wrote:
> On Tue, Dec 6, 2022 at 1:45 PM Peter Geoghegan  wrote:
> > v9 will also address some of the concerns you raised in your review
> > that weren't covered by v8, especially about the VM snapshotting
> > infrastructure. But also your concerns about the transition from
> > lazy
> > strategies to eager strategies.
> 
> Attached is v9. Highlights:

Comments:

* The documentation shouldn't have a heading like "Managing the 32-bit
Transaction ID address space". We already have a concept of "age"
documented, and I think that's all that's needed in the relevant
section. Freezing is driven by a need to keep the age of the oldest
transaction ID in a table to less than ~2B; and also the need to
truncate the clog (and reduce lookups of really old xids). It's fine to
give a brief explanation about why we can't track very old xids, but
it's more of an internal detail and not the main point.

* I'm still having a hard time with vacuum_freeze_strategy_threshold.
Part of it is the name, which doesn't seem to convey the meaning. But
the heuristic also seems off to me. What if you have lots of partitions
in an append-only range-partitioned table? That would tend to use the
lazy freezing strategy (because each partition is small), but that's
not what you want. I understand heuristics aren't perfect, but it feels
like we could do something better. Also, another purpose of this seems
to be to achieve v15 behavior (if v16 behavior causes a problem for
some workload), which seems like a good idea, but perhaps we should
have a more direct setting for that?

* The comment above lazy_scan_strategy() is phrased in terms of the
"traditional approach". It would be more clear if you described the
current strategies and how they're chosen. The pre-16 behavior was as
lazy as possible, so that's easy enough to describe without referring
to history.

* "eager skipping behavior" seems like a weird phrasing because it's
not immediately clear if that means "skip more pages" (eager to skip
pages and lazy to process them) or "skip fewer pages" (lazy to skip the
pages and eager to process the pages).

* The skipping behavior for all-visible pages is binary: skip them
all, or skip none. That makes sense in the context of relfrozenxid
advancement. But how does that avoid IO spikes? It would seem perfectly
reasonable to me, if relfrozenxid advancement is not a pressing
problem, to process some fraction of the all-visible pages (or perhaps
process enough of them to freeze some fraction). That would ensure that
each VACUUM makes a payment on the deferred costs of freezing. I think
this has already been discussed but it keeps reappearing in my mind, so
maybe we can settle this with a comment (and/or docs)?

* I'm wondering whether vacuum_freeze_min_age makes sense anymore. It
doesn't take effect unless the page is not skipped, which is confusing
from a usability standpoint, and we have better heuristics to decide if
the whole page should be frozen or not anyway (i.e. if an FPI was
already taken then freezing is cheaper).


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-12 Thread Peter Smith
Some minor review comments for v58-0001

==

.../replication/logical/applyparallelworker.c

1. pa_can_start

+ /*
+ * Don't start a new parallel worker if user has set skiplsn as it's
+ * possible that user want to skip the streaming transaction. For streaming
+ * transaction, we need to serialize the transaction to a file so that we
+ * can get the last LSN of the transaction to judge whether to skip before
+ * starting to apply the change.
+ */
+ if (!XLogRecPtrIsInvalid(MySubscription->skiplsn))
+ return false;


"that user want" -> "that they want"

"For streaming transaction," -> "For streaming transactions,"

~~~

2. pa_free_worker_info

+ /* Remove from the worker pool. */
+ ParallelApplyWorkerPool = list_delete_ptr(ParallelApplyWorkerPool,
+winfo);

Unnecessary wrapping

~~~

3. pa_set_stream_apply_worker

+/*
+ * Set the worker that required to apply the current streaming transaction.
+ */
+void
+pa_set_stream_apply_worker(ParallelApplyWorkerInfo *winfo)
+{
+ stream_apply_worker = winfo;
+}

Comment wording seems wrong.

==

src/include/replication/worker_internal.h

4. ParallelApplyWorkerShared

+ * XactLastCommitEnd from the parallel apply worker. This is required to
+ * update the lsn_mappings by leader worker.
+ */
+ XLogRecPtr last_commit_end;
+} ParallelApplyWorkerShared;


"This is required to update the lsn_mappings by leader worker." -->
did you mean "This is required by the leader worker so it can update
the lsn_mappings." ??

--
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Perform streaming logical transactions by background workers and parallel apply

2022-12-12 Thread Peter Smith
FYI - a rebase is needed.

This patch is currently failing in cfbot [1], probably due to recent
logical replication documentation updates [2].

--
[1] cfbot failing for v59 - http://cfbot.cputube.org/patch_41_3621.log
[2] PGDOCS updated -
https://github.com/postgres/postgres/commit/a8500750ca0acf6bb95cf9d1ac7f421749b22db7

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: PGDOCS - Logical replication GUCs - added some xrefs

2022-12-12 Thread Peter Smith
On Tue, Dec 13, 2022 at 6:25 AM Alvaro Herrera  wrote:
>
> On 2022-Dec-07, samay sharma wrote:
>
> > On Tue, Dec 6, 2022 at 11:12 PM Peter Smith  wrote:
>
> > > OK. I copied the tablesync note back to config.sgml definition of
> > > 'max_replication_slots' and removed the link as suggested. Frankly, I
> > > also thought it is a bit strange that the max_replication_slots in the
> > > “Sending Servers” section was describing this parameter for
> > > “Subscribers”. OTOH, I did not want to split the definition in half so
> > > instead, I’ve added another Subscriber  that just refers
> > > back to this place. It looks like an improvement to me.
> >
> > Hmm, I agree this is a tricky scenario. However, to me, it seems odd to
> > mention the parameter twice as this chapter of the docs just lists each
> > parameter and describes them. So, I'd probably remove the reference to it
> > in the subscriber section. We should describe it's usage in different
> > places in the logical replication part of the docs (as we do).
>
> I agree this is tricky.  However, because they essentially have
> completely different behaviors on each side, and because we're
> documenting each side separately, to me it makes more sense to document
> each behavior separately, so I've split it.  I also added mention at
> each side that the other one exists.  My rationale is that a user is
> likely going to search for stuff to set on one side first, then for
> stuff to set on the other side.  So doing it this way maximizes
> helpfulness (or so I hope anyway).  I also added a separate index entry.
>

LGTM. Thank you for pushing this.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: ALTER TABLE uses a bistate but not for toast tables

2022-12-12 Thread Nikita Malakhov
Hi!

I found this discussion during our experiments with TOAST; I'll have to
check it against [1].
I'm not sure what behavior is expected when the main table is unpinned,
a bulk insert into the TOAST table is in progress, and a second query
with a heavy bulk insert into the same TOAST table comes in?

Thank you!

[1]
https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73a...@sigaev.ru

On Sun, Nov 27, 2022 at 11:15 PM Justin Pryzby  wrote:

> On Wed, Sep 07, 2022 at 10:48:39AM +0200, Drouvot, Bertrand wrote:
> > Hi,
> >
> > On 6/22/22 4:38 PM, Justin Pryzby wrote:
> > > ATRewriteTable() calls table_tuple_insert() with a bistate, to avoid
> clobbering
> > > and polluting the buffers.
> > >
> > > But heap_insert() then calls
> > > heap_prepare_insert() >
> > > heap_toast_insert_or_update >
> > > toast_tuple_externalize >
> > > toast_save_datum >
> > > heap_insert(toastrel, toasttup, mycid, options, NULL /* without
> bistate:( */);
> >
> > What do you think about creating earlier a new dedicated bistate for the
> > toast table?
>
> Yes, but I needed to think about what data structure to put it in...
>
> Here, I created a 2nd bistate for toast whenever creating a bistate for
> heap.  That avoids the need to add arguments to tableam's
> table_tuple_insert(), in addition to the 6 other functions in the call
> stack.
>
> I also updated rewriteheap.c to handle the same problem in CLUSTER:
>
> postgres=# DROP TABLE t; CREATE TABLE t AS SELECT i,
> repeat((555+i)::text, 123456)t FROM generate_series(1,)i;
> postgres=# VACUUM FULL VERBOSE t ; SELECT COUNT(1), datname,
> coalesce(c.relname,b.relfilenode::text), d.relname FROM pg_buffercache b
> LEFT JOIN pg_class c ON b.relfilenode=pg_relation_filenode(c.oid) LEFT JOIN
> pg_class d ON d.reltoastrelid=c.oid LEFT JOIN pg_database db ON
> db.oid=b.reldatabase GROUP BY 2,3,4 ORDER BY 1 DESC LIMIT 22;
>
> Unpatched:
>   5000 | postgres | pg_toast_96188840   | t
>   => 40MB of shared buffers
>
> Patched:
>   2048 | postgres | pg_toast_17097  | t
>
> Note that a similar problem seems to exist in COPY ... but I can't see
> how to fix that one.
>
> --
> Justin
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-12-12 Thread Tom Lane
Peter Eisentraut  writes:
> On 25.11.22 18:06, Tom Lane wrote:
>> In my mind, this is waiting for Peter to opine on whether it satisfies
>> his concern.

> The case I was working on is the same as Israel's.  He has confirmed 
> that this fixes the issue we have been working on.

OK, I'll make this happen soon.

regards, tom lane




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-12 Thread Nathan Bossart
On Mon, Dec 12, 2022 at 12:04:27PM -0800, Nathan Bossart wrote:
> Patch attached.  I ended up reverting some parts of the VACUUM/ANALYZE
> patch that were no longer needed (i.e., if the user doesn't have permission
> to VACUUM, we don't need to separately check whether the user has
> permission to ANALYZE).  Otherwise, I don't think there's anything
> tremendously different between v1 and v2 besides the fact that all the
> privileges are grouped together.

Here is a v3 of the patch that fixes a typo in the docs.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bd290e4fa404169342b226d2b5673369d96b71b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Dec 2022 11:20:01 -0800
Subject: [PATCH v3 1/1] add grantable MAINTAIN privilege

---
 doc/src/sgml/ddl.sgml |  41 ++---
 doc/src/sgml/func.sgml|   3 +-
 .../sgml/ref/alter_default_privileges.sgml|   4 +-
 doc/src/sgml/ref/analyze.sgml |   9 +-
 doc/src/sgml/ref/cluster.sgml |   8 +-
 doc/src/sgml/ref/grant.sgml   |   5 +-
 .../sgml/ref/refresh_materialized_view.sgml   |   5 +-
 doc/src/sgml/ref/reindex.sgml |  13 +-
 doc/src/sgml/ref/revoke.sgml  |   2 +-
 doc/src/sgml/ref/vacuum.sgml  |   9 +-
 doc/src/sgml/user-manag.sgml  |  18 +-
 src/backend/catalog/aclchk.c  |  35 ++--
 src/backend/commands/analyze.c|   2 +-
 src/backend/commands/cluster.c|  15 +-
 src/backend/commands/indexcmds.c  |  35 ++--
 src/backend/commands/tablecmds.c  |  13 +-
 src/backend/commands/vacuum.c |  15 +-
 src/backend/parser/gram.y |   7 -
 src/backend/utils/adt/acl.c   |  22 +--
 src/bin/pg_dump/dumputils.c   |   3 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |   2 +-
 src/bin/psql/tab-complete.c   |   5 +-
 src/include/catalog/pg_authid.dat |   9 +-
 src/include/nodes/parsenodes.h|   5 +-
 src/include/utils/acl.h   |   7 +-
 src/test/regress/expected/create_index.out|   4 +-
 src/test/regress/expected/dependency.out  |  20 +--
 src/test/regress/expected/privileges.out  | 160 --
 src/test/regress/expected/rowsecurity.out |  32 ++--
 src/test/regress/expected/vacuum.out  |   6 -
 src/test/regress/sql/dependency.sql   |   2 +-
 src/test/regress/sql/privileges.sql   | 102 +--
 32 files changed, 276 insertions(+), 342 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 38618de01c..b2baff2ce1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1692,8 +1692,7 @@ ALTER TABLE products RENAME TO items;
TRUNCATE, REFERENCES, TRIGGER,
CREATE, CONNECT, TEMPORARY,
EXECUTE, USAGE, SET,
-   ALTER SYSTEM, VACUUM, and
-   ANALYZE.
+   ALTER SYSTEM, and MAINTAIN.
The privileges applicable to a particular
object vary depending on the object's type (table, function, etc.).
More detail about the meanings of these privileges appears below.
@@ -1985,19 +1984,12 @@ REVOKE ALL ON accounts FROM PUBLIC;
 
 

-VACUUM
+MAINTAIN
 
  
-  Allows VACUUM on a relation.
- 
-
-   
-
-   
-ANALYZE
-
- 
-  Allows ANALYZE on a relation.
+  Allows VACUUM, ANALYZE,
+  CLUSTER, REFRESH MATERIALIZED VIEW
+  and REINDEX on a relation.
  
 

@@ -2151,13 +2143,8 @@ REVOKE ALL ON accounts FROM PUBLIC;
   PARAMETER
  
  
-  VACUUM
-  v
-  TABLE
- 
- 
-  ANALYZE
-  z
+  MAINTAIN
+  m
   TABLE
  
  
@@ -2250,7 +2237,7 @@ REVOKE ALL ON accounts FROM PUBLIC;
  
  
   TABLE (and table-like objects)
-  arwdDxtvz
+  arwdDxtm
   none
   \dp
  
@@ -2308,12 +2295,12 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
would show:
 
 = \dp mytable
-   Access privileges
- Schema |  Name   | Type  |Access privileges|   Column privileges   | Policies
-+-+---+-+---+--
- public | mytable | table | miriam=arwdDxtvz/miriam+| col1:+|
-| |   | =r/miriam  +|   miriam_rw=rw/miriam |
-| |   | admin=arw/miriam|   |
+  Access privileges
+ Schema |  Name   | Type  |   Access privileges|   Column privileges   | Policies
++-+---++---+--
+ public | mytable | table | miriam=arwdDxtm/miriam+| col1:+|
+| |   | =r/miriam +|   miriam_rw=rw/miriam |
+| |   | admin=arw/miriam   |


Re: Why does L&Y Blink Tree need lock coupling?

2022-12-12 Thread Peter Geoghegan
On Mon, Dec 12, 2022 at 11:43 AM Oliver Yang  wrote:
> In a sense, this isn't a fundamental issue and L&Y
> paper could be easily tweaked to track downlink so that it
> doesn't require coupling lock in moveright.

I suppose that's true, but having 3 locks at a time is such a rare
case for classic L&Y that it hardly matters (it's far rarer than
having only 2, which is itself very rare). They emphasized it because
it is the highest number of locks, not because it's necessarily
important.

I'm confused about whether you're talking about what L&Y could do, in
theory, or what nbtree could do, in theory, or what nbtree should
actually be doing. These are 3 different subjects, really.

> However, it's hard to see why coupling lock is needed during
> ascent from child level to parent level in the L&Y setting.  What can
> go wrong if L&Y's algorithm releases lock on child page before
> acquiring lock on its parent?  The correctness proof in L&Y
> doesn't use the assumption of coupling lock anywhere.  It appears
> that a lock at a time is sufficient in principle.

That may be true, but the words "in principle" are doing a lot of work
for you here. Travelling at a speed that approaches the speed of light
is also possible in principle.

Imagine that there is no lock coupling at all. What happens when there
is a page split that doesn't complete its second phase due to a hard
crash? Do inserters complete the split in passing, like in nbtree? Now
you have to think about incomplete splits across multiple levels. That
gets very complicated, but it needs to be addressed in a real system
like Postgres. Academic papers can just ignore corner cases like this.

Why do you think that consistently only holding one lock (as opposed
to only holding one lock at a time during 99%+ of all inserts) is
truly valuable in a practical setting? Maybe it is valuable, but
that's rather unclear. It is not an area that I would choose to work
on, given that uncertainty.

> The direct quote from section 1.2 of Lanin & Shasha
> paper: "Although it is not apparent in itself, the B-link
> structure allows inserts and searches to lock only one node at a
> time."  It seems to be an assertion on the property of the L
> algorithm.  It doesn't seem to be related the optimistic approach
> employed in Lanin & Shasha own algorithm.

I don't know how you can reach that conclusion. It directly
contradicts the claim made by the L&Y paper about requiring at most 3
locks. And they even say "although it's not apparent in itself",
presenting it as new information.

They seem to be saying that the same basic B-Link data structure (or
one like it, with the addition of outlinks) could do that -- but
that's not the same as the L&Y algorithm (the original design). That
does seem possible, though I doubt that it would be particularly
compelling, since L&Y/nbtree don't need to do lock coupling for the
vast majority of individual inserts or searches.

I don't think that the L&Y paper is particularly clear, or
particularly well written. It needs to be interpreted in its original
context, which is quite far removed from the current concerns of
nbtree. It's a 41 year old paper.

-- 
Peter Geoghegan




Re: allow granting CLUSTER, REFRESH MATERIALIZED VIEW, and REINDEX

2022-12-12 Thread Nathan Bossart
On Sat, Dec 10, 2022 at 12:41:09PM -0800, Nathan Bossart wrote:
> On Sat, Dec 10, 2022 at 12:07:12PM -0800, Jeff Davis wrote:
>> It seems like the discussion on VACUUM/CLUSTER/REINDEX privileges is
>> happening in the other thread. What would you like to accomplish in
>> this thread?
> 
> Given the feedback in the other thread [0], I was planning to rewrite this
> patch to create a MAINTAIN privilege and a pg_maintain_all_tables
> predefined role that allowed VACUUM, ANALYZE, CLUSTER, REFRESH MATERIALIZED
> VIEW, and REINDEX.

Patch attached.  I ended up reverting some parts of the VACUUM/ANALYZE
patch that were no longer needed (i.e., if the user doesn't have permission
to VACUUM, we don't need to separately check whether the user has
permission to ANALYZE).  Otherwise, I don't think there's anything
tremendously different between v1 and v2 besides the fact that all the
privileges are grouped together.

Since there are only 15 privilege bits used after this patch is applied,
presumably we could revert widening AclMode to 64 bits.  However, I imagine
that will still be necessary at some point in the near future, so I don't
see a strong reason to revert it.
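
As a rough usage sketch (relation and role names below are invented), the new
privilege is granted and exercised like any other table-level privilege:

GRANT MAINTAIN ON measurements TO maint_role;

SET ROLE maint_role;
VACUUM (ANALYZE) measurements;   -- allowed via MAINTAIN
REINDEX TABLE measurements;      -- likewise

-- or grant maintenance on everything through the proposed predefined role:
GRANT pg_maintain_all_tables TO maint_role;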

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From b5892ef88b0807392535c66b6216b10f7d556940 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 7 Dec 2022 11:20:01 -0800
Subject: [PATCH v2 1/1] add grantable MAINTAIN privilege

---
 doc/src/sgml/ddl.sgml |  41 ++---
 doc/src/sgml/func.sgml|   3 +-
 .../sgml/ref/alter_default_privileges.sgml|   4 +-
 doc/src/sgml/ref/analyze.sgml |   9 +-
 doc/src/sgml/ref/cluster.sgml |   8 +-
 doc/src/sgml/ref/grant.sgml   |   5 +-
 .../sgml/ref/refresh_materialized_view.sgml   |   5 +-
 doc/src/sgml/ref/reindex.sgml |  13 +-
 doc/src/sgml/ref/revoke.sgml  |   2 +-
 doc/src/sgml/ref/vacuum.sgml  |   9 +-
 doc/src/sgml/user-manag.sgml  |  18 +-
 src/backend/catalog/aclchk.c  |  35 ++--
 src/backend/commands/analyze.c|   2 +-
 src/backend/commands/cluster.c|  15 +-
 src/backend/commands/indexcmds.c  |  35 ++--
 src/backend/commands/tablecmds.c  |  13 +-
 src/backend/commands/vacuum.c |  15 +-
 src/backend/parser/gram.y |   7 -
 src/backend/utils/adt/acl.c   |  22 +--
 src/bin/pg_dump/dumputils.c   |   3 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |   2 +-
 src/bin/psql/tab-complete.c   |   5 +-
 src/include/catalog/pg_authid.dat |   9 +-
 src/include/nodes/parsenodes.h|   5 +-
 src/include/utils/acl.h   |   7 +-
 src/test/regress/expected/create_index.out|   4 +-
 src/test/regress/expected/dependency.out  |  20 +--
 src/test/regress/expected/privileges.out  | 160 --
 src/test/regress/expected/rowsecurity.out |  32 ++--
 src/test/regress/expected/vacuum.out  |   6 -
 src/test/regress/sql/dependency.sql   |   2 +-
 src/test/regress/sql/privileges.sql   | 102 +--
 32 files changed, 276 insertions(+), 342 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 38618de01c..b2baff2ce1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1692,8 +1692,7 @@ ALTER TABLE products RENAME TO items;
TRUNCATE, REFERENCES, TRIGGER,
CREATE, CONNECT, TEMPORARY,
EXECUTE, USAGE, SET,
-   ALTER SYSTEM, VACUUM, and
-   ANALYZE.
+   ALTER SYSTEM, and MAINTAIN.
The privileges applicable to a particular
object vary depending on the object's type (table, function, etc.).
More detail about the meanings of these privileges appears below.
@@ -1985,19 +1984,12 @@ REVOKE ALL ON accounts FROM PUBLIC;
 
 

-VACUUM
+MAINTAIN
 
  
-  Allows VACUUM on a relation.
- 
-
-   
-
-   
-ANALYZE
-
- 
-  Allows ANALYZE on a relation.
+  Allows VACUUM, ANALYZE,
+  CLUSTER, REFRESH MATERIALIZED VIEW
+  and REINDEX on a relation.
  
 

@@ -2151,13 +2143,8 @@ REVOKE ALL ON accounts FROM PUBLIC;
   PARAMETER
  
  
-  VACUUM
-  v
-  TABLE
- 
- 
-  ANALYZE
-  z
+  MAINTAIN
+  m
   TABLE
  
  
@@ -2250,7 +2237,7 @@ REVOKE ALL ON accounts FROM PUBLIC;
  
  
   TABLE (and table-like objects)
-  arwdDxtvz
+  arwdDxtm
   none
   \dp
  
@@ -2308,12 +2295,12 @@ GRANT SELECT (col1), UPDATE (col1) ON mytable TO miriam_rw;
would show:
 
 = \dp mytable
-   Access privileges
- Schema |  Name   | Type  |Access privileges|   Column privileges   | Policies
-+-+---+-+---+--
- public | mytable | 

Re: PGDOCS - Logical replication GUCs - added some xrefs

2022-12-12 Thread Alvaro Herrera
On 2022-Dec-11, Tom Lane wrote:

> Ian is still working on closing out the November 'fest :-(.
> I suspect that in a day or so that one will get moved, and
> you will have duplicate entries in the January 'fest.

I've marked both as committed.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com




Re: Why does L&Y Blink Tree need lock coupling?

2022-12-12 Thread Oliver Yang
On Sun, Dec 11, 2022 at 6:01 PM Peter Geoghegan  wrote:
>
> On Sun, Dec 11, 2022 at 5:38 PM Oliver Yang  wrote:
> > The README in nbtree mentions that L&Y algorithm must couple
> > locks when moving right during ascent for insertion.  However,
> > it's hard to see why that's necessary.
>
> You're not the first person to ask about this exact point in the last
> few years. The last time somebody asked me about it via private email.
> I'm surprised that there is even that level of interest.
>
> It's not necessary to couple locks on internal levels of the tree in
> nbtree, of course. Which is why we don't couple locks in
> _bt_getstackbuf().
>
> Note that we have two "move right" functions here --
> _bt_getstackbuf(), and _bt_moveright(). Whereas the L&Y pseudocode has
> one function for both things. Perhaps just because it's only abstract
> pseudocode -- it needs to be understood in context.
>
> You have to be realistic about how faithfully a real world system can
> be expected to implement something like L&Y. Some of the assumptions
> made by the paper just aren't reasonable, especially the assumption
> about atomic page reads (I always thought that that assumption was
> odd). Plus nbtree does plenty of things that L&Y don't consider,
> including things that are obviously related to the stuff they do talk
> about. For example, nbtree has left links, while L&Y does not (nor
> does Lanin & Shasha, though they do have something weirdly similar
> that they call "out links").
>
> > Since L&Y mostly
> > discussed concurrent insertions and searches, what can go wrong
> > if inserters only acquire one lock at a time?
>
> You have to consider that we don't match on the separator key during
> the ascent of the B-tree structure following a split. That's another
> big difference between nbtree and the paper -- we store a block number
> in our descent stack instead.
>
> Imagine if PostgreSQL's nbtree did match on a separator key, like
> Lehman and Yao. Think about this scenario:
>
> * We descend the tree and follow a downlink, while remembering the
> associated separator key on our descent stack. It is a simple 3 level
> B-Tree.
>
> * We split a leaf page, and have to relocate the separator 1 level up
> (in level 1).
>
> * The high key of the internal page on level 1 exactly matches our
> separator key -- so it must be that the separator key is to the right.
>
> * We release our lock on the original parent, and then lock and read
> its right sibling. But where do we insert our new separator key?
>
> We cannot match the original separator key because it doesn't exist in
> this other internal page to the right -- there is no first separator
> key in any internal page, including when it isn't the root of the
> tree. Actually, you could say that there is a separator key associated
> with the downlink, but it's only a negative infinity sentinel key.
> Negative infinity keys represent "absolute negative infinity" when
> they're from the root of the entire B-Tree, but in any other internal
> page it represents "relative negative infinity" -- it's only lower
> than everything in that particular subtree only.
>
> At the very least it seems risky to assume that it's safe to match on
> the separator key without lock coupling so that we see that the
> downlink really does come after the matches-descent-stack high key
> separator in the original parent page. You could probably make it work
> if you had to, but it's annoying to explain, and not actually that
> valuable -- moving right within _bt_getstackbuf() is a rare case in
> general (and trying to relocate the downlink that happens to become
> the first downlink following a concurrent internal page split is even
> rarer).
>
> In PostgreSQL it's not annoying to understand why it's okay, because
> it's obviously okay to just match on the downlink/block number
> directly, which is how it has always worked. It only becomes a problem
> when you try to understand what Lehman and Yao meant. It's unfortunate
> that they say "at most 3 locks", and therefore draw attention to this
> not-very-important issue. Lehman and Yao probably found it easier to
> say "let's try to keep our paper simple by making the move right
> routine couple locks in very rare cases where it is actually necessary
> to move right".

As you suggested, the coupling lock during moveright could be
avoided by tracking downlink instead of the separator key during
descent.  In a sense, this isn't a fundamental issue and L&Y
paper could be easily tweaked to track downlink so that it
doesn't require coupling lock in moveright.

However, it's hard to see why coupling lock is needed during
ascent from child level to parent level in the L&Y setting.  What can
go wrong if L&Y's algorithm releases lock on child page before
acquiring lock on its parent?  The correctness proof in L&Y
doesn't use the assumption of coupling lock anywhere.  It appears
that a lock at a time is sufficient in principle.

> > The Lanin paper cited in README also agrees that 

Re: Progress report of CREATE INDEX for nested partitioned tables

2022-12-12 Thread Ilya Gladyshev

> Could you check what I've written as a counter-proposal ?

I think this might be a good solution to start with; it gives us the
opportunity to improve the granularity later without any surprising changes
for the end user. We could use this patch for previous versions and make the
output more granular in the latest one. What do you think?

> As long as we're changing partitions_done to include nested
> sub-partitions, it seems to me like we should exclude intermediate
> "catalog-only" partitioned indexes, and count only physical leaf
> partitions.  Should it alo exclude any children with matching indexes,
> which will also be catalog-only changes?  Probably not.
> 
> The docs say:
> |When creating an index on a partitioned table, this column is set to the
> |total number of partitions on which the index is to be created. This
> |field is 0 during a REINDEX.

I agree with you on catalog-only partitioned indexes, but I think that in a
perfect world we should exclude all the relations where the index isn't
actually created, which means excluding attached indexes as well. However,
IMO doing it this way would require too much of a code rewrite for quite a
minor feature (though we could do it, of course). I actually think that the
progress view would be better off without the total number of partitions, but
I'm not sure we have that option now. With this in mind, your proposal to
exclude catalog-only indexes sounds reasonable to me, but I feel the docs are
off in this case: the attached indexes are not created, yet we pretend that
they are in this metric, so we should fix one or the other.
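
For reference, the counters in question are the ones exposed by the existing
progress view (the query is only illustrative):

SELECT pid, relid::regclass, phase, partitions_total, partitions_done
FROM pg_stat_progress_create_index;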

> 
>> I changed current behaviour to report the total number of partitions in
>> the inheritance tree and fixed recursion in the attached patch. I used
>> a static variable to keep the counter to avoid ABI breakage of
>> DefineIndex, so that we could backpatch this to previous versions.
> 
> I wrote a bunch of assertions for this, which seems to have uncovered a
> similar issue with COPY progress reporting, dating to 8a4f618e7.  I'm
> not sure the assertions are okay.  I imagine they may break other
> extensions, as with file_fdw.
> 
> -- 
> Justin
> <0001-fix-progress-reporting-of-nested-partitioned-indexes.patch>



Re: PGDOCS - Logical replication GUCs - added some xrefs

2022-12-12 Thread Alvaro Herrera
On 2022-Dec-07, samay sharma wrote:

> On Tue, Dec 6, 2022 at 11:12 PM Peter Smith  wrote:

> > OK. I copied the tablesync note back to config.sgml definition of
> > 'max_replication_slots' and removed the link as suggested. Frankly, I
> > also thought it is a bit strange that the max_replication_slots in the
> > “Sending Servers” section was describing this parameter for
> > “Subscribers”. OTOH, I did not want to split the definition in half so
> > instead, I’ve added another Subscriber  that just refers
> > back to this place. It looks like an improvement to me.
> 
> Hmm, I agree this is a tricky scenario. However, to me, it seems odd to
> mention the parameter twice as this chapter of the docs just lists each
> parameter and describes them. So, I'd probably remove the reference to it
> in the subscriber section. We should describe its usage in different
> places in the logical replication part of the docs (as we do).

I agree this is tricky.  However, because they essentially have
completely different behaviors on each side, and because we're
documenting each side separately, to me it makes more sense to document
each behavior separately, so I've split it.  I also added mention at
each side that the other one exists.  My rationale is that a user is
likely going to search for stuff to set on one side first, then for
stuff to set on the other side.  So doing it this way maximizes
helpfulness (or so I hope anyway).  I also added a separate index entry.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)"
(David Garamond)




Re: BUG #17434: CREATE/DROP DATABASE can be executed in the same transaction with other commands

2022-12-12 Thread Peter Eisentraut

On 25.11.22 18:06, Tom Lane wrote:
> Israel Barth Rubio  writes:
>> It would be great if we can back-patch this to all supported versions,
>> as the issue itself is currently affecting them all.
>
> In my mind, this is waiting for Peter to opine on whether it satisfies
> his concern.

The case I was working on is the same as Israel's.  He has confirmed
that this fixes the issue we have been working on.






Re: Re[2]: [PATCH] Optional OR REPLACE in CREATE OPERATOR statement

2022-12-12 Thread Nikita Malakhov
Hi,

Svetlana, yes, Tom means that CREATE OR REPLACE should always produce
the same result no matter which branch actually worked - CREATE or REPLACE.
The REPLACE case must produce exactly the same result as you've mentioned -
DROP and CREATE.

As for the IF NOT EXISTS option, I agree: it seems a reasonable addition to
simplify error handling in scripts, so go on.


On Wed, Jul 6, 2022 at 3:01 PM Svetlana Derevyanko <
s.derevya...@postgrespro.ru> wrote:

>
>
> Вторник, 5 июля 2022, 18:29 +03:00 от Tom Lane :
>
> =?UTF-8?B?U3ZldGxhbmEgRGVyZXZ5YW5rbw==?=  > writes:
> > It seems useful to have [OR REPLACE] option in CREATE OPERATOR
> statement, as in CREATE FUNCTION. This option may be good for
> writing extension update scripts, to avoid errors with re-creating the same
> operator.
>
> No, that's not acceptable. CREATE OR REPLACE should always produce
> exactly the same final state of the object, but in this case we cannot
> change the underlying function if the operator already exists.
>
> (At least, not without writing a bunch of infrastructure to update
> existing views/rules that might use the operator; which among other
> things would create a lot of deadlock risks.)
>
> regards, tom lane
>
> Hello,
>
> > CREATE OR REPLACE should always produce exactly the same final state of
> the object,
> > but in this case we cannot change the underlying function if the
> operator already exists.
>
> Do you mean that for existing operator CREATE OR REPLACE should be the
> same as DROP OPERATOR and CREATE OPERATOR,  with relevant re-creation of
> existing view/rules/..., using this operator? If yes, what exactly is wrong
> with  changing only RESTRICT and JOIN parameters (or is the problem in
> possible user`s confusion with attempts to change something more?). If no,
> could you, please, clarify what "final state" here means?
>
> Also, if OR REPLACE is unacceptable, then what do you think about IF NOT
> EXISTS option?
>
> Thanks,
>
> --
> Svetlana Derevyanko
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


-- 
Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/


Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Robert Haas
On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby  wrote:
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 4efa1d5fca0..ac15e2ce789 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
>  record
> 
> 
> -Returns a record of information about the backend's subtransactions.
> -The fields returned are subxact_count 
> identifies
> -number of active subtransaction and subxact_overflow
> - shows whether the backend's subtransaction cache is
> -overflowed or not.
> -   
> +Returns a record of information about the subtransactions of the 
> backend
> +with the specified ID.
> +The fields returned are subxact_count, which
> +identifies the number of active subtransaction and
> +subxact_overflow, which shows whether the
> +backend's subtransaction cache is overflowed or not.
> 
>

Makes sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2022-12-12 Thread Robert Haas
On Sat, Dec 10, 2022 at 3:09 AM Drouvot, Bertrand
 wrote:
> Attaching V30, mandatory rebase due to 66dcb09246.

It's a shame that this hasn't gotten more attention, because the topic
is important, but I'm as guilty of being too busy to spend a lot of
time on it as everyone else.

Anyway, while I'm not an expert on this topic, I did spend a little
time looking at it today, especially 0001. Here are a few comments:

I think that it's not good for IndexIsAccessibleInLogicalDecoding and
RelationIsAccessibleInLogicalDecoding to both exist. Indexes and
tables are types of relations, so this invites confusion: when the
object in question is an index, it would seem that either one can be
applied, based on the names. I think the real problem here is that
RelationIsAccessibleInLogicalDecoding is returning *the wrong answer*
when the relation is a user-catalog table. It does so because it
relies on RelationIsUsedAsCatalogTable, and that macro relies on
checking whether the reloptions include user_catalog_table.
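
For concreteness, the reloption can currently be attached only to the table
itself, not to its indexes (names below are invented):

CREATE TABLE my_extension_catalog (key text PRIMARY KEY, val text)
    WITH (user_catalog_table = true);
-- The implicitly created primary key index carries no such marking,
-- which is exactly the gap being discussed here.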

But here we can see where past thinking of this topic has been,
perhaps, a bit fuzzy. If that option were called user_catalog_relation
and had to be set on both tables and indexes as appropriate, then
RelationIsAccessibleInLogicalDecoding would already be doing the right
thing, and consequently there would be no need to add
IndexIsAccessibleInLogicalDecoding. I think we should explore the idea
of making the existing macro return the correct answer rather than
adding a new one. It's probably too late to redefine the semantics of
user_catalog_table, although if anyone wants to argue that we could
require logical decoding plugins to set this for both indexes and
tables, and/or rename to say relation instead of table, and/or add a
parallel reloption called user_catalog_index, then let's talk about
that.

Otherwise, I think we can consider adjusting the definition of
RelationIsUsedAsCatalogTable. The simplest way to do that would be to
make it check indisusercatalog for indexes and do what it does already
for tables. Then IndexIsUserCatalog and
IndexIsAccessibleInLogicalDecoding go away and
RelationIsAccessibleInLogicalDecoding returns the right answer in all
cases.

But I also wonder if a new pg_index column is really the right
approach here. One fairly obvious alternative is to try to use the
user_catalog_table reloption in both places. We could try to propagate
that reloption from the table to its indexes; whenever it's set or
unset on the table, push that down to each index. We'd have to take
care not to let the property be changed independently on indexes,
though. This feels a little grotty to me, but it does have the
advantage of symmetry. Another way to get symmetry is to go the other
way and add a new column pg_class.relusercatalog which gets used
instead of putting user_catalog_table in the reloptions, and
propagated down to indexes. But I have a feeling that the reloptions
code is not very well-structured to allow reloptions to be stored any
place but in pg_class.reloptions, so this may be difficult to
implement. Yet a third way is to have the index fetch the flag from
the associated table, perhaps when the relcache entry is built. But I
see no existing precedent for that in RelationInitIndexAccessInfo,
which I think is where it would be if we had it -- and that makes me
suspect that there might be good reasons why this isn't actually safe.
So while I do not really like the approach of storing the same
property in different ways for tables and for indexes, it's also not
really obvious to me how to do better.

Regarding the new flags that have been added to various WAL records, I
am a bit curious as to whether there's some way that we can avoid the
need to carry this information through the WAL, but I don't understand
why we don't need that now and do need that with this patch so it's
hard for me to think about that question in an intelligent way. If we
do need it, I think there might be cases where we should do something
smarter than just adding bool onCatalogAccessibleInLogicalDecoding to
the beginning of a whole bunch of WAL structs. In most cases we try to
avoid having padding bytes in the WAL struct. If we can, we try to lay
out the struct to avoid padding bytes. If we can't, we put the fields
requiring less alignment at the end of the struct and then have a
SizeOf macro that is defined to not include the length of
any trailing padding which the compiler would insert. See, for
example, SizeOfHeapDelete. This patch doesn't do any of that, and it
should. It should also consider whether there's a way to avoid adding
any new bytes at all, e.g. it adds
onCatalogAccessibleInLogicalDecoding to xl_heap_visible, but that
struct has unused bits in 'flags'.

It would be very helpful if there were some place to refer to that
explained the design decisions here, like why the feature we're trying
to get requires this infrastructure around indexes to be added. It
could be in the commit messages, an email message, 

Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Justin Pryzby
On Mon, Dec 12, 2022 at 09:33:51AM -0800, Nathan Bossart wrote:
> On Mon, Dec 12, 2022 at 11:15:43AM -0500, Robert Haas wrote:
> > Any strenuous objections?
> 
> Nope.  In fact, +1.  Until more work is done to alleviate the performance
> issues, this information will likely prove useful.

The docs could use a bit of attention.  Otherwise +1.

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4efa1d5fca0..ac15e2ce789 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
 record


-Returns a record of information about the backend's subtransactions.
-The fields returned are subxact_count identifies
-number of active subtransaction and subxact_overflow
- shows whether the backend's subtransaction cache is
-overflowed or not.
-   
+Returns a record of information about the subtransactions of the 
backend
+with the specified ID.
+The fields returned are subxact_count, which
+identifies the number of active subtransaction and
+subxact_overflow, which shows whether the
+backend's subtransaction cache is overflowed or not.

   
 




Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Nathan Bossart
On Mon, Dec 12, 2022 at 11:15:43AM -0500, Robert Haas wrote:
> Any strenuous objections?

Nope.  In fact, +1.  Until more work is done to alleviate the performance
issues, this information will likely prove useful.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: generic plans and "initial" pruning

2022-12-12 Thread Alvaro Herrera
On 2022-Dec-12, Amit Langote wrote:

> I started feeling like putting all the new logic being added
> by this patch into plancache.c at the heart of GetCachedPlan() and
> tweaking its API in kind of unintuitive ways may not have been such a
> good idea to begin with.  So I started thinking again about your
> GetRunnablePlan() wrapper idea and thought maybe we could do something
> with it.  Let's say we name it GetCachedPlanLockPartitions() and put
> the logic that does initial pruning with the new
> ExecutorDoInitialPruning() in it, instead of in the normal
> GetCachedPlan() path.  Any callers that call GetCachedPlan() instead
> call GetCachedPlanLockPartitions() with either the List ** parameter
> as now or some container struct if that seems better.  Whether
> GetCachedPlanLockPartitions() needs to do anything other than return
> the CachedPlan returned by GetCachedPlan() can be decided by the
> latter setting, say, CachedPlan.has_unlocked_partitions.  That will be
> done by AcquireExecutorLocks() when it sees containsInitialPruning in
> any of the PlannedStmts it sees, locking only the
> PlannedStmt.minLockRelids set (which is all relations where no pruning
> is needed!), leaving the partition locking to
> GetCachedPlanLockPartitions().

Hmm.  This doesn't sound totally unreasonable, except to the point David
was making that perhaps we may want this container struct to accomodate
other things in the future than just the partition pruning results, so I
think its name (and that of the function that produces it) ought to be a
little more generic than that.

(I think this also answers your question on whether a List ** is better
than a container struct.)

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)




Re: add \dpS to psql

2022-12-12 Thread Nathan Bossart
On Mon, Dec 12, 2022 at 07:01:01AM -0500, Andrew Dunstan wrote:
> On 2022-12-09 Fr 13:44, Nathan Bossart wrote:
>> Any thoughts on $SUBJECT?
> 
> Yeah, the discussion got way off into the weeds here. I think the
> original proposal seems reasonable. Please add it to the next CF if you
> haven't already.

Here it is: https://commitfest.postgresql.org/41/4043/

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Allow parallel plan for referential integrity checks?

2022-12-12 Thread Frédéric Yhuel




On 12/11/22 06:29, Ian Lawrence Barwick wrote:

On Tue, Jul 26, 2022 at 20:58, Frédéric Yhuel wrote:




On 4/14/22 14:25, Frédéric Yhuel wrote:



On 3/19/22 01:57, Imseih (AWS), Sami wrote:

I looked at your patch and it's a good idea to make foreign key
validation
use parallel query on large relations.

It would be valuable to add logging to ensure that the ActiveSnapshot
and TransactionSnapshot
is the same for the leader and the workers. This logging could be
tested in the TAP test.

Also, inside RI_Initial_Check you may want to set max_parallel_workers to
max_parallel_maintenance_workers.

Currently the work_mem is set to maintenance_work_mem. This will also
require
a doc change to call out.

/*
   * Temporarily increase work_mem so that the check query can be
executed
   * more efficiently.  It seems okay to do this because the query
is simple
   * enough to not use a multiple of work_mem, and one typically
would not
   * have many large foreign-key validations happening
concurrently.  So
   * this seems to meet the criteria for being considered a
"maintenance"
   * operation, and accordingly we use maintenance_work_mem.
However, we



Hello Sami,

Thank you for your review!

I will try to do as you say, but it will take time, since my daily job
as database consultant takes most of my time and energy.



Hi,

As suggested by Jacob, here is a quick message to say that I didn't find
enough time to work further on this patch, but I didn't completely
forget it either. I moved it to the next commitfest. Hopefully I will
find enough time and motivation in the coming months :-)


Hi Frédéric

This patch has been carried forward for a couple more commitfests since
your message; do you think you'll be able to work on it further during this
release cycle?



Hi Ian,

I plan to work on it full time in week 10 (6-10 March), if you agree to
bear with me. The idea would be to bootstrap my brain on it, and then
continue to work on it from time to time.


Best regards,
Frédéric




Re: Add sub-transaction overflow status in pg_stat_activity

2022-12-12 Thread Robert Haas
On Wed, Nov 30, 2022 at 11:01 AM Robert Haas  wrote:
> That's not responsive to the need that I have. I need users to be able
> to figure out which backend(s) are overflowing their snapshots -- and
> perhaps how badly and how often --- not which backends are incurring
> an expense as a result. There may well be a use case for the latter
> thing but it's a different problem.

So ... I want to go ahead and commit Dilip's v4 patch, or something
very like it. Most people were initially supportive. Tom expressed
some opposition, but it sounds like that was mostly to the discussion
going on and on rather than the idea per se. Andres also expressed
some concerns, but I really think the problem he's worried about is
something slightly different and need not block this work. I note also
that the v4 patch is designed in such a way that it does not change
any view definitions, so the compatibility impact of committing it is
basically nil.
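
For the record, since the information is exposed through a function rather
than a view change, monitoring would look roughly like this (function and
column names are the ones used in the patch and its docs; illustrative only):

SELECT backendid, s.subxact_count, s.subxact_overflow
FROM pg_stat_get_backend_idset() AS backendid,
     pg_stat_get_backend_subxact(backendid) AS s
WHERE s.subxact_overflow;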

Any strenuous objections?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-12 Thread Masahiko Sawada
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
 wrote:
>
>
> On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 9, 2022 at 5:53 PM John Naylor  
> > wrote:
> > >
>
> > > I don't think that'd be very controversial, but I'm also not sure why 
> > > we'd need 4MB -- can you explain in more detail what exactly we'd need so 
> > > that the feature would work? (The minimum doesn't have to work *well* 
> > > IIUC, just do some useful work and not fail).
> >
> > The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
> > the radix tree is using dsa_get_total_size(). If the size returned by
> > dsa_get_total_size() (+ some memory used by TIDStore meta information)
> > exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> > and heap vacuum. However, when allocating DSA memory for
> > radix_tree_control at creation, we allocate 1MB
> > (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> > radix_tree_control from it. das_get_total_size() returns 1MB even if
> > there is no TID collected.
>
> 2MB makes sense.
>
> If the metadata is small, it seems counterproductive to count it towards the 
> total. We want the decision to be driven by blocks allocated. I have an idea 
> on that below.
>
> > > Remember when we discussed how we might approach parallel pruning? I 
> > > envisioned a local array of a few dozen kilobytes to reduce contention on 
> > > the tidstore. We could use such an array even for a single worker (always 
> > > doing the same thing is simpler anyway). When the array fills up enough 
> > > so that the next heap page *could* overflow it: Stop, insert into the 
> > > store, and check the store's memory usage before continuing.
> >
> > Right, I think it's no problem in slab cases. In DSA cases, the new
> > segment size follows a geometric series that approximately doubles the
> > total storage each time we create a new segment. This behavior comes
> > from the fact that the underlying DSM system isn't designed for large
> > numbers of segments.
>
> And taking a look, the size of a new segment can get quite large. It seems we 
> could test if the total DSA area allocated is greater than half of 
> maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, 
> then the area will contain just under a power of two the last time it passes 
> the test. The next segment will bring it to about 3/4 full, like this:
>
> maintenance work mem = 256MB, so stop if we go over 128MB:
>
> 2*(1+2+4+8+16+32) = 126MB -> keep going
> 126MB + 64 = 190MB-> stop
>
> That would be a simple way to be conservative with the memory limit. The 
> unfortunate aspect is that the last segment would be mostly wasted, but it's 
> paradise compared to the pessimistically-sized single array we have now (even 
> with Peter G.'s VM snapshot informing the allocation size, I imagine).

Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.

>
> And as for minimum possible maintenance work mem, I think this would work 
> with 2MB, if the community is okay with technically going over the limit by a 
> few bytes of overhead if a buildfarm animal set to that value. I imagine it 
> would never go over the limit for realistic (and even most unrealistic) 
> values. Even with a VM snapshot page in memory and small local arrays of 
> TIDs, I think with this scheme we'll be well under the limit.

Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree. Templating whether or not to count
the memory usage might help avoid the overheads.

> After this feature is complete, I think we should consider a follow-on patch 
> to get rid of vacuum_work_mem, since it would no longer be needed.

I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Date-Time dangling unit fix

2022-12-12 Thread Joseph Koshakow
I just found another class of this bug that the submitted patch does
not fix. If the units are at the beginning of the string, then they are
also ignored. For example, `date 'm d y2020m11d3'` is also valid. I
think the fix here is to check and make sure that ptype is 0 before
reassigning the value to a non-zero number. I'll send an updated patch
with this tonight.
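
For reference, the case described above is simply:

SELECT date 'm d y2020m11d3';   -- currently accepted; the leading units are ignored
-- after the fix this input should be rejected with an error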


Re: sendFileWithContent() does not advance the source pointer

2022-12-12 Thread Robert Haas
On Thu, Dec 8, 2022 at 2:43 PM Antonin Houska  wrote:
> When checking something else in the base backup code, I've noticed that
> sendFileWithContent() does not advance the 'content' pointer. The sink buffer
> is large enough (32kB) so that the first iteration usually processes the whole
> file (only special files are processed by this function), and thus that the
> problem is hidden.
>
> However it's possible to hit the issue: if there are too many tablespaces,
> pg_basebackup generates corrupted tablespace_map. Instead of writing all the
> tablespace paths it writes only some and then starts to write the contents
> from the beginning again.

Thanks for the report, analysis, and fix. I have committed your patch
and back-patched to v15.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Raising the SCRAM iteration count

2022-12-12 Thread Jonathan S. Katz

On 12/9/22 7:15 PM, Andres Freund wrote:

> Hi,
>
> On 2022-12-09 11:55:07 +0100, Daniel Gustafsson wrote:
>> Our current hardcoded value for iteration count is 4096, which is based on a
>> recommendation from RFC 7677.  This is however the lower end of the scale, and
>> is related to computing power in 2015 generation handheld devices.  The
>> relevant paragraph in section 4 of RFC 7677 [1] reads:
>>
>> "As a rule of thumb, the hash iteration-count should be such that a modern
>> machine will take 0.1 seconds to perform the complete algorithm; however,
>> this is unlikely to be practical on mobile devices and other relatively low-
>> performance systems.  At the time this was written, the rule of thumb gives
>> around 15,000 iterations required; however, a hash iteration-count of 4096
>> takes around 0.5 seconds on current mobile handsets."
>>
>> It goes on to say:
>>
>> "..the recommendation of this specification is that the hash iteration-count
>> SHOULD be at least 4096, but careful consideration ought to be given to
>> using a significantly higher value, particularly where mobile use is less
>> important."
>>
>> Selecting 4096 was thus a conservative take already in 2015, and is now very
>> much so.  On my 2020-vintage Macbook I need ~200k iterations to consume 0.1
>> seconds (in a build with assertions).  Calculating tens of thousands of hashes
>> per second on a consumer laptop at a 4096 iteration count is no stretch.  A
>> brief look shows that MongoDB has a minimum of 5000 with a default of 15000
>> [2]; Kafka has a minimum of 4096 [3].
>>
>> Making the iteration count a configurable setting would allow installations to
>> raise the iteration count to strengthen against brute force attacks, while
>> still supporting those with lower end clients who prefer the trade-off of
>> shorter authentication times.
>>
>> The attached introduces a scram_iteration_count GUC with a default of 15000
>> (still conservative, from RFC7677) and a minimum of 4096.  Since the iterations
>> are stored per secret it can be altered with backwards compatibility.

To throw on a bit of paint, if we do change it, we should likely follow 
what would come out in a RFC.


While the SCRAM-SHA-512 RFC is still in draft[1], the latest draft of it
contains a "SHOULD" recommendation of 10,000, which was bumped up from
4096 in an earlier version of the draft:


==snip==
Therefore, the recommendation of this specification is that the hash
iteration-count SHOULD be at least 10,000, but careful consideration
ought to be given to using a significantly higher value, particularly
where mobile use is less important.

==snip==

I'm currently ambivalent (+0) on changing the default. I think giving 
the user more control over iterations ([2], and follow up work to make 
it easier to set iteration account via client) can help with this.


However, I do like the idea of a GUC.
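
For illustration, the knob from the patch would be used roughly like this
(GUC name and default are taken from Daniel's description above; the role
name is invented):

SET scram_iteration_count = 15000;
ALTER ROLE app_user PASSWORD 'correct-horse-battery-staple';
-- With password_encryption = 'scram-sha-256', the stored verifier now
-- records 15000 iterations; existing clients keep working because the
-- count is read back from the stored secret during authentication.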


> I am extremely doubtful it's a good idea to increase the default (if anything
> the opposite). 0.1 seconds is many times the connection establishment
> overhead, even over network.  I've seen users complain about postgres
> connection establishment overhead being high, and it just turned out to be due
> to scram - yes, they ended up switching to md5, because that was the only
> viable alternative.


Ugh, I'd be curious to know how often that is the case. That said, I 
think some of the above work could help with that.


Thanks,

Jonathan

[1] https://datatracker.ietf.org/doc/html/draft-melnikov-scram-sha-512
[2] https://postgr.es/m/fce7228e-d0d6-64a1-3dcb-bba85c2fa...@postgresql.org/




Re: Add PL/pgSQL extra check no_data_found

2022-12-12 Thread Pavel Stehule
On Mon, Dec 12, 2022 at 14:16, Мельников Игорь wrote:

> I know, I know.
> But ora2pg does NOT convert source code in application-tier anonymous blocks
> and dynamic SQL in server-side PL/SQL.
> This part of the application needs to be rewritten manually.
>
> "no_data_found" for plpgsql.extra_errors and plpgsql.extra_warnings would
> reduce this part of the work.
>
> Also, in my opinion, it looks strange that too_many_rows is in
> plpgsql.extra_errors and plpgsql.extra_warnings, but no_data_found is NOT.
> Why?
>

The extra checks are not designed for compatibility with Oracle. They are
designed to implement some common checks that are harder or slower to
implement in plpgsql.

The no_data_found issue can be simply checked with the FOUND variable. On the
other hand, too_many_rows is (a little bit) more complex: you need to use the
GET DIAGNOSTICS command and IF.

Extra checks were designed to catch some less frequent but nasty errors and to
help write safer code. They are not designed for better portability from
Oracle.
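
As a minimal sketch of the difference (table and function names are invented):
the FOUND test below avoids an exception handler, while SELECT ... INTO STRICT
(or the proposed extra check) would make the bare SELECT INTO itself raise
no_data_found:

CREATE FUNCTION get_val(p_id int) RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    v text;
BEGIN
    SELECT val INTO v FROM t WHERE id = p_id;  -- plain INTO: v stays NULL on no row
    IF NOT FOUND THEN
        RAISE EXCEPTION 'no row for id %', p_id;
    END IF;
    RETURN v;
END;
$$;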

Regards

Pavel



> Thanx
>
> Best Regards
> Igor Melnikov
>
>
>
> Monday, December 12, 2022, 16:01 +03:00 from Pavel Stehule <
> pavel.steh...@gmail.com >:
>
> Hi
>
> On Mon, Dec 12, 2022 at 13:37, Мельников Игорь wrote:
>
> Hi!
>
> This new feature will be in demand for customers who migrate their
> large applications (having millions of lines of PL/SQL code) from Oracle to
> PostgreSQL.
> It will reduce the amount of work on rewriting the code and will provide an
> opportunity to reduce budgets for the migration project.
>
> Yes, in case the part of the code that handles no_data_found is executed
> very often, this will cause performance loss.
> During the testing phase, this will be discovered and the customer will
> rewrite these problem areas of the code - add the phrase STRICT.
> He will not need to change all the code at the very beginning, as it
> happens now, without this feature.
>
>
> ora2pg does this work by default. It is great tool and reduces lot of work
>
> https://ora2pg.darold.net/
>
> Regards
>
> Pavel
>
>
>
>
> *I am convinced that this functionality will attract even more customers
> to PostgreSQL - it will increase the popularity of the PostgreSQL DBMS.*
>
> Thank you!
>
> Best Regards
> Igor Melnikov
>
>
>
> Monday, December 12, 2022, 15:23 +03:00 from Pavel Stehule <
> pavel.steh...@gmail.com
> >:
>
>
>
> On Thu, Dec 8, 2022 at 12:29 Sergey Shinderuk <
> s.shinde...@postgrespro.ru
> >
> wrote:
>
> Hello,
>
> I propose to add a new value "no_data_found" for the
> plpgsql.extra_errors and plpgsql.extra_warnings parameters [1].
>
> With plpgsql.extra_errors=no_data_found SELECT INTO raises no_data_found
> exception when the result set is empty. With
> plpgsql.extra_errors=too_many_rows,no_data_found SELECT INTO behaves
> like SELECT INTO STRICT [2]. This could simplify migration from PL/SQL
> and may be just more convenient.
>
> One potential downside is that plpgsql.extra_errors=no_data_found could
> break existing functions expecting to get null or checking IF found
> explicitly. This is also true for the too_many_rows exception, but
> arguably it's a programmer error, while no_data_found switches to a
> different convention for handling (or not handling) an empty result with
> SELECT INTO.
>
> Otherwise the patch is straightforward.
>
> What do you think?
>
>
> I am not against it. It makes sense.
>
> I don't like the idea about possible replacement of INTO STRICT by INTO +
> extra warnings.
>
> Handling exceptions is significantly more expensive than in Oracle, and
> using INTO without STRICT with the next test IF NOT FOUND THEN can save one
> safepoint and one handling an exception. It should be mentioned in the
> documentation. Using this very common Oracle's pattern can have a very
> negative impact on performance in Postgres. If somebody does port from
> Oracle, and wants compatible behavior then he should use INTO STRICT. I
> think it is counterproductive to hide syntax differences when there is a
> significant difference in performance (and will be).
>
> Regards
>
> Pavel
>
>
>
>
>
> --
> Sergey Shinderukhttps://postgrespro.com/
>
>
> [1]
>
> https://www.postgresql.org/docs/devel/plpgsql-development-tips.html#PLPGSQL-EXTRA-CHECKS
> [2]
>
> https://www.postgresql.org/docs/devel/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
>
>
>
> Best regards,
> Мельников Игорь
> melnikov...@mail.ru
> 
>
>
>
>
> Best regards,
> Мельников Игорь
> melnikov...@mail.ru 
>
>


Re: Order getopt arguments

2022-12-12 Thread Peter Eisentraut

On 05.12.22 18:04, Robert Haas wrote:
> On Mon, Dec 5, 2022 at 11:51 AM Tom Lane  wrote:
>> Robert Haas  writes:
>>> I was only talking about the actual argument to getopt(), not the
>>> order of the code stanzas. I'm not sure what we ought to do about the
>>> latter.
>>
>> 100% agreed that the getopt argument should just be alphabetical.
>> But the bulk of Peter's patch is rearranging switch cases to agree
>> with that, and if you want to do that then you have to also think
>> about long options, which are not in the getopt argument.  I'm
>> not entirely convinced that reordering the switch cases is worth
>> troubling over.
>
> I'm not particularly sold on that either, but neither am I
> particularly opposed to it.

I have committed it as posted.





Re: [PATCH] Add native windows on arm64 support

2022-12-12 Thread Niyas Sait

Hi,

On 05/12/2022 18:14, Andres Freund wrote:


>> With meson gaining in maturity, perhaps that's not the most urgent
>> thing as we will likely remove src/tools/msvc/ soon but I'd rather do
>> that right anyway as much as I can to avoid an incorrect state in the
>> tree at any time in its history.
>
> I'd actually argue that we should just not add win32 support to
> src/tools/msvc/.




I think the old build system specific part is really minimal in the
patch. I can strip those out if that's preferred.




> --- a/src/include/storage/s_lock.h
> +++ b/src/include/storage/s_lock.h
> @@ -708,13 +708,21 @@ typedef LONG slock_t;
>  #define SPIN_DELAY() spin_delay()
>  
>  /* If using Visual C++ on Win64, inline assembly is unavailable.
> - * Use a _mm_pause intrinsic instead of rep nop.
> + * Use _mm_pause (x64) or __isb (arm64) intrinsic instead of rep nop.
>   */
>  #if defined(_WIN64)
>  static __forceinline void
>  spin_delay(void)
>  {
> +#ifdef _M_ARM64
> +   /*
> +* See spin_delay aarch64 inline assembly definition above for details
> +* ref: 
> https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
> +   */
> +   __isb(_ARM64_BARRIER_SY);
> +#else
>  _mm_pause();
> +#endif
>   }
>  #else
>  static __forceinline void
>
> This looks somewhat wrong to me. We end up with some ifdefs on the function
> definition level, and some others inside the function body. I think it should
> be either or.



Ok, I can add an MSVC/ARM64 specific function.


> diff --git a/meson.build b/meson.build
> index 725e10d815..e354ad7650 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -1944,7 +1944,13 @@ int main(void)
>   
>   elif host_cpu == 'arm' or host_cpu == 'aarch64'
>   
> -  prog = '''
> +  if cc.get_id() == 'msvc'
> +cdata.set('USE_ARMV8_CRC32C', false)
> +cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> +have_optimized_crc = true
> +  else
> +
> +prog = '''
>   #include <arm_acle.h>
>   
>   int main(void)
>
> Why does this need to be hardcoded? The compiler probe should just work for
> msvc.



There are a couple of minor issues in the code probe with MSVC, such as 
arm_acle.h needing to be removed and intrin.h needing an explicit include. 
But even with those fixes, USE_ARMV8_CRC32C would be set and no runtime CRC 
extension check would be performed. Since the CRC extension is optional in 
ARMv8, it would be better to use the CRC variant with a runtime check. So I 
ended up following the x64 variant and hardcoded the flags for the ARM64 and 
MSVC case.



--
Niyas




Re: Add PL/pgSQL extra check no_data_found

2022-12-12 Thread Мельников Игорь

I know, I know.
But ora2pg does NOT convert source code in application-tier anonymous blocks 
and dynamic SQL inside server-side PL/SQL.
This part of the application needs to be rewritten manually.
 
"no_data_found" for plpgsql.extra_errors and plpgsql.extra_warnings would 
reduce this part of the work.
 
Also, in my opinion, it looks strange that too_many_rows is in 
plpgsql.extra_errors and plpgsql.extra_warnings, but no_data_found is NOT.
Why?
 
Thanks
 
Best Regards
Igor Melnikov

 
>Monday, December 12, 2022, 16:01 +03:00 from Pavel Stehule < 
>pavel.steh...@gmail.com >:
> 
>Hi  
>On Mon, Dec 12, 2022 at 13:37, Мельников Игорь < melnikov...@mail.ru > 
>wrote:
>>Hi!
>> 
>>This new feature will be in demand for customers who migrate their 
>>large applications (having millions of lines of PL/SQL code) from Oracle to 
>>PostgreSQL.
>>It will reduce the amount of work on rewriting the code will provide an 
>>opportunity to reduce budgets for the migration project.
>> 
>>Yes, in case the part of the code that handles no_data_found is executed very 
>>often, this will cause performance loss.
>>During the testing phase, this will be discovered and the customer will 
>>rewrite these problem areas of the code - add the phrase STRICT.
>>He will not need to change all the code at the very beginning, as it happens 
>>now, without this feature.
> 
>ora2pg does this work by default. It is great tool and reduces lot of work
> 
>https://ora2pg.darold.net/
> 
>Regards
> 
>Pavel
> 
> 
>> 
>>I am convinced that this functionality will attract even more customers to 
>>PostgreSQL - it will increase the popularity of the PostgeSQL DBMS.
>> 
>>Thank you!
>> 
>>Best Regards
>>Igor Melnikov
>>
>>  
>>>Monday, December 12, 2022, 15:23 +03:00 from Pavel Stehule < 
>>>pavel.steh...@gmail.com >:
>>> 
>>>   
>>>On Thu, Dec 8, 2022 at 12:29, Sergey Shinderuk < 
>>>s.shinde...@postgrespro.ru > wrote: 
Hello,

I propose to add a new value "no_data_found" for the
plpgsql.extra_errors and plpgsql.extra_warnings parameters [1].

With plpgsql.extra_errors=no_data_found SELECT INTO raises no_data_found
exception when the result set is empty. With
plpgsql.extra_errors=too_many_rows,no_data_found SELECT INTO behaves
like SELECT INTO STRICT [2]. This could simplify migration from PL/SQL
and may be just more convenient.

One potential downside is that plpgsql.extra_errors=no_data_found could
break existing functions expecting to get null or checking IF found
explicitly. This is also true for the too_many_rows exception, but
arguably it's a programmer error, while no_data_found switches to a
different convention for handling (or not handling) an empty result with
SELECT INTO.

Otherwise the patch is straightforward.

What do you think?
>>> 
>>>I am not against it. It makes sense.
>>> 
>>>I don't like the idea about possible replacement of INTO STRICT by INTO + 
>>>extra warnings.
>>> 
>>>Handling exceptions is significantly more expensive than in Oracle, and 
>>>using INTO without STRICT with the next test IF NOT FOUND THEN can save one 
>>>safepoint and one handling an exception. It should be mentioned in the 
>>>documentation. Using this very common Oracle's pattern can have a very 
>>>negative impact on performance in Postgres. If somebody does port from 
>>>Oracle, and wants compatible behavior then he should use INTO STRICT. I 
>>>think it is counterproductive to hide syntax differences when there is a 
>>>significant difference in performance (and will be).
>>> 
>>>Regards
>>> 
>>>Pavel
>>> 
>>> 
>>>    
--
Sergey Shinderuk                 https://postgrespro.com/


[1]
https://www.postgresql.org/docs/devel/plpgsql-development-tips.html#PLPGSQL-EXTRA-CHECKS
[2]
https://www.postgresql.org/docs/devel/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
 
>> 
>> 
>>Best regards,
>>Мельников Игорь
>>melnikov...@mail.ru
>>  
 
 
Best regards,
Мельников Игорь
melnikov...@mail.ru
 

Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-12-12 Thread Önder Kalacı
Hi,

Thanks for the heads-up.


> Needs another rebase, I think:
>
> https://cirrus-ci.com/task/559244463758
>
> [05:44:22.102] FAILED:
> src/backend/postgres_lib.a.p/replication_logical_worker.c.o
> [05:44:22.102] ccache cc -Isrc/backend/postgres_lib.a.p -Isrc/include
> -I../src/include -Isrc/include/storage -Isrc/include/utils
> -Isrc/include/catalog -Isrc/include/nodes -fdiagnostics-color=always -pipe
> -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes
> -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute
> -Wimplicit-fallthrough=3 -Wcast-function-type -Wshadow=compatible-local
> -Wformat-security -Wdeclaration-after-statement -Wno-format-truncation
> -Wno-stringop-truncation -fPIC -pthread -DBUILDING_DLL -MD -MQ
> src/backend/postgres_lib.a.p/replication_logical_worker.c.o -MF
> src/backend/postgres_lib.a.p/replication_logical_worker.c.o.d -o
> src/backend/postgres_lib.a.p/replication_logical_worker.c.o -c
> ../src/backend/replication/logical/worker.c
> [05:44:22.102] ../src/backend/replication/logical/worker.c: In function
> ‘get_usable_indexoid’:
> [05:44:22.102] ../src/backend/replication/logical/worker.c:2101:36: error:
> ‘ResultRelInfo’ has no member named ‘ri_RootToPartitionMap’
> [05:44:22.102]  2101 |   TupleConversionMap *map =
> relinfo->ri_RootToPartitionMap;
> [05:44:22.102]   |^~
>
>
Yes, it seems the commit (fb958b5da86da69651f6fb9f540c2cfb1346cdc5) broke
the build and commit (a61b1f74823c9c4f79c95226a461f1e7a367764b) broke the
tests. But the fixes were trivial. All tests pass again.

Attached v22.

Onder KALACI


v22_0001_use_index_on_subs_when_pub_rep_ident_full.patch
Description: Binary data


Re: Infinite Interval

2022-12-12 Thread Ashutosh Bapat
Hi Joseph,
I stumbled upon this requirement a few times. So I started working on
this support in my spare time as a hobby project to understand
horology code in PostgreSQL. This was sitting in my repositories for
more than an year. Now that I have someone else showing an interest,
it's time for it to face the world. Rebased it, fixed conflicts.

PFA patch implementing infinite interval. It's still WIP, there are
TODOs in the code and also the commit message lists things that are
known to be incomplete. You might want to assess expected output
carefully.

On Sun, Dec 11, 2022 at 12:51 AM Joseph Koshakow  wrote:
> The proposed design from the most recent thread was to reserve
> INT32_MAX months for infinity and INT32_MIN months for negative
> infinity. As pointed out in the thread, these are currently valid
> non-infinite intervals, but they are out of the documented range.

The patch uses both months and days together to avoid this problem.

Please feel free to complete the patch, work on review comments etc. I
will help as and when I find time.

-- 
Best Wishes,
Ashutosh Bapat
From 09a8fef224a3682059fe1adf42957f693a41d242 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 30 Apr 2020 10:06:49 +0530
Subject: [PATCH] Support infinite interval

This is WIP.

Following things are supported
1. Accepts '+/-infinity' as a valid string input for interval type.
2. Support interval_pl, interval_div
3. Tests in interval.sql for comparison operators working fine.

TODOs
1. Various TODOs in code
2. interval_pl: how to handle infinite values with opposite signs
3. timestamp, timestamptz, date and time arithmetic
4. Fix horology test.

Ashutosh Bapat
---
 src/backend/utils/adt/datetime.c   |   2 +
 src/backend/utils/adt/timestamp.c  | 166 -
 src/test/regress/expected/interval.out |  80 ++--
 src/test/regress/sql/interval.sql  |   8 ++
 4 files changed, 242 insertions(+), 14 deletions(-)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index b5b117a8ca..1e98c6dc78 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -3634,6 +3634,8 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 			case DTK_STRING:
 			case DTK_SPECIAL:
 type = DecodeUnits(i, field[i], );
+if (type == UNKNOWN_FIELD)
+	type = DecodeSpecial(i, field[i], );
 if (type == IGNORE_DTF)
 	continue;
 
diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 3f2508c0c4..0c7286b06e 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -79,6 +79,12 @@ static bool AdjustIntervalForTypmod(Interval *interval, int32 typmod,
 static TimestampTz timestamp2timestamptz(Timestamp timestamp);
 static Timestamp timestamptz2timestamp(TimestampTz timestamp);
 
+static void EncodeSpecialInterval(Interval *interval, char *str);
+static void interval_noend(Interval *interval);
+static bool interval_is_noend(Interval *interval);
+static void interval_nobegin(Interval *interval);
+static bool interval_is_nobegin(Interval *interval);
+static bool interval_not_finite(Interval *interval);
 
 /* common code for timestamptypmodin and timestamptztypmodin */
 static int32
@@ -943,6 +949,14 @@ interval_in(PG_FUNCTION_ARGS)
 		 errmsg("interval out of range")));
 			break;
 
+		case DTK_LATE:
+			interval_noend(result);
+			break;
+
+		case DTK_EARLY:
+			interval_nobegin(result);
+			break;
+
 		default:
 			elog(ERROR, "unexpected dtype %d while parsing interval \"%s\"",
  dtype, str);
@@ -965,8 +979,13 @@ interval_out(PG_FUNCTION_ARGS)
 			   *itm = 
 	char		buf[MAXDATELEN + 1];
 
-	interval2itm(*span, itm);
-	EncodeInterval(itm, IntervalStyle, buf);
+	if (interval_not_finite(span))
+		EncodeSpecialInterval(span, buf);
+	else
+	{
+		interval2itm(*span, itm);
+		EncodeInterval(itm, IntervalStyle, buf);
+	}
 
 	result = pstrdup(buf);
 	PG_RETURN_CSTRING(result);
@@ -1352,6 +1371,13 @@ AdjustIntervalForTypmod(Interval *interval, int32 typmod,
 		INT64CONST(0)
 	};
 
+	/*
+	 * Infinite interval after being subjected to typmod conversion remains
+	 * infinite.
+	 */
+	if (interval_not_finite(interval))
+		return;
+
 	/*
 	 * Unspecified range and precision? Then not necessary to adjust. Setting
 	 * typmod to -1 is the convention for all data types.
@@ -1545,6 +1571,17 @@ EncodeSpecialTimestamp(Timestamp dt, char *str)
 		elog(ERROR, "invalid argument for EncodeSpecialTimestamp");
 }
 
+static void
+EncodeSpecialInterval(Interval *interval, char *str)
+{
+	if (interval_is_nobegin(interval))
+		strcpy(str, EARLY);
+	else if (interval_is_noend(interval))
+		strcpy(str, LATE);
+	else		/* shouldn't happen */
+		elog(ERROR, "invalid argument for EncodeSpecialInterval");
+}
+
 Datum
 now(PG_FUNCTION_ARGS)
 {
@@ -2080,10 +2117,12 @@ timestamp_finite(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(!TIMESTAMP_NOT_FINITE(timestamp));
 }
 
+/* TODO: modify this to check finite-ness */
 

Re: Add PL/pgSQL extra check no_data_found

2022-12-12 Thread Pavel Stehule
Hi

On Mon, Dec 12, 2022 at 13:37, Мельников Игорь 
wrote:

> Hi!
>
> This new feature will be in demand for customers who migrate their
> large applications (having millions of lines of PL/SQL code) from Oracle to
> PostgreSQL.
> It will reduce the amount of work on rewriting the code will provide an
> opportunity to reduce budgets for the migration project.
>
> Yes, in case the part of the code that handles no_data_found is executed
> very often, this will cause performance loss.
> During the testing phase, this will be discovered and the customer will
> rewrite these problem areas of the code - add the phrase STRICT.
> He will not need to change all the code at the very beginning, as it
> happens now, without this feature.
>

ora2pg does this work by default. It is a great tool and reduces a lot of work.

https://ora2pg.darold.net/

Regards

Pavel



>
> *I am convinced that this functionality will attract even more customers
> to PostgreSQL - it will increase the popularity of the PostgeSQL DBMS.*
>
> Thank you!
>
> Best Regards
> Igor Melnikov
>
>
>
> Monday, December 12, 2022, 15:23 +03:00 from Pavel Stehule <
> pavel.steh...@gmail.com>:
>
>
>
> On Thu, Dec 8, 2022 at 12:29, Sergey Shinderuk <
> s.shinde...@postgrespro.ru
> > wrote:
>
> Hello,
>
> I propose to add a new value "no_data_found" for the
> plpgsql.extra_errors and plpgsql.extra_warnings parameters [1].
>
> With plpgsql.extra_errors=no_data_found SELECT INTO raises no_data_found
> exception when the result set is empty. With
> plpgsql.extra_errors=too_many_rows,no_data_found SELECT INTO behaves
> like SELECT INTO STRICT [2]. This could simplify migration from PL/SQL
> and may be just more convenient.
>
> One potential downside is that plpgsql.extra_errors=no_data_found could
> break existing functions expecting to get null or checking IF found
> explicitly. This is also true for the too_many_rows exception, but
> arguably it's a programmer error, while no_data_found switches to a
> different convention for handling (or not handling) an empty result with
> SELECT INTO.
>
> Otherwise the patch is straightforward.
>
> What do you think?
>
>
> I am not against it. It makes sense.
>
> I don't like the idea about possible replacement of INTO STRICT by INTO +
> extra warnings.
>
> Handling exceptions is significantly more expensive than in Oracle, and
> using INTO without STRICT with the next test IF NOT FOUND THEN can save one
> safepoint and one handling an exception. It should be mentioned in the
> documentation. Using this very common Oracle's pattern can have a very
> negative impact on performance in Postgres. If somebody does port from
> Oracle, and wants compatible behavior then he should use INTO STRICT. I
> think it is counterproductive to hide syntax differences when there is a
> significant difference in performance (and will be).
>
> Regards
>
> Pavel
>
>
>
>
>
> --
> Sergey Shinderuk                 https://postgrespro.com/
>
>
> [1]
>
> https://www.postgresql.org/docs/devel/plpgsql-development-tips.html#PLPGSQL-EXTRA-CHECKS
> [2]
>
> https://www.postgresql.org/docs/devel/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
>
>
>
> Best regards,
> Мельников Игорь
> melnikov...@mail.ru
>
>


Re: Add PL/pgSQL extra check no_data_found

2022-12-12 Thread Мельников Игорь

Hi!
 
This new feature will be in demand for customers who migrate their large 
applications (having millions of lines of PL/SQL code) from Oracle to 
PostgreSQL.
It will reduce the amount of work on rewriting the code and will provide an 
opportunity to reduce budgets for the migration project.
 
Yes, in case the part of the code that handles no_data_found is executed very 
often, this will cause a performance loss.
During the testing phase, this will be discovered and the customer will rewrite 
these problem areas of the code - adding the phrase STRICT.
He will not need to change all the code at the very beginning, as happens 
now, without this feature.
 
I am convinced that this functionality will attract even more customers to 
PostgreSQL - it will increase the popularity of the PostgreSQL DBMS.
 
Thank you!
 
Best Regards
Igor Melnikov

  
>Monday, December 12, 2022, 15:23 +03:00 from Pavel Stehule 
>:
> 
>   
>On Thu, Dec 8, 2022 at 12:29, Sergey Shinderuk < 
>s.shinde...@postgrespro.ru > wrote: 
>>Hello,
>>
>>I propose to add a new value "no_data_found" for the
>>plpgsql.extra_errors and plpgsql.extra_warnings parameters [1].
>>
>>With plpgsql.extra_errors=no_data_found SELECT INTO raises no_data_found
>>exception when the result set is empty. With
>>plpgsql.extra_errors=too_many_rows,no_data_found SELECT INTO behaves
>>like SELECT INTO STRICT [2]. This could simplify migration from PL/SQL
>>and may be just more convenient.
>>
>>One potential downside is that plpgsql.extra_errors=no_data_found could
>>break existing functions expecting to get null or checking IF found
>>explicitly. This is also true for the too_many_rows exception, but
>>arguably it's a programmer error, while no_data_found switches to a
>>different convention for handling (or not handling) an empty result with
>>SELECT INTO.
>>
>>Otherwise the patch is straightforward.
>>
>>What do you think?
> 
>I am not against it. It makes sense.
> 
>I don't like the idea about possible replacement of INTO STRICT by INTO + 
>extra warnings.
> 
>Handling exceptions is significantly more expensive than in Oracle, and using 
>INTO without STRICT with the next test IF NOT FOUND THEN can save one 
>safepoint and one handling an exception. It should be mentioned in the 
>documentation. Using this very common Oracle's pattern can have a very 
>negative impact on performance in Postgres. If somebody does port from Oracle, 
>and wants compatible behavior then he should use INTO STRICT. I think it is 
>counterproductive to hide syntax differences when there is a significant 
>difference in performance (and will be).
> 
>Regards
> 
>Pavel
> 
> 
>    
>>--
>>Sergey Shinderuk                 https://postgrespro.com/
>>
>>
>>[1]
>>https://www.postgresql.org/docs/devel/plpgsql-development-tips.html#PLPGSQL-EXTRA-CHECKS
>>[2]
>>https://www.postgresql.org/docs/devel/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
>> 
 
 
Best regards,
Мельников Игорь
melnikov...@mail.ru
 

Re: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Amit Kapila
On Mon, Dec 12, 2022 at 1:04 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> This is a reply for later part of your e-mail.
>
> > > (2) About the timeout issue
> > >
> > > When having a look at the physical replication internals,
> > > it conducts sending feedback and application of delay separately on 
> > > different
> > processes.
> > > OTOH, the logical replication needs to achieve those within one process.
> > >
> > > When we want to apply delay and avoid the timeout,
> > > we should not store all the transactions data into memory.
> > > So, one approach for this is to serialize the transaction data and after 
> > > the delay,
> > > we apply the transactions data.
> > >
> >
> > It is not clear to me how this will avoid a timeout.
>
> At first, the reason why the timeout occurs is that while delaying the apply
> worker neither reads messages from the walsender nor replies to it.
> The worker's last_recv_timeout will be not updated because it does not receive
> messages. This leads to wal_receiver_timeout. Similarly, the walsender's
> last_processing will be not updated and exit due to the timeout because the
> worker does not reply to upstream.
>
> Based on the above, we thought that workers must receive and handle messages
> even if they are delaying applying transactions. In more detail, workers must
> iterate the outer loop in LogicalRepApplyLoop().
>
> If workers receive transactions but they need to delay applying, they must 
> keep
> messages somewhere. So we came up with the idea that workers serialize changes
> once and apply later. Our basic design is as follows:
>
> * All transactions are serialized to files if min_apply_delay is set to 
> non-zero.
> * After receiving the commit message and spending time, workers reads and
>   applies spooled messages
>

I think this may be more work than required because in some cases
doing I/O just to delay xacts will later lead to more work. Can't we
send some ping to walsender to communicate that walreceiver is alive?
We already seem to be sending a ping in LogicalRepApplyLoop if we
haven't heard anything from the server for more than
wal_receiver_timeout / 2. Now, it is possible that the walsender is
terminated due to some other reason and we need to see if we can
detect that or if it will only be detected once the walreceiver's
delay time is over.

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2022-12-12 Thread Ajin Cherian
On Tue, Nov 15, 2022 at 10:57 AM Peter Smith  wrote:
>
> Here are some review comments for v32-0002
>
> ==
>
> 1. Commit message
>
> Comment says:
> While creating a publication, we register a command end
> trigger that deparses the DDL as a JSON blob, and WAL logs it. The event
> trigger is automatically removed at the time of drop publication.
>
> SUGGESTION (uppercase the SQL)
> During CREATE PUBLICATION we register a command end trigger that
> deparses the DDL as a JSON blob, and WAL logs it. The event
> trigger is automatically removed at the time of DROP PUBLICATION.
>
> ~~~

fixed.

>
> 2.
>
> Comment says:
> This is a POC patch to show how using event triggers and DDL deparsing
> facilities we can implement DDL replication. So, the implementation is
> restricted to CREATE TABLE/ALTER TABLE/DROP TABLE commands.
>
> ~
>
> Still correct or old comment gone stale?
>

Removed.

> ~~~
>
> 3.
>
> Comment says:
> Note that the replication for ALTER INDEX command is still under
> progress.
>
> ~
>
> Still correct or old comment gone stale?
>

Removed.

> ==
>
> 4. GENERAL - Patch order.
>
> Somehow, I feel this v32-0002 patch and the v32-0001 patch should be
> swapped. IIUC this one seems to me to be the "core" framework for the
> DDL message replication but the other 0001 was more like just the
> implements of all the supported different *kinds* of DDL JSON blobs.
> So actually this patch seems more like the mandatory one and the other
> one can just evolve as it gets more supported JSON.
>

I think there is a big patch reordering planned in future versions
based on this comment
and Alvaro's comment. Skipping this for now.

> ~~~
>
> 5. GENERAL - naming
>
> The DDL suffix 'msg' or 'message' seemed sometimes unnecessary because
> there is no ambiguity that this is a message for DDL replication, so
> the shorter name conveys the same amount of information, doesn't it?
>
> e.g. Maybe reconsider some of these ones (probably there are others)...
>
> src/include/replication/decode.h
> logicalddlmsg_decode -> Why not call this function logicalddl_decode?
>
> src/include/replication/logicalproto.h:
> LOGICAL_REP_MSG_DDLMESSAGE -> Why not call it 'LOGICAL_REP_MSG_DDL'?
> logicalrep_write_ddlmessage -> Why not call this function 
> logicalrep_write_ddl?
> logicalrep_read_ddlmessage -> Why not call this function logicalrep_read_ddl?
>
> src/include/replication/output_plugin.h:
> 'ddlmessage_cb' -> Why not call it 'ddl_cb'?
> 'stream_ddlmessage_cb' -> Why not call it 'stream_ddl_cb'?
>
> src/include/replication/reorderbuffer.h:
> - 'REORDER_BUFFER_CHANGE_DDLMESSAGE' --> Why not call it 'REORDER_BUFFER_CHANGE_DDL'?
> - 'ddlmsg' -> Why not call it 'ddl'?
> - 'ddlmessage' -> Why not call it 'ddl'?
> - 'stream_ddlmessage' -> Why not call it 'stream_ddl'?
>

Fixed.

> ==
>
> src/backend/access/rmgrdesc/Makefile
>
> 6.
>
> @@ -19,6 +19,7 @@ OBJS = \
>   hashdesc.o \
>   heapdesc.o \
>   logicalmsgdesc.o \
> + logicalddlmsgdesc.o \
>
> Change should be in alphabetical order.
>

Fixed.

> ==
>
> src/backend/access/rmgrdesc/logicalddlmsgdesc.c
>
> 7. logicalddlmsg_identify
>
> +const char *
> +logicalddlmsg_identify(uint8 info)
> +{
> + if ((info & ~XLR_INFO_MASK) == XLOG_LOGICAL_DDL_MESSAGE)
> + return "DDL MESSAGE";
> +
> + return NULL;
> +}
>
> The logicalrep_message_type (see below) said "DDL", so maybe this
> should also just say "DDL" instead of "DDL MESSAGE"
>
> @@ -1218,6 +1264,8 @@ logicalrep_message_type(LogicalRepMsgType action)
>   return "TYPE";
>   case LOGICAL_REP_MSG_MESSAGE:
>   return "MESSAGE";
> + case LOGICAL_REP_MSG_DDLMESSAGE:
> + return "DDL";
>

Fixed.

> ==
>
> src/backend/commands/event_trigger.c
>
> 8. start/end
>
> +/*
> + * publication_deparse_ddl_command_start
> + *
> + * Deparse the ddl command and log it.
> + */
> +Datum
> +publication_deparse_ddl_command_start(PG_FUNCTION_ARGS)
> ...
> +/*
> + * publication_deparse_ddl_command_end
> + *
> + * Deparse the ddl command and log it.
> + */
> +Datum
> +publication_deparse_ddl_command_end(PG_FUNCTION_ARGS)
>
> The start/end function comments are the same -- there should be some
> more explanation to say what they are for.
>

Updated with a more detailed explanation.

> ~~~
>
> 9. publication_deparse_ddl_command_start
>
> + char*command = psprintf("Drop table command start");
>
> Huh? So this function is only for this specific case of DROP TABLE? If
> correct, then I think that should be commented on or asserted
> somewhere.
>

Updated the comments specifying this.

> ~
>
> 10.
>
> + /* extract the relid from the parse tree */
> + foreach(cell1, stmt->objects)
>
> Uppercase comment
>

Fixed.

> ~
>
> 11.
>
> + if (relpersist == RELPERSISTENCE_TEMP)
> + {
> + table_close(relation, NoLock);
> + continue;
> + }
> +
> + LogLogicalDDLMessage("deparse", address.objectId, DCT_TableDropStart,
> + command, strlen(command) + 1);
> +
> + if (relation)
> + table_close(relation, NoLock);
>
> This code looks overly complex. Can't it just be like 

Re: Force streaming every change in logical decoding

2022-12-12 Thread Amit Kapila
On Sat, Dec 10, 2022 at 11:18 AM Dilip Kumar  wrote:
>
> On Wed, Dec 7, 2022 at 5:16 AM Peter Smith  wrote:
>
> +1 for the idea
>
> >
> > There is potential for lots of developer GUCs for testing/debugging in
> > the area of logical replication but IMO it might be better to keep
> > them all separated. Putting everything into a single
> > 'logical_replication_mode' might cause difficulties later when/if you
> > want combinations of the different modes.
> >
> > For example, instead of
> >
> > logical_replication_mode = XXX/YYY/ZZZ
> >
> > maybe something like below will give more flexibility.
> >
> > logical_replication_dev_XXX = true/false
> > logical_replication_dev_YYY = true/false
> > logical_replication_dev_ZZZ = true/false
> >
>
> Even I agree that usability wise keeping them independent is better.
>

But OTOH, doesn't introducing multiple GUCs (one to allow streaming
each change, another to allow serialization, and a third one to
probably test subscriber-side work) for the purpose of testing and
debugging logical replication code sound like a bit too much?

-- 
With Regards,
Amit Kapila.




Re: add \dpS to psql

2022-12-12 Thread Andrew Dunstan


On 2022-12-09 Fr 13:44, Nathan Bossart wrote:
> On Fri, Dec 09, 2022 at 10:40:55AM -0800, Nathan Bossart wrote:
>> On Fri, Dec 09, 2022 at 01:36:03PM +0900, Michael Paquier wrote:
>>> On Thu, Dec 08, 2022 at 09:15:03AM -0800, Nathan Bossart wrote:
 The main idea behind this work is breaking out privileges into more
 granular pieces.  If I want to create a role that only runs VACUUM on some
 tables on the weekend, why should I have to also give it the ability to
 ANALYZE, REFRESH, CLUSTER, and REINDEX?  IMHO we should really let the user
 decide what set of privileges makes sense for their use-case.  I'm unsure
 the grouping all these privileges together serves much purpose besides
 preserving ACL bits.
>>> Hmm.  I'd like to think that we should keep a frugal mind here.  More
>>> bits are now available, but it does not strike me as a good idea to
>>> force their usage more than necessary, so grouping all these no-quite
>>> DDL commands into the same bag does not sound that bad to me.
>> Okay, it seems I am outnumbered.  I will work on updating the patch to add
>> an ACL_MAINTAIN bit and a pg_maintain_all_tables predefined role.
> Any thoughts on $SUBJECT?


Yeah, the discussion got way off into the weeds here. I think the
original proposal seems reasonable. Please add it to the next CF if you
haven't already.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Takamichi Osumi (Fujitsu)
Hi,

On Saturday, December 10, 2022 12:08 AM Takamichi Osumi (Fujitsu) 
 wrote:
> On Friday, December 9, 2022 3:38 PM Kuroda, Hayato/黒田 隼人
>  wrote:
> > Thanks for reporting! I have analyzed the problem and found the root cause.
> >
> > This feature seemed not to work on 32-bit OSes. This was because the
> > calculation of delay_time was wrong. The first argument of this should
> > be TimestampTz datatype, not Datum:
> >
> > ```
> > +   /* Set apply delay */
> > +   delay_until =
> > TimestampTzPlusMilliseconds(TimestampTzGetDatum(ts),
> > +
> > + MySubscription->applydelay);
> > ```
> >
> > In more detail, the datum representation of int64 contains the value
> > itself on 64-bit OSes, but it contains the pointer to the value on 32-bit.
> >
> > After modifying the issue, this will work on 32-bit environments.
> Thank you for your analysis.
> 
> Yeah, it seems we conduct addition of values to the pointer value, which is
> returned from the call of TimestampTzGetDatum(), on 32-bit machine by
> mistake.
> 
> I'll remove the call in my next version.
Applied this fix in the last version, shared in [1].


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com
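
For reference, the corrected line simply drops the Datum conversion and passes
the TimestampTz value directly, along these lines (a sketch only; see the
posted patch for the exact change):

	/* Set apply delay; TimestampTzPlusMilliseconds() takes a TimestampTz */
	delay_until = TimestampTzPlusMilliseconds(ts, MySubscription->applydelay);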


Best Regards,
Takamichi Osumi





Re: generic plans and "initial" pruning

2022-12-12 Thread Amit Langote
On Fri, Dec 9, 2022 at 8:37 PM Alvaro Herrera  wrote:
> On 2022-Dec-09, Amit Langote wrote:
>
> > Pruning will be done afresh on every fetch of a given cached plan when
> > CheckCachedPlan() is called on it, so the part_prune_results_list part
> > will be discarded and rebuilt as many times as the plan is executed.
> > You'll find a description around CachedPlanSavePartitionPruneResults()
> > that's in v12.
>
> I see.
>
> In that case, a separate container struct seems warranted.

I thought about this today and played around with some container struct ideas.

Though, I started feeling like putting all the new logic being added
by this patch into plancache.c at the heart of GetCachedPlan() and
tweaking its API in kind of unintuitive ways may not have been such a
good idea to begin with.  So I started thinking again about your
GetRunnablePlan() wrapper idea and thought maybe we could do something
with it.  Let's say we name it GetCachedPlanLockPartitions() and put
the logic that does initial pruning with the new
ExecutorDoInitialPruning() in it, instead of in the normal
GetCachedPlan() path.  Any callers that call GetCachedPlan() instead
call GetCachedPlanLockPartitions() with either the List ** parameter
as now or some container struct if that seems better.  Whether
GetCachedPlanLockPartitions() needs to do anything other than return
the CachedPlan returned by GetCachedPlan() can be decided by the
latter setting, say, CachedPlan.has_unlocked_partitions.  That will be
done by AcquireExecutorLocks() when it sees containsInitialPruning in
any of the PlannedStmts it sees, locking only the
PlannedStmt.minLockRelids set (which is all relations where no pruning
is needed!), leaving the partition locking to
GetCachedPlanLockPartitions().  If the CachedPlan is invalidated
during the partition locking phase, it calls GetCachedPlan() again;
maybe some refactoring is needed to avoid too much useless work in
such cases.

Thoughts?

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com




RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Takamichi Osumi (Fujitsu)
On Friday, November 25, 2022 5:43 AM Peter Smith  wrote:
> On Fri, Nov 25, 2022 at 2:15 AM Takamichi Osumi (Fujitsu)
>  wrote:
> >
> > On Wednesday, October 5, 2022 6:42 PM Peter Smith
>  wrote:
> ...
> >
> > > ==
> > >
> > > 5. src/backend/commands/subscriptioncmds.c - SubOpts
> > >
> > > @@ -89,6 +91,7 @@ typedef struct SubOpts
> > >   bool disableonerr;
> > >   char*origin;
> > >   XLogRecPtr lsn;
> > > + int64 min_apply_delay;
> > >  } SubOpts;
> > >
> > > I feel it would be better to be explicit about the storage units. So
> > > call this member ‘min_apply_delay_ms’. E.g. then other code in
> > > parse_subscription_options will be more natural when you are
> > > converting using and assigning them to this member.
> > I don't think we use such names including units explicitly.
> > Could you please tell me a similar example for this ?
> >
> 
> Regex search "\..*_ms[e\s]" finds some members where the unit is in the
> member name.
> 
> e.g. delay_ms (see EnableTimeoutParams in timeout.h) e.g. interval_in_ms (see
> timeout_params in timeout.c)
> 
> Regex search ".*_ms[e\s]" finds many local variables where the unit is in the
> variable name
> 
> > > ==
> > >
> > > 16. src/include/catalog/pg_subscription.h
> > >
> > > + int64 subapplydelay; /* Replication apply delay */
> > > +
> > >
> > > Consider renaming this as 'subapplydelayms' to make the units perfectly
> clear.
> > Similar to the 5th comments, I can't find any examples for this.
> > I'd like to keep it general, which makes me feel it is more aligned
> > with existing codes.
Hi, thank you for sharing this info.

I searched the code for places where adding "ms"
at the end of the variable names would have clear merit.
Adding the units would help when calculating or converting time-related values.
In this patch there are only a couple of such functions, like maybe_delay_apply()
or, for conversion of time, parse_subscription_options.

I feel changing just a couple of structures might be awkward,
while changing all internal structures is too much. So, I keep the names
as they were, after some modifications shared in [1].
If you have any better idea, please let me know.


[1] - 
https://www.postgresql.org/message-id/TYCPR01MB83730C23CB7D29E57368BECDEDE29%40TYCPR01MB8373.jpnprd01.prod.outlook.com



Best Regards,
Takamichi Osumi



RE: Time delayed LR (WAS Re: logical replication restrictions)

2022-12-12 Thread Takamichi Osumi (Fujitsu)
On Tuesday, December 6, 2022 5:00 PM Peter Smith  wrote:
> Here are some review comments for patch v9-0001:
Hi, thank you for your reviews !

> 
> ==
> 
> GENERAL
> 
> 1. min_ prefix?
> 
> What's the significance of the "min_" prefix for this parameter? I'm guessing 
> the
> background is that at one time it was considered to be a GUC so took a name
> similar to GUC recovery_min_apply_delay (??)
> 
> But in practice, I think it is meaningless and/or misleading. For example,
> suppose the user wants to defer replication by 1hr. IMO it is more natural to
> just say "defer replication by 1 hr" (aka
> apply_delay='1hr') Clearly it means replication will take place about
> 1 hr into the future. OTOH saying "defer replication by a MINIMUM of 1 hr" 
> (aka
> min_apply_delay='1hr')  is quite vague because then it is equally valid if the
> replication gets delayed by 1 hr or 2 hrs or 5 days or 3 weeks since all of 
> those
> satisfy the minimum delay. The implementation could hardwire a delay of
> INT_MAX ms but clearly, that's not really what the user would expect.
> 
> ~
> 
> So, I think this parameter should be renamed just as 'apply_delay'.
> 
> But, if you still decide to keep it as 'min_apply_delay' then there is a lot 
> of other
> code that ought to be changed to be consistent with that name.
> e.g.
> - subapplydelay in catalogs.sgml --> subminapplydelay
> - subapplydelay in system_views.sql --> subminapplydelay
> - subapplydelay in pg_subscription.h --> subminapplydelay
> - subapplydelay in dump.h --> subminapplydelay
> - i_subapplydelay in pg_dump.c --> i_subminapplydelay
> - applydelay member name of Form_pg_subscription --> minapplydelay
> - "Apply Delay" for the column name displayed by describe.c --> "Min apply
> delay"
I followed the suggestion to keep the "min_" prefix in [1].
Fixed. 


> - more...
> 
> (IMO the fact that so much code does not currently say 'min' at all is just
> evidence that the 'min' prefix didn't really mean much in the first 
> place)
> 
> 
> ==
> 
> doc/src/sgml/catalogs.sgml
> 
> 2. Section 31.2 Subscription
> 
> +  
> +   Time delayed replica of subscription is available by indicating
> +   min_apply_delay. See
> +for details.
> +  
> 
> How about saying like:
> 
> SUGGESTION
> The subscriber replication can be instructed to lag behind the publisher side
> changes by specifying the min_apply_delay subscription
> parameter. See XXX for details.
Fixed.


> ==
> 
> doc/src/sgml/ref/create_subscription.sgml
> 
> 3. min_apply_delay
> 
> + 
> +  By default, subscriber applies changes as soon as possible. As with
> +  the physical replication feature
> +  (), it can be useful
> to
> +  have a time-delayed logical replica. This parameter allows you to
> +  delay the application of changes by a specified amount of time. If
> +  this value is specified without units, it is taken as milliseconds.
> +  The default is zero, adding no delay.
> + 
> 
> "subscriber applies" -> "the subscriber applies"
> 
> "allows you" -> "lets the user"
> 
> "The default is zero, adding no delay." -> "The default is zero (no delay)."
Fixed.


> ~
> 
> 4.
> 
> +  larger than the time deviations between servers. Note that
> +  in the case when this parameter is set to a long value, the
> +  replication may not continue if the replication slot falls behind 
> the
> +  current LSN by more than
> max_slot_wal_keep_size.
> +  See more details in .
> + 
> 
> 4a.
> SUGGESTION
> Note that if this parameter is set to a long delay, the replication will stop 
> if the
> replication slot falls behind the current LSN by more than
> max_slot_wal_keep_size.
Fixed.

> ~
> 
> 4b.
> When it is rendered (like below) it looks a bit repetitive:
> ... if the replication slot falls behind the current LSN by more than
> max_slot_wal_keep_size. See more details in max_slot_wal_keep_size.
Thanks! Fixed the redundancy.


> ~
> 
> IMO the previous sentence should include the link.
> 
> SUGGESTION
> if the replication slot falls behind the current LSN by more than  linkend =
> "guc-max-slot-wal-keep-size">max_slot_wal_keep_size k>.
Fixed.

> ~
> 
> 5.
> 
> +   
> +Synchronous replication is affected by this setting when
> +synchronous_commit is set to
> +remote_write; every COMMIT
> +will need to wait to be applied.
> +   
> 
> Yes, this deserves a big warning -- but I am just not quite sure of the 
> details. I
> think this impacts more than just "remote_rewrite" -- e.g. the same problem
> would happen if "synchronous_standby_names" is non-empty.
> 
> I think this warning needs to be more generic to cover everything.
> Maybe something like below
> 
> SUGGESTION:
> Delaying the replication can mean there is a much longer time between making
> a change on the publisher, and that change being committed on the subscriber.
> This can 

Re: Checksum errors in pg_stat_database

2022-12-12 Thread Drouvot, Bertrand




On 12/12/22 8:15 AM, Drouvot, Bertrand wrote:



On 12/12/22 5:09 AM, Michael Paquier wrote:

On Sun, Dec 11, 2022 at 08:48:15PM -0500, Tom Lane wrote:

I think a stats table indexed solely by relfilenode wouldn't be a great
idea, because you'd lose all the stats about a table as soon as you
do vacuum full/cluster/rewriting-alter-table/etc.  Can we create two
index structures over the same stats table entries, so you can look
up by either relfilenode or OID?  I'm not quite sure how to manage
rewrites, where you transiently have two relfilenodes for "the
same" table ... maybe we could allow multiple pointers to the same
stats entry??


FWIW, I am not sure that I would care much if we were to dropped the
stats associated to a relfilenode on a rewrite.  In terms of checksum
failures, tuples are deformed so if there is one checksum failure a
rewrite would just not happen.  The potential complexity is not really
appealing compared to the implementation simplicity and its gains, and
rewrites are lock-heavy so I'd like to think that people avoid them
(cough)..


Agreed that this is less "problematic" for the checksum use case.
On the other hand, losing IO stats (such as the ones we could add later on, 
as suggested by Andres up-thread) looks like more of a concern to me.



One option could be to have a dedicated PgStat_StatRelFileNodeEntry and 
populate the related PgStat_StatTabEntry when calling a new, to-be-created 
pgstat_relfilenode_flush_cb()? (That's what we are doing currently to
flush some of the table stats to the database stats for example).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-12 Thread John Naylor
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada 
wrote:
>
> On Fri, Dec 9, 2022 at 5:53 PM John Naylor 
wrote:
> >

> > I don't think that'd be very controversial, but I'm also not sure why
we'd need 4MB -- can you explain in more detail what exactly we'd need so
that the feature would work? (The minimum doesn't have to work *well* IIUC,
just do some useful work and not fail).
>
> The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
> the radix tree is using dsa_get_total_size(). If the size returned by
> dsa_get_total_size() (+ some memory used by TIDStore meta information)
> exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
> and heap vacuum. However, when allocating DSA memory for
> radix_tree_control at creation, we allocate 1MB
> (DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
> radix_tree_control from it. dsa_get_total_size() returns 1MB even if
> there is no TID collected.

2MB makes sense.

If the metadata is small, it seems counterproductive to count it towards
the total. We want the decision to be driven by blocks allocated. I have an
idea on that below.

> > Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
>
> Right, I think it's no problem in slab cases. In DSA cases, the new
> segment size follows a geometric series that approximately doubles the
> total storage each time we create a new segment. This behavior comes
> from the fact that the underlying DSM system isn't designed for large
> numbers of segments.

And taking a look, the size of a new segment can get quite large. It seems
we could test if the total DSA area allocated is greater than half of
maintenance_work_mem. If that parameter is a power of two (common) and
>=8MB, then the area will contain just under a power of two the last time
it passes the test. The next segment will bring it to about 3/4 full, like
this:

maintenance work mem = 256MB, so stop if we go over 128MB:

2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64MB = 190MB -> stop

That would be a simple way to be conservative with the memory limit. The
unfortunate aspect is that the last segment would be mostly wasted, but
it's paradise compared to the pessimistically-sized single array we have
now (even with Peter G.'s VM snapshot informing the allocation size, I
imagine).
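
For illustration, here is a tiny self-contained model of that arithmetic. It
assumes, as the series above implies, that new DSA segments double in size
every two segments; the numbers are purely illustrative.

#include <stdio.h>

int
main(void)
{
	int			limit_mb = 256;	/* maintenance_work_mem */
	int			total_mb = 0;
	int			seg_mb = 1;
	int			at_this_size = 0;

	/* keep allocating new segments until more than half the limit is used */
	while (total_mb <= limit_mb / 2)
	{
		total_mb += seg_mb;
		printf("new %2d MB segment -> total %3d MB\n", seg_mb, total_mb);
		if (++at_this_size == 2)
		{
			seg_mb *= 2;
			at_this_size = 0;
		}
	}
	printf("stop: %d MB is over half of %d MB\n", total_mb, limit_mb);
	return 0;
}

With limit_mb = 256 this reaches 126MB (keep going) and then 190MB (stop),
matching the numbers above.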

And as for minimum possible maintenance work mem, I think this would work
with 2MB, if the community is okay with technically going over the limit by
a few bytes of overhead if a buildfarm animal set to that value. I imagine
it would never go over the limit for realistic (and even most unrealistic)
values. Even with a VM snapshot page in memory and small local arrays of
TIDs, I think with this scheme we'll be well under the limit.

After this feature is complete, I think we should consider a follow-on
patch to get rid of vacuum_work_mem, since it would no longer be needed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: A problem about ParamPathInfo for an AppendPath

2022-12-12 Thread Richard Guo
On Tue, Dec 6, 2022 at 5:00 PM Richard Guo  wrote:

> As we can see, MemoizePath can be generated for partitioned AppendPath
> but not for union-all AppendPath.
>
> For the fix I think we can relax the check in create_append_path and
> always use get_baserel_parampathinfo if the parent is a baserel.
>

BTW, IIUC currently we don't generate any parameterized MergeAppend
paths, as explained in generate_orderedappend_paths.  So the codes that
gather information from a MergeAppend path's param_info for run-time
partition pruning in create_merge_append_plan seem unnecessary.

Attached is a patch for this change and the changes described upthread.

Thanks
Richard


v1-0001-Fix-ParamPathInfo-for-AppendPath.patch
Description: Binary data


Re: refactor ExecGrant_*() functions

2022-12-12 Thread Antonin Houska
Peter Eisentraut  wrote:

> On 06.12.22 09:41, Antonin Houska wrote:
> > Attached are my proposals for improvements. One is to avoid memory leak, the
> > other tries to improve readability a little bit.
> 
> I added the readability improvement to my v2 patch.  The pfree() calls aren't
> necessary AFAICT.

I see that memory contexts exist and that the amount of memory freed is not
huge, but my style is to free the memory explicitly if it's allocated in a
loop.

v2 looks good to me.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Checksum errors in pg_stat_database

2022-12-12 Thread Magnus Hagander
On Mon, Dec 12, 2022 at 12:40 AM Michael Paquier 
wrote:

> On Sun, Dec 11, 2022 at 09:18:42PM +0100, Magnus Hagander wrote:
> > It would be less of a concern yes, but I think it still would be a
> concern.
> > If you have a large amount of corruption you could quickly get to
> millions
> > of rows to keep track of which would definitely be a problem in shared
> > memory as well, wouldn't it?
>
> Yes.  I have discussed this item with Bertrand off-list and I share
> the same concern.  This would lead to a lot of extra workload on a
> large seqscan for a corrupted relation when the stats are written
> (shutdown delay) while bloating shared memory with potentially
> millions of items even if variable lists are handled through a dshash
> and DSM.
>
> > But perhaps we could keep a list of "the last 100 checksum failures" or
> > something like that?
>
> Applying a threshold is one solution.  Now, a second thing I have seen
> in the past is that some disk partitions were busted but not others,
> and the current database-level counters are not enough to make a
> difference when it comes to grab patterns in this area.  A list of the
> last N failures may be able to show some pattern, but that would be
> like analyzing things with a lot of noise without a clear conclusion.

> Anyway, the workload caused by the threshold number had better be
> measured before being decided (large set of relation files with a full
> range of blocks corrupted, much better if these are in the OS cache
> when scanned), which does not change the need of a benchmark.
>
> What about just adding a counter tracking the number of checksum
> failures for relfilenodes in a new structure related to them (note
> that I did not write PgStat_StatTabEntry)?
>
> If we do that, it is then possible to cross-check the failures with
> tablespaces, which would point to disk areas that are more sensitive
> to corruption.
>

If that's the concern, then perhaps the level we should be tracking things
on is tablespace? We don't have any stats per tablespace today I believe,
but that doesn't mean we couldn't create that.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/