Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-17 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.

While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)
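
To make the layout concrete, here is a rough sketch of what I mean (the type
and member names here are illustrative only, not necessarily what the patch
uses):

/* Illustrative sketch only -- names differ from the actual patch. */
typedef struct rt_node_base_32
{
	uint8		kind;			/* still only four kinds to branch on */
	uint8		fanout;			/* capacity of this size class: 15 or 32 */
	uint16		count;			/* number of children currently stored */
	uint8		chunks[32];		/* control data is identical for both classes */
} rt_node_base_32;

typedef struct rt_node_inner_32
{
	rt_node_base_32 base;

	/* 15 or 32 slots, depending on the size class chosen at allocation */
	rt_node    *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;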

One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).
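
For reference, the shape of code I want to be able to write when growing a
node within the same kind is roughly the following (a sketch only -- the
helper name, the slab-context array, and rt_size_class_info[] are
placeholders, not the patch's actual API):

/* Sketch: grow a node to the next size class of the same kind.
 * Slab has no repalloc(), so allocate/copy/free by hand; note that no
 * zeroing is done -- the caller decides whether that is needed. */
static rt_node *
rt_grow_node(radix_tree *tree, rt_node *node,
			 rt_size_class old_class, rt_size_class new_class)
{
	rt_node    *newnode;

	/* plain allocation from the target size class's slab context ... */
	newnode = (rt_node *) MemoryContextAlloc(tree->slab_ctx[new_class],
											 rt_size_class_info[new_class].size);

	/* ... copy the old node wholesale, control data and children alike ... */
	memcpy(newnode, node, rt_size_class_info[old_class].size);
	newnode->fanout = rt_size_class_info[new_class].fanout;

	/* ... and free the old node back to its slab */
	pfree(node);
	return newnode;
}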

Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.
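
i.e. something along these lines (untested sketch; whether the compiler
actually wins anything here is exactly what needs measuring, and the names
are again placeholders):

/* Sketch: plain allocation, no zeroing here ... */
static inline rt_node *
rt_alloc_node(radix_tree *tree, rt_size_class size_class)
{
	return (rt_node *) MemoryContextAlloc(tree->slab_ctx[size_class],
										  rt_size_class_info[size_class].size);
}

/* ... so a caller that knows the concrete type can zero it with a
 * constant size the compiler can see: */
node4 = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL);
memset(node4, 0, sizeof(rt_node_inner_4));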

--
John Naylor
EDB: http://www.enterprisedb.com
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocate a larger area and memcpy the entire old node
to it.

NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
RT_CLASS_4_FULL = 0,
+   RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16  count;
 
 	/* Max number of children. We can use uint8 because we never need to store 256 */
+	/* WIP: if we don't have a variable sized node4, this should instead be in the base
+	   types as needed, since saving every byte is crucial for the smallest node kind */
uint8   fanout;
 
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->base.n.fanout)
 
 /* Base type of each node kinds for leaf and inner nodes */
+/* The base 

Re: contrib: auth_delay module

2022-11-17 Thread Julien Rouhaud
On Thu, Nov 17, 2022 at 05:37:51PM -0500, Tom Lane wrote:
> 成之焕  writes:
> > The attached patch is a contrib module to set login restrictions on users
> > with too many authentication failures. The administrator could manage
> > several GUC parameters to control the login restrictions, which are listed
> > below.
> > - set the wait time when password authentication fails.
> > - allow the wait time to grow when users from the same IP consecutively
> > fail to log on.
> > - set the maximum number of authentication failures from the same user.
> > The system will prevent a user who gets too many authentication failures
> > from entering the database.
>
> I'm not yet forming an opinion on whether this is useful enough
> to accept.

I'm not sure that doing that on the backend side is really a great idea, as an
attacker will still be able to exhaust the available connection slots.

If your instance is reachable from some untrusted network (which already sounds
scary), it's much easier to simply configure something like fail2ban to provide
the same feature in a more efficient way.  You can even block access to other
services while you're at it.

Note that there's also an extension to log failed connection attempts to an
alternate file with a fixed, simple format if you're worried about your regular
logs being too verbose.




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Peter Smith
Here are some review comments for v47-0001

(This review is a WIP - I will post more comments for this patch next week)

==

.../replication/logical/applyparallelworker.c

1.


+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION src/backend/replication/logical/applyparallelworker.c
+ *

This IDENTIFICATION should be on 2 lines, as it previously was, instead of
wrapped into one line, for consistency with all the other file headers.

~~~

2. File header comment

+ * Since the database structure (schema of subscription tables, etc.) of
+ * publisher and subscriber may be different.

Incomplete sentence?

~~~

3.

+ * When the following two scenarios occur, a deadlock occurs.

Actually, you described three scenarios in this comment. Not two.

SUGGESTION
The following scenarios can cause a deadlock.

~~~

4.

+ * LA (waiting to acquire the local transaction lock) -> PA1 (waiting to
+ * acquire the lock on the unique index) -> PA2 (waiting to acquire the lock on
+ * the remote transaction) -> LA

"PA1" -> "PA-1"
"PA2" -> "PA-2"

~~~

5.

+ * To resolve this issue, we use non-blocking write and wait with a timeout. If
+ * timeout is exceeded, the LA report an error and restart logical replication.

"report" --> "reports"
"restart" -> "restarts"

OR

"LA report" -> "LA will report"

~~~

6. pa_wait_for_xact_state

+/*
+ * Wait until the parallel apply worker's transaction state reach or exceed the
+ * given xact_state.
+ */
+static void
+pa_wait_for_xact_state(ParallelApplyWorkerShared *wshared,
+ParallelTransState xact_state)

"reach or exceed" -> "reaches or exceeds"

~~~

7. pa_stream_abort

+ /*
+ * Although the lock can be automatically released during transaction
+ * rollback, but we still release the lock here as we may not in a
+ * transaction.
+ */
+ pa_unlock_transaction(xid, AccessShareLock);

"but we still" -> "we still"
"we may not in a" -> "we may not be in a"

~~~

8.

+ pa_savepoint_name(MySubscription->oid, subxid, spname,
+   sizeof(spname));
+

Unnecessary wrapping

~~~

9.

+ for (i = list_length(subxactlist) - 1; i >= 0; i--)
+ {
+ TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));
+
+ if (xid_tmp == subxid)
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RollbackToSavepoint(spname);
+ CommitTransactionCommand();
+ subxactlist = list_truncate(subxactlist, i + 1);
+ }

This code logic does not seem to require the 'found' flag. You can do
the RollbackToSavepoint/CommitTransactionCommand/list_truncate before
the break.
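
i.e. something like (untested):

for (i = list_length(subxactlist) - 1; i >= 0; i--)
{
	TransactionId xid_tmp = lfirst_xid(list_nth_cell(subxactlist, i));

	if (xid_tmp == subxid)
	{
		RollbackToSavepoint(spname);
		CommitTransactionCommand();
		subxactlist = list_truncate(subxactlist, i + 1);
		break;
	}
}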

~~~

10. pa_lock/unlock _stream/_transaction

+/*
+ * Helper functions to acquire and release a lock for each stream block.
+ *
+ * Set locktag_field4 to 0 to indicate that it's a stream lock.
+ */

+/*
+ * Helper functions to acquire and release a lock for each local transaction.
+ *
+ * Set locktag_field4 to 1 to indicate that it's a transaction lock.

Should constants/defines/enums replace those magic numbers 0 and 1?
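
For example (the names below are only suggestions, invented to illustrate):

/* possible replacements for the bare 0 and 1 used in locktag_field4 */
#define PARALLEL_APPLY_LOCK_STREAM	0	/* lock per stream block */
#define PARALLEL_APPLY_LOCK_XACT	1	/* lock per local transaction */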

~~~

11. pa_lock_transaction

+ * Note that all the callers are passing remote transaction ID instead of local
+ * transaction ID as xid. This is because the local transaction ID will only be
+ * assigned while applying the first change in the parallel apply, but it's
+ * possible that the first change in parallel apply worker is blocked by a
+ * concurrently executing transaction in another parallel apply worker causing
+ * the leader cannot get local transaction ID.

"causing the leader cannot" -> "which means the leader cannot" (??)

==

src/backend/replication/logical/worker.c

12. TransApplyAction

+/*
+ * What action to take for the transaction.
+ *
+ * TRANS_LEADER_APPLY:
+ * The action means that we are in the leader apply worker and changes of the
+ * transaction are applied directly in the worker.
+ *
+ * TRANS_LEADER_SERIALIZE:
+ * It means that we are in the leader apply worker or table sync worker.
+ * Changes are written to temporary files and then applied when the final
+ * commit arrives.
+ *
+ * TRANS_LEADER_SEND_TO_PARALLEL:
+ * The action means that we are in the leader apply worker and need to send the
+ * changes to the parallel apply worker.
+ *
+ * TRANS_PARALLEL_APPLY:
+ * The action that we are in the parallel apply worker and changes of the
+ * transaction are applied directly in the worker.
+ */
+typedef enum

12a.
There are too many different ways of saying the same thing:

"The action means that we..."
"It means that we..."
"The action that we..." (typo?)

Please word all of these comments consistently.

~

12b.
"directly in the worker" -> "directly by the worker" (??) 2x

~~~

13. get_worker_name

+/*
+ * Return the name of the logical replication worker.
+ */
+static const char *
+get_worker_name(void)
+{
+ if (am_tablesync_worker())
+ return _("logical replication table synchronization worker");
+ else if (am_parallel_apply_worker())
+ return _("logical replication parallel apply worker");
+ else
+ return _("logical replication apply worker");
+}

This function belongs nearer the top 

Re: Avoid double lookup in pgstat_fetch_stat_tabentry()

2022-11-17 Thread Bharath Rupireddy
On Fri, Nov 18, 2022 at 10:32 AM Drouvot, Bertrand
 wrote:
>
> Hi hackers,
>
> Please find attached a patch proposal to avoid 2 calls to
> pgstat_fetch_stat_tabentry_ext() in pgstat_fetch_stat_tabentry() in case
> the relation is not a shared one and no statistics are found.
>
> Thanks Andres for the suggestion done in [1].
>
> [1]:
> https://www.postgresql.org/message-id/20221116201202.3k74ajawyom2c3eq%40awork3.anarazel.de

+1. The patch LGTM. However, I have a suggestion to simplify it
further by getting rid of the local variable tabentry and just
returning pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid),
relid);. Furthermore, pgstat_fetch_stat_tabentry() can just be a
static inline function.
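
i.e. roughly the following (a sketch of the suggestion, assuming the
definition moves to the header so it can be inlined):

/* Sketch of the simplified version: */
static inline PgStat_StatTabEntry *
pgstat_fetch_stat_tabentry(Oid relid)
{
	return pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid), relid);
}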

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-17 Thread Amit Kapila
On Thu, Nov 17, 2022 at 11:15 PM Andres Freund  wrote:
>
> On 2022-11-17 10:44:18 +0530, Amit Kapila wrote:
> > On Wed, Nov 16, 2022 at 11:56 PM Andres Freund  wrote:
> > > On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> > > > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund  
> > > > wrote:
> > > > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  
> > > > > > wrote:
> > > I don't think that'd catch a catalog snapshot. But perhaps the better 
> > > answer
> > > for the catalog snapshot is to just invalidate it explicitly. The user 
> > > doesn't
> > > have control over the catalog snapshot being taken, and it's not too hard 
> > > to
> > > imagine the walsender code triggering one somewhere.
> > >
> > > So maybe we should add something like:
> > >
> > > InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
> > >
> >
> > The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
> > We already have a check (/* so we don't overwrite the existing value
> > */
> > if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
> > which ensures that we don't overwrite MyProc->xmin, so the above
> > comment seems contradictory to me.
>
> The point is that catalog snapshots could easily end up setting MyProc->xmin,
> even though the caller hasn't done anything wrong. So the
> InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios.
>

Okay, updated the patch accordingly.

-- 
With Regards,
Amit Kapila.


v3-0001-Add-additional-checks-while-creating-the-initial-.patch
Description: Binary data


Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
Richard Guo  writes:
> Yes, it is.  Using zero flag would short-cut get_attstatsslot() to just
> return whether the slot type exists without loading it.  Do you think we
> need to emphasize this use case in the comments for 'flags'?

Perhaps, it's not really obvious now.

> I wonder whether we need to also check statistic_proc_security_check()
> when determining if MCVs exists in both sides.

Yeah, I thought about hoisting the statistic_proc_security_check
tests up into get_mcv_stats.  I don't think it's a great idea
though.  Again, it'd complicate untangling this if we ever
generalize the use of MCVs in this function.  Also, I don't
think we should be micro-optimizing the case where the security
check doesn't pass --- if it doesn't, you're going to be hurting
from bad plans a lot more than you are from some wasted cycles
here.

regards, tom lane




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Amit Kapila
On Fri, Nov 18, 2022 at 10:31 AM Masahiko Sawada  wrote:
>
> On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila  wrote:
> >
> > On Fri, Nov 18, 2022 at 8:01 AM Peter Smith  wrote:
> > >
> > > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > ...
> > > > ---
> > > > The streaming parameter has the new value "parallel" for "streaming"
> > > > option to enable the parallel apply. It fits so far but I think the
> > > > parallel apply feature doesn't necessarily need to be tied up with
> > > > streaming replication. For example, we might want to support parallel
> > > > apply also for non-streaming transactions in the future. It might be
> > > > better to have another option, say "parallel", to control parallel
> > > > apply behavior. The "parallel" option can be a boolean option and
> > > > setting parallel = on requires streaming = on.
> > > >
> >
> > If we do that then how will the user be able to use streaming
> > serialize mode (write to file for streaming transactions) as we have
> > now? Because after we introduce parallelism for non-streaming
> > transactions, the user would want parallel = on irrespective of the
> > streaming mode. Also, users may wish to only parallelize large
> > transactions because of additional overhead for non-streaming
> > transactions for transaction dependency tracking, etc. So, the user
> > may wish to have a separate knob for large transactions as the patch
> > has now.
>
> One idea for that would be to make it an enum. For example, setting
> parallel = "streaming" works for that.
>

Yeah, but then we will have two different parameters (parallel and
streaming) to control streaming behavior. This will be confusing: say
the user sets parallel = 'streaming' and streaming = off; we would
probably need to disallow such settings, but I am not sure that would
be any better than allowing parallelism for large xacts via the
streaming parameter.

-- 
With Regards,
Amit Kapila.




Avoid double lookup in pgstat_fetch_stat_tabentry()

2022-11-17 Thread Drouvot, Bertrand

Hi hackers,

Please find attached a patch proposal to avoid 2 calls to 
pgstat_fetch_stat_tabentry_ext() in pgstat_fetch_stat_tabentry() in case 
the relation is not a shared one and no statistics are found.


Thanks Andres for the suggestion done in [1].

[1]: 
https://www.postgresql.org/message-id/20221116201202.3k74ajawyom2c3eq%40awork3.anarazel.de


Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 55a355f583..a2a4fc6e2c 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -25,6 +25,7 @@
 #include "utils/pgstat_internal.h"
 #include "utils/rel.h"
 #include "utils/timestamp.h"
+#include "catalog/catalog.h"
 
 
 /* Record that's written to 2PC state file when pgstat state is persisted */
@@ -439,14 +440,8 @@ pgstat_fetch_stat_tabentry(Oid relid)
 {
PgStat_StatTabEntry *tabentry;
 
-   tabentry = pgstat_fetch_stat_tabentry_ext(false, relid);
-   if (tabentry != NULL)
-   return tabentry;
+	tabentry = pgstat_fetch_stat_tabentry_ext(IsSharedRelation(relid),
+											  relid);
 
-   /*
-* If we didn't find it, maybe it's a shared table.
-*/
-   tabentry = pgstat_fetch_stat_tabentry_ext(true, relid);
return tabentry;
 }
 


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
On Fri, Nov 18, 2022 at 1:47 PM Amit Kapila  wrote:
>
> On Fri, Nov 18, 2022 at 8:01 AM Peter Smith  wrote:
> >
> > On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  
> > wrote:
> > >
> > ...
> > > ---
> > > The streaming parameter has the new value "parallel" for "streaming"
> > > option to enable the parallel apply. It fits so far but I think the
> > > parallel apply feature doesn't necessarily need to be tied up with
> > > streaming replication. For example, we might want to support parallel
> > > apply also for non-streaming transactions in the future. It might be
> > > better to have another option, say "parallel", to control parallel
> > > apply behavior. The "parallel" option can be a boolean option and
> > > setting parallel = on requires streaming = on.
> > >
>
> If we do that then how will the user be able to use streaming
> serialize mode (write to file for streaming transactions) as we have
> now? Because after we introduce parallelism for non-streaming
> transactions, the user would want parallel = on irrespective of the
> streaming mode. Also, users may wish to only parallelize large
> transactions because of additional overhead for non-streaming
> transactions for transaction dependency tracking, etc. So, the user
> may wish to have a separate knob for large transactions as the patch
> has now.

One idea for that would be to make it an enum. For example, setting
parallel = "streaming" works for that.

>
> >
> > FWIW, I tend to agree with this idea but for a different reason. In
> > this patch, the 'streaming' parameter had become a kind of hybrid
> > boolean/enum. AFAIK there are no other parameters anywhere that use a
> > hybrid pattern like this so I was thinking it may be better not to be
> > different.
> >
>
> I think we have a similar pattern for GUC parameters like
> constraint_exclusion (see constraint_exclusion_options),
> backslash_quote (see backslash_quote_options), etc.

Right. The vacuum_index_cleanup and buffering storage parameters (which
accept 'on', 'off', or 'auto') are other examples.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Amit Kapila
On Fri, Nov 18, 2022 at 8:01 AM Peter Smith  wrote:
>
> On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  
> wrote:
> >
> ...
> > ---
> > The streaming parameter has the new value "parallel" for "streaming"
> > option to enable the parallel apply. It fits so far but I think the
> > parallel apply feature doesn't necessarily need to be tied up with
> > streaming replication. For example, we might want to support parallel
> > apply also for non-streaming transactions in the future. It might be
> > better to have another option, say "parallel", to control parallel
> > apply behavior. The "parallel" option can be a boolean option and
> > setting parallel = on requires streaming = on.
> >

If we do that then how will the user be able to use streaming
serialize mode (write to file for streaming transactions) as we have
now? Because after we introduce parallelism for non-streaming
transactions, the user would want parallel = on irrespective of the
streaming mode. Also, users may wish to only parallelize large
transactions because of additional overhead for non-streaming
transactions for transaction dependency tracking, etc. So, the user
may wish to have a separate knob for large transactions as the patch
has now.

>
> FWIW, I tend to agree with this idea but for a different reason. In
> this patch, the 'streaming' parameter had become a kind of hybrid
> boolean/enum. AFAIK there are no other parameters anywhere that use a
> hybrid pattern like this so I was thinking it may be better not to be
> different.
>

I think we have a similar pattern for GUC parameters like
constraint_exclusion (see constraint_exclusion_options),
backslash_quote (see backslash_quote_options), etc.

-- 
With Regards,
Amit Kapila.




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Richard Guo
On Fri, Nov 18, 2022 at 9:36 AM Tom Lane  wrote:

> Actually, looking at get_attstatslot, I realize it was already designed
> to do that -- just pass zero for flags.  So we could do it as attached.


Yes, it is.  Using a zero flag would short-cut get_attstatsslot() to just
return whether the slot type exists without loading it.  Do you think we
need to emphasize this use case in the comments for 'flags'?  It seems
there is currently no such use case in the code on HEAD.

I wonder whether we need to also check statistic_proc_security_check()
when determining if MCVs exists in both sides.

Thanks
Richard


Re: Strange failure on mamba

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 17:47:50 -0500, Tom Lane wrote:
> Yeah, that or some other NetBSD bug could be the explanation, too.
> Without a stack trace it's hard to have any confidence about it,
> but I've been unable to reproduce the problem outside the buildfarm.
> (Which is a familiar refrain.  I wonder what it is about the buildfarm
> environment that makes it act different from the exact same code running
> on the exact same machine.)
> 
> So I'd like to have some way to make the postmaster send SIGABRT instead
> of SIGKILL in the buildfarm environment.  The lowest-tech way would be
> to drive that off some #define or other.  We could scale it up to a GUC
> perhaps.  Adjacent to that, I also wonder whether SIGABRT wouldn't be
> more useful than SIGSTOP for the existing SendStop half-a-feature ---
> the idea that people should collect cores manually seems mighty
> last-century.

I suspect that having a GUC would be a good idea. I needed something similar
recently, debugging an occasional hang in the AIO patchset. I first tried
something like your #define approach and it did cause a problematic flood of
core files.

I ended up using libbacktrace to generate useful backtraces (vs what
backtrace_symbols() generates) when receiving SIGQUIT. I didn't do the legwork
to make it properly signal safe, but it'd be doable afaiu.

Greetings,

Andres Freund




Re: logical decoding and replication of sequences, take 2

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 22:13:23 +0100, Tomas Vondra wrote:
> On 11/17/22 18:07, Andres Freund wrote:
> > On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> >> On 11/17/22 03:43, Andres Freund wrote:
> >>> I assume that part of the initial sync would have to be a new sequence
> >>> synchronization step that reads all the sequence states on the publisher 
> >>> and
> >>> ensures that the subscriber sequences are at the same point. There's a 
> >>> bit of
> >>> trickiness there, but it seems entirely doable. The logical replication 
> >>> replay
> >>> support for sequences will have to be a bit careful about not decreasing 
> >>> the
> >>> subscriber's sequence values - the standby initially will be ahead of the
> >>> increments we'll see in the WAL. But that seems inevitable given the
> >>> non-transactional nature of sequences.
> >>>
> >>
> >> See fetch_sequence_data / copy_sequence in the patch. The bit about
> >> ensuring the sequence does not go away (say, using page LSN and/or LSN
> >> of the increment) is not there, however isn't that pretty much what I
> >> proposed doing for "reconciling" the sequence state logged at COMMIT?
> >
> > Well, I think the approach of logging all sequence increments at commit is 
> > the
> > wrong idea...
> >
>
> But we're not logging all sequence increments, no?

I was imprecise - I meant streaming them out at commit.



> Yeah, I think you're right. I looked at this again, with fresh mind, and
> I came to the same conclusion. Roughly what the attached patch does.

To me it seems a bit nicer to keep the SnapBuildGetOrBuildSnapshot() call in
decode.c instead of moving it to reorderbuffer.c. Perhaps we should add a
snapbuild.c helper similar to SnapBuildProcessChange() for non-transactional
changes that also gets a snapshot?

Could look something like

Snapshot snapshot = NULL;

if (message->transactional &&
!SnapBuildProcessChange(builder, xid, buf->origptr))
return;
else if (!SnapBuildProcessStateNonTx(builder, &snapshot))
return;

...

Or perhaps we should just bite the bullet and add an argument to
SnapBuildProcessChange to deal with that?

Greetings,

Andres Freund




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Peter Smith
On Fri, Nov 18, 2022 at 11:36 AM Masahiko Sawada  wrote:
>
...
> ---
> The streaming parameter has the new value "parallel" for "streaming"
> option to enable the parallel apply. It fits so far but I think the
> parallel apply feature doesn't necessarily need to be tied up with
> streaming replication. For example, we might want to support parallel
> apply also for non-streaming transactions in the future. It might be
> better to have another option, say "parallel", to control parallel
> apply behavior. The "parallel" option can be a boolean option and
> setting parallel = on requires streaming = on.
>

FWIW, I tend to agree with this idea but for a different reason. In
this patch, the 'streaming' parameter had become a kind of hybrid
boolean/enum. AFAIK there are no other parameters anywhere that use a
hybrid pattern like this so I was thinking it may be better not to be
different.

But I didn't think that parallel_apply=on should *require*
streaming=on. It might be better for parallel_apply=on to just be the
*default*, even though it simply achieves nothing unless streaming=on
too. That way users would not need to change anything at all to get the
benefits of parallel streaming.

> Another variant is to have a new subscription parameter for example
> "parallel_workers" parameter that specifies the number of parallel
> workers. That way, users can specify the number of parallel workers
> per subscription.
>

--
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: Fix some newly modified tab-complete changes

2022-11-17 Thread Michael Paquier
On Wed, Nov 16, 2022 at 08:29:24AM +, shiy.f...@fujitsu.com wrote:
> I have fixed the problems you saw, and improved the patch as you suggested.
> 
> Besides, I noticed that the tab completion for "ALTER DEFAULT PRIVILEGES ...
> GRANT/REVOKE ..." missed "CREATE". Fix it in 0001 patch.
> 
> And commit e3ce2de09 supported GRANT ... WITH INHERIT ..., but there's no tab
> completion for it. Add this in 0002 patch.

Thanks, I have been looking at the patch, and pondered all the bloat
added by the handling of PRIVILEGES, only to note at the end that ALL
PRIVILEGES is parsed the same way as ALL.  So we don't actually need
any of the complications related to it, and the result would be the
same.

I have merged 0001 and 0002 together, and applied the rest, which
looked rather fine.  I have also simplified a bit the parts where
"REVOKE GRANT" is specified in a row, to avoid fancy results in some
branches when we apply Privilege_options_of_grant_and_revoke.
--
Michael


signature.asc
Description: PGP signature


Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Amit Kapila
On Wed, Nov 16, 2022 at 1:50 PM houzj.f...@fujitsu.com
 wrote:
>
> On Tuesday, November 15, 2022 7:58 PM houzj.f...@fujitsu.com 
>  wrote:
>
> I noticed that I didn't add CHECK_FOR_INTERRUPTS while retrying send message.
> So, attach the new version which adds that. Also attach the 0004 patch that
> restarts logical replication with temporarily disabling the parallel apply if
> failed to apply a transaction in parallel apply worker.
>

Few comments on v48-0001
==
1. The variable name pending_message_count seems to indicate a number
of pending messages, but normally it counts pending start/stop streams,
except probably for the rollback-to-savepoint case. Shall we rename it
to pending_stream_count and change the comments accordingly?

2. The variable name abort_toplevel_transaction seems unnecessarily
long. Shall we rename it to toplevel_xact or something like that?

3.
+ /*
+ * Increment the number of messages waiting to be processed by
+ * parallel apply worker.
+ */
+ if (!abort_toplevel_transaction)
+ pg_atomic_add_fetch_u32(&(winfo->shared->pending_message_count), 1);
+ else
+ pa_unlock_stream(xid, AccessExclusiveLock);

It would be better to explain here why different actions are required
for a subtransaction versus a top-level transaction, rather than the
current comment.

4.
+
+ if (abort_toplevel_transaction)
+ {
+ (void) pa_free_worker(winfo, xid);
+ }

{} is not required here.

5.
/*
+ * Although the lock can be automatically released during transaction
+ * rollback, but we still release the lock here as we may not in a
+ * transaction.
+ */
+ pa_unlock_transaction(xid, AccessShareLock);
+

It is better to explain for which case (I think it is for empty xacts)
it will be useful to release it explicitly.

6.
+ *
+ * XXX We can avoid sending pairs of the START/STOP messages to the parallel
+ * worker because unlike apply worker it will process only one transaction at a
+ * time. However, it is not clear whether any optimization is worthwhile
+ * because these messages are sent only when the logical_decoding_work_mem
+ * threshold is exceeded.
  */
 static void
 apply_handle_stream_start(StringInfo s)

I think this comment is no longer valid as now we need to wait for the
next stream at stream_stop message and also need to acquire the lock
in stream_start message. So, I think it is better to remove it unless
I am missing something.

7. I am able to compile applyparallelworker.c after commenting out a few
of the header includes. Please check if those are really required.
#include "libpq/pqformat.h"
#include "libpq/pqmq.h"
//#include "mb/pg_wchar.h"
#include "pgstat.h"
#include "postmaster/interrupt.h"
#include "replication/logicallauncher.h"
//#include "replication/logicalworker.h"
#include "replication/origin.h"
//#include "replication/walreceiver.h"
#include "replication/worker_internal.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
//#include "storage/procarray.h"
#include "tcop/tcopprot.h"
#include "utils/inval.h"
#include "utils/memutils.h"
//#include "utils/resowner.h"
#include "utils/syscache.h"

8.
+/*
+ * Is there a message sent by parallel apply worker which we need to receive?
+ */
+volatile sig_atomic_t ParallelApplyMessagePending = false;

This comment and variable are placed in applyparallelworker.c, so 'we'
in the above sentence is not clear. I think you need to use leader
apply worker instead.

9.
+static ParallelApplyWorkerInfo *pa_get_free_worker(void);

Will it be better if we name this function pa_get_available_worker()?

-- 
With Regards,
Amit Kapila.




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
I wrote:
> Tomas Vondra  writes:
>> Or perhaps what if we have a function that quickly determines if the
>> attribute has MCV, without loading it? I'd bet the expensive part of
>> get_attstatslot() is the deconstruct_array().
>> We could have a function that only does the first small loop over slots,
>> and returns true/false if we have a slot of the requested stakind.

> Yeah, I like this idea.

Actually, looking at get_attstatslot, I realize it was already designed
to do that -- just pass zero for flags.  So we could do it as attached.

We could make some consequent simplifications by only retaining one
"have_mcvs" flag, but I'm inclined to leave the rest of the code as-is.
We would not get much gain from that, and it would make this harder
to undo if there ever is a reason to consider just one set of MCVs.

regards, tom lane

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index d597b7e81f..e0aeaa6909 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2261,6 +2261,7 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	Form_pg_statistic stats2 = NULL;
 	bool		have_mcvs1 = false;
 	bool		have_mcvs2 = false;
+	bool		get_mcv_stats;
 	bool		join_is_reversed;
 	RelOptInfo *inner_rel;
 
@@ -2275,11 +2276,25 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	memset(&sslot1, 0, sizeof(sslot1));
 	memset(&sslot2, 0, sizeof(sslot2));
 
+	/*
+	 * There is no use in fetching one side's MCVs if we lack MCVs for the
+	 * other side, so do a quick check to verify that both stats exist.
+	 */
+	get_mcv_stats = (HeapTupleIsValid(vardata1.statsTuple) &&
+					 HeapTupleIsValid(vardata2.statsTuple) &&
+					 get_attstatsslot(&sslot1, vardata1.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid,
+									  0) &&
+					 get_attstatsslot(&sslot2, vardata2.statsTuple,
+									  STATISTIC_KIND_MCV, InvalidOid,
+									  0));
+
 	if (HeapTupleIsValid(vardata1.statsTuple))
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats1 = (Form_pg_statistic) GETSTRUCT(vardata1.statsTuple);
-		if (statistic_proc_security_check(&vardata1, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(&vardata1, opfuncoid))
 			have_mcvs1 = get_attstatsslot(&sslot1, vardata1.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);
@@ -2289,7 +2304,8 @@ eqjoinsel(PG_FUNCTION_ARGS)
 	{
 		/* note we allow use of nullfrac regardless of security check */
 		stats2 = (Form_pg_statistic) GETSTRUCT(vardata2.statsTuple);
-		if (statistic_proc_security_check(&vardata2, opfuncoid))
+		if (get_mcv_stats &&
+			statistic_proc_security_check(&vardata2, opfuncoid))
 			have_mcvs2 = get_attstatsslot(&sslot2, vardata2.statsTuple,
 										  STATISTIC_KIND_MCV, InvalidOid,
 										  ATTSTATSSLOT_VALUES | ATTSTATSSLOT_NUMBERS);


Re: Getting rid of SQLValueFunction

2022-11-17 Thread Michael Paquier
On Tue, Oct 25, 2022 at 02:20:12PM +0900, Michael Paquier wrote:
> Attached is a rebased patch set, as of the conflicts from 2e0d80c.

So, this patch set has been sitting in the CF app for a few weeks now,
and I would like to apply them to remove a bit of code from the
executor.

Please note that in order to avoid tweaks when choosing the attribute
name of function call, this needs a total of 8 new catalog functions
mapping to the SQL keywords, which is what the test added by 2e0d80c
is about:
- current_role
- user
- current_catalog
- current_date
- current_time
- current_timestamp
- localtime
- localtimestamp

Any objections?
--
Michael


signature.asc
Description: PGP signature


Re: libpq compression (part 2)

2022-11-17 Thread Andrey Borodin
On Thu, Nov 17, 2022 at 7:09 AM Peter Eisentraut
 wrote:
>
> Note that the above code was just changed in dce92e59b1.
Thanks!

> I don't know
> how that affects this patch set.
With dce92e59b1 it would be much easier to find a bug in the compression patch.

Some more notes about the patch. (sorry for posting review notes in so
many different messages)

1. zs_is_valid_impl_id(unsigned int id)
{
return id >= 0 && id < lengthof(zs_algorithms);
}

id is unsigned, no need to check it's non-negative.

2. This literal
{no_compression_name}
should be replaced by explicit form
{no_compression_name, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL}

3. Comments like "Return true if should, false if should not." are useless.

Thanks!

Best regards, Andrey Borodin.




Re: Typo in SH_LOOKUP and SH_LOOKUP_HASH comments

2022-11-17 Thread vignesh C
On Thu, 17 Nov 2022 at 17:48, Daniel Gustafsson  wrote:
>
> > On 17 Nov 2022, at 10:56, vignesh C  wrote:
>
> > When I was reviewing one of the patches, I found a typo in the
> > comments of SH_LOOKUP and SH_LOOKUP_HASH. I felt "lookup up" should
> > have been "lookup".
> > Attached a patch to modify it.
>
> I agree with that, applied to HEAD. Thanks!

Thanks for pushing this change.

Regards,
Vignesh




Re: Reducing power consumption on idle servers

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 13:06:23 +0530, Bharath Rupireddy wrote:
> I understand. I know it's a bit hard to measure the power savings, I'm
> wondering if there's any info, maybe not necessarily related to
> postgres, but in general how much power gets saved if a certain number
> of waits/polls/system calls are reduced.

It's heavily hardware and hardware settings dependent.

On some systems you can get an *approximate* idea of the power usage even
while plugged in. On others you can only see it while on battery power.

On systems with RAPL support (most Intel and I think newer AMD CPUs) you can
query power usage with:
  powerstat -R 1 (adding -D provides a bit more detail)

But note that RAPL typically severely undercounts power usage because it
doesn't cover a lot of sources of power usage (display, sometimes memory,
all other peripherals).

Sometimes powerstat -R -D can split power usage up more granularly,
e.g. between the different CPU sockets and memory.


On laptops you can often measure the discharge rate when not plugged in, with
powerstat -d 0 1. But IME the latency till the values update makes it harder
to interpret.

On some workstations / servers you can read the power usage via ACPI. E.g. on
my workstation 'sensors' shows a power_meter-acpi-0


As an example of the difference it can make, here's the output of
  powerstat -D 1
on my laptop.

  TimeUser  Nice   Sys  IdleIO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts
16:41:55   0.1   0.0   0.1  99.8   0.01403246101   2.62
16:41:56   0.1   0.0   0.1  99.8   0.01357196101   2.72
16:41:57   0.1   0.0   0.1  99.6   0.21510231404   2.64
16:41:58   0.1   0.0   0.1  99.9   0.02   1350758   64   62   63   4.06
16:41:59   0.3   0.0   1.0  98.7   0.02   4166   2406  244  243  244   7.20
16:42:00   0.2   0.0   0.7  99.1   0.02   4203   2353  247  246  247   7.21
16:42:01   0.5   0.0   1.6  98.0   0.02   4079   2395  240  239  240   7.08
16:42:02   0.5   0.0   0.9  98.7   0.02   4097   2405  245  243  245   7.20
16:42:03   0.4   0.0   1.3  98.3   0.02   4117   2311  243  242  243   7.14
16:42:04   0.1   0.0   0.4  99.4   0.11   1721   1152   70   70   71   4.48
16:42:05   0.1   0.0   0.2  99.8   0.01433250101   2.92
16:42:06   0.0   0.0   0.3  99.7   0.01400231101   2.66

In the period of higher power etc usage I ran
while true;do sleep 0.001;done
and then interupted that after a bit with ctrl-c.

Same thing on my workstation (:

  TimeUser  Nice   Sys  IdleIO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts
16:43:48   1.0   0.0   0.2  98.7   0.11   8218   2354000  46.43
16:43:49   1.1   0.0   0.3  98.7   0.01   7866   2477000  45.99
16:43:50   1.1   0.0   0.4  98.5   0.02   7753   2996000  48.93
16:43:51   0.8   0.0   1.7  97.5   0.01   9395   5285000  75.48
16:43:52   0.5   0.0   1.7  97.8   0.01   9141   4806000  75.30
16:43:53   1.1   0.0   1.8  97.1   0.02  10065   5504000  76.27
16:43:54   1.3   0.0   1.5  97.2   0.01  10962   5165000  76.33
16:43:55   0.9   0.0   0.8  98.3   0.01   8452   3939000  61.99
16:43:56   0.6   0.0   0.1  99.3   0.02   6541   1999000  40.92
16:43:57   0.9   0.0   0.2  98.9   0.02   8199   2477000  42.91


And if I query the power supply via ACPI instead:

while true;do sensors power_meter-acpi-0|grep power1|awk '{print $2, $3}';sleep 1;done
...
163.00 W
173.00 W
173.00 W
172.00 W
203.00 W
206.00 W
206.00 W
206.00 W
209.00 W
205.00 W
211.00 W
213.00 W
203.00 W
175.00 W
166.00 W
...

As you can see the difference is quite substantial. This is solely due to a
1ms sleep loop (albeit one where we fork a process after each sleep, which
likely is a good chunk of the CPU usage).

However, the difference in power usage will be far smaller if the system
already busy, because the CPU will already run at a high frequency.

Greetings,

Andres Freund




Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tom Lane
Tomas Vondra  writes:
> Or perhaps what if we have a function that quickly determines if the
> attribute has MCV, without loading it? I'd bet the expensive part of
> get_attstatslot() is the deconstruct_array().
> We could have a function that only does the first small loop over slots,
> and returns true/false if we have a slot of the requested stakind.

Yeah, I like this idea.

> It might even check the isunique flag first, to make it more convenient.

That would tie it to this one use-case, which doesn't seem advisable.
I think we should forget the known-unique angle and just do quick
checks to see if both sides have MCVs.

regards, tom lane




Re: Perform streaming logical transactions by background workers and parallel apply

2022-11-17 Thread Masahiko Sawada
On Tue, Nov 15, 2022 at 8:57 PM houzj.f...@fujitsu.com
 wrote:
>
> On Saturday, November 12, 2022 7:06 PM Amit Kapila 
> >
> > On Fri, Nov 11, 2022 at 2:12 PM houzj.f...@fujitsu.com
> >  wrote:
> > >
> >
> > Few comments on v46-0001:
> > ==
> >
>
> Thanks for the comments.
>
> > 1.
> > +static void
> > +apply_handle_stream_abort(StringInfo s)
> > {
> > ...
> > + /* Send STREAM ABORT message to the parallel apply worker. */
> > + parallel_apply_send_data(winfo, s->len, s->data);
> > +
> > + if (abort_toplevel_transaction)
> > + {
> > + parallel_apply_unlock_stream(xid, AccessExclusiveLock);
> >
> > Shouldn't we need to release this lock before sending the message as
> > we are doing for streap_prepare and stream_commit? If there is a
> > reason for doing it differently here then let's add some comments for
> > the same.
>
> Changed.
>
> > 2. It seems once the patch makes the file state as busy
> > (LEADER_FILESET_BUSY), it will only be accessible after the leader
> > apply worker receives a transaction end message like stream_commit. Is
> > my understanding correct? If yes, then why can't we make it accessible
> > after the stream_stop message? Are you worried about the concurrency
> > handling for reading and writing the file? If so, we can probably deal
> > with it via some lock for reading and writing to file for each change.
> > I think after this we may not need additional stream level lock/unlock
> > in parallel_apply_spooled_messages. I understand that you probably
> > want to keep the code simple so I am not suggesting changing it
> > immediately but just wanted to know whether you have considered
> > alternatives here.
>
> I thought about this, but it seems the current buffile design doesn't allow 
> two
> processes to open the same buffile at the same time(refer to the comment atop
> of BufFileOpenFileSet()). This means the LA needs to make sure the PA has
> closed the buffile before writing more changes into it. Although we could let
> the LA wait for that, but it could cause another kind of deadlock. Suppose the
> PA opened the file and is blocked when applying the just read change. And the
> LA starts to wait when trying to write the next set of streaming changes into
> file because the file is still opened by PA. Then the lock edge is like:
>
> LA (wait for file to be closed) -> PA1 (wait for unique lock in PA2) -> PA2
> (wait for stream lock held in LA)
>
> We could introduce another lock for this, but that seems not very great as we
> already had two kinds of locks here.
>
> Another solution which doesn't need a new lock could be that we create
> different filename for each streaming block so that the leader doesn't need to
> reopen the same file after writing changes into it, but that seems largely
> increase the number of temp files and looks a bit hacky. Or we could let PA
> open the file, then read and close the file for each change, but it seems 
> bring
> some overhead of opening and closing file.
>
> Based on above, how about keep the current approach ?(i.e. PA
> will open the file only after the leader apply worker receives a transaction
> end message like stream_commit). Ideally, it will enter partial serialize mode
> only when PA is blocked by a backend or another PA which seems not that 
> common.

+1. We can improve this area later in a separate patch.

Here are review comments on v47-0001 and v47-0002 patches:

When the parallel apply worker exited, I got the following server log.
I think this log is not appropriate since the worker was not
terminated by administrator command but exited by itself. Also,
probably it should exit with exit code 0?

FATAL:  terminating logical replication worker due to administrator command
LOG:  background worker "logical replication parallel worker" (PID
3594918) exited with exit code 1

---
/*
 * Stop the worker if there are enough workers in the pool or the leader
 * apply worker serialized part of the transaction data to a file due to
 * send timeout.
 */
if (winfo->serialize_changes ||
napplyworkers > (max_parallel_apply_workers_per_subscription / 2))

Why do we need to stop the worker if the leader serializes changes?

---
+/*
+ * Release all session level locks that could be held in parallel apply
+ * mode.
+ */
+LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+

I think we call LockReleaseAll() at the process exit (in ProcKill()),
but do we really need to do LockReleaseAll() here too?

---

+elog(ERROR, "could not find replication state slot
for replication"
+  

Re: Optimize join selectivity estimation by not reading MCV stats for unique join attributes

2022-11-17 Thread Tomas Vondra



On 11/14/22 10:19, David Geier wrote:
> Hi Tom,
>> There won't *be* any MCV stats for a column that ANALYZE perceives to
>> be unique, so I'm not quite sure where the claimed savings comes from.
>
> We save if one join attribute is unique while the other isn't. In that
> case stored MCV stats are read for the non-unique attribute but then
> never used. This is because MCV stats in join selectivity estimation are
> only used if they're present on both columns
>

Right - if we only have MCV on one side of the join, we currently end up
loading the MCV we have only to not use it anyway. The uniqueness is a
simple way to detect some of those cases. I'd bet the savings can be
quite significant for small joins and/or cases with large MCV.

I wonder if we might be yet a bit smarter, though.

For example, assume the first attribute is not defined as "unique" but
we still don't have an MCV (it may be unique - or close to unique - in
practice, or maybe it's just a uniform distribution). We end up with

  have_mcvs1 = false

Can't we just skip trying to load the second MCV? So we could do

if (have_mcvs1 && HeapTupleIsValid(vardata2.statsTuple))
{ ... try loading mcv2 ... }

Or perhaps what if we have a function that quickly determines if the
attribute has MCV, without loading it? I'd bet the expensive part of
get_attstatslot() is the deconstruct_array().

We could have a function that only does the first small loop over slots,
and returns true/false if we have a slot of the requested stakind. It
might even check the isunique flag first, to make it more convenient.

And only if both sides return "true" we'd load the MCV, deconstruct the
array and all that.
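
Something as small as this would do (a hypothetical helper, sketched from
the first loop in get_attstatsslot(); passing flags = 0 to get_attstatsslot()
itself would behave much the same way):

/* Hypothetical helper: report whether a stats slot of the requested kind
 * exists, without deconstructing any arrays. */
static bool
attstats_slot_exists(HeapTuple statstuple, int reqkind)
{
	Form_pg_statistic stats = (Form_pg_statistic) GETSTRUCT(statstuple);
	int			i;

	for (i = 0; i < STATISTIC_NUM_SLOTS; i++)
	{
		if ((&stats->stakind1)[i] == reqkind)
			return true;
	}
	return false;
}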


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: allowing for control over SET ROLE

2022-11-17 Thread Jeff Davis
On Thu, 2022-11-17 at 16:52 -0500, Robert Haas wrote:

> But I think the bigger
> reason is that, in my opinion, this proposal is more generally
> useful,
> because it takes no position on why you wish to disallow SET ROLE.
> You
> can just disallow it in some cases and allow it in others, and that's
> fine.

I agree that it's more flexible in the sense that it does what it
does, and administrators can use it if it's useful for them. That means
we don't need to understand the actual goals as well; but it also means
that it's harder to determine the consequences if we tweak the behavior
(or any related behavior) later.

I'll admit that I don't have an example of a likely problem here,
though.

> That proposal targets a specific use case, which may make it a
> better solution to that particular problem, but it makes it
> unworkable
> as a solution to any other problem, I believe.

Yeah, that's the flip side: "virtual" roles (for lack of a better name)
are a more narrow fix for the problem as I understand it; but it might
leave related problems unfixed. You and Stephen[2] both seemed to
consider this approach, and I happened to like it, so I wanted to make
sure that it wasn't dismissed too quickly.

But I'm fine if you'd like to move on with the SET ROLE privilege
instead, as long as we believe it grants a stable set of capabilities
(and conversely, that if the SET ROLE privilege is revoked, that it
revokes a stable set of capabilities).

[2]
https://www.postgresql.org/message-id/YzIAGCrxoXibAKOD%40tamriel.snowman.net


-- 
Jeff Davis
PostgreSQL Contributor Team - AWS






Re: Mingw task for Cirrus CI

2022-11-17 Thread Andres Freund
Hi,

On 2022-10-24 14:13:06 +0300, Melih Mutlu wrote:
> If you say so, then I think it's ready.

I pushed it with a few changes:

- I added an only_if, otherwise the task shows up (without running) even if
  you use ci-os-only: linux
- I added -Dnls=disabled - It seemed to add about 1.5min to the cached build
- Inlined the -l into BASH_EXE and renamed it to BASH
- The indentation of WINDOWS_ENVIRONMENT_BASE was irregular, leading to a
  larger diff. Some other similar cleanups.
- I added 'mingw' as a ci-os-only choice. Perhaps not the
  be-all-end-all. Perhaps we should use windows-{msvc,mingw,cygwin}? Or add
  ci-task-only?


I suspect we can make at least -Dssl=openssl work. Perhaps one of the uuid
implementation is available in msys as well?

Currently we end up with:
   External libraries
 bonjour: NO
 bsd_auth   : NO
 gss: NO
 icu: YES 72.1
 ldap   : YES
 libxml : YES 2.10.3
 libxslt: YES 1.1.37
 llvm   : NO
 lz4: YES 1.9.4
 nls: NO
 pam: NO
 plperl : YES
 plpython   : YES 3.10
 pltcl  : YES
 readline   : YES
 selinux: NO
 ssl: NO
 systemd: NO
 uuid   : NO
 zlib   : YES 1.2.13
 zstd   : YES 1.5.2

gss should also work if it is available in the msys repo.

Bonjour is apple only, bsd_auth bsd specific, selinux and systemd linux
specific and pam unixoid specific.

It could also be interesting to see if llvm works, but that might be more
work.

Greetings,

Andres Freund




Re: redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Li Japin



> On Nov 18, 2022, at 4:04 AM, Robert Haas  wrote:
> 
> On Thu, Nov 17, 2022 at 10:56 AM Tom Lane  wrote:
>> This is a completely bad idea.  If it takes that level of analysis
>> to see that msg can't be null, we should leave the test in place.
>> Any future modification of either this code or what it calls could
>> break the conclusion.
> 
> +1. Also, even if the check were quite obviously useless, it's cheap
> insurance. It's difficult to believe that it hurts performance in any
> measurable way. If anything, we would benefit from having more sanity
> checks in our code, rather than fewer.
> 

Thanks for the explanation! Got it.





Re: Don't treate IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.

2022-11-17 Thread 正华吕
Hi, thanks a lot for your reply!

> Why do you think this is an improvement?

I hit the issue in Greenplum Database (Massively Parallel Postgres); in the
MPP architecture the coordinator dispatches statements to the segments.
The dispatch logic is quite different for AlterTable and CreateTableLike:

* alter table: the sub commands are not dispatched individually; later the
  alter table statement is dispatched as a whole.
* create table like: for a statement like `create table t (like t1 including
  indexes);`, its generated second statement has to be dispatched to the
  segments, but since it is now treated as alter table, the dispatch logic
  is broken for this case in Greenplum.

I looked into the issue, and Greenplum Database wants to stay aligned with
upstream as much as possible. That is why I am asking if we can force it to
false.

Best,
Zhenghua Lyu


Tom Lane  于2022年11月18日周五 06:26写道:

> 正华吕  writes:
> >   Recently read the code, I find that when calling DefineIndex
> >   from ProcessUtilitySlow, is_alter_table will be set true if
> >   this statement is came from expandTableLikeClause.
>
> Yeah.
>
> >   Based on the above, I think  we can always a false value
> >   for is_alter_table when DefineIndex is called from
> >   ProcessUtilitySlow.
>
> Why do you think this is an improvement?  Even if it's correct,
> the code savings is so negligible that I'm not sure I want to
> expend brain cells on figuring out whether it's correct.  The
> comment you want to remove does not suggest that it's optional
> which value we should pass, so I think the burden of proof
> is to show that this patch is okay not that somebody else
> has to demonstrate that it isn't.
>
> regards, tom lane
>


Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-17 Thread Peter Geoghegan
On Thu, Nov 17, 2022 at 9:02 AM Peter Geoghegan  wrote:
> Plan is to commit this later on today, barring objections.

Pushed, thanks.

-- 
Peter Geoghegan




Re: Strange failure on mamba

2022-11-17 Thread Thomas Munro
On Fri, Nov 18, 2022 at 11:35 AM Thomas Munro  wrote:
> I wonder if it's a runtime variant of the other problem.  We do
> load_file("libpqwalreceiver", false) before unblocking signals but
> maybe don't resolve the symbols until calling them, or something like
> that...

Hmm, no, I take that back.  A key ingredient was that a symbol was
being resolved inside the signal handler, which is a postmaster-only
thing.




Re: Strange failure on mamba

2022-11-17 Thread Tom Lane
Thomas Munro  writes:
> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane  wrote:
>> mamba has been showing intermittent failures in various replication
>> tests since day one.

> I wonder if it's a runtime variant of the other problem.  We do
> load_file("libpqwalreceiver", false) before unblocking signals but
> maybe don't resolve the symbols until calling them, or something like
> that...

Yeah, that or some other NetBSD bug could be the explanation, too.
Without a stack trace it's hard to have any confidence about it,
but I've been unable to reproduce the problem outside the buildfarm.
(Which is a familiar refrain.  I wonder what it is about the buildfarm
environment that makes it act different from the exact same code running
on the exact same machine.)

So I'd like to have some way to make the postmaster send SIGABRT instead
of SIGKILL in the buildfarm environment.  The lowest-tech way would be
to drive that off some #define or other.  We could scale it up to a GUC
perhaps.  Adjacent to that, I also wonder whether SIGABRT wouldn't be
more useful than SIGSTOP for the existing SendStop half-a-feature ---
the idea that people should collect cores manually seems mighty
last-century.
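
To illustrate the lowest-tech version of that (the macro and helper names here are entirely hypothetical, not actual postmaster code -- just the shape of the #define idea):

#include <signal.h>
#include <sys/types.h>

#ifndef RECALCITRANT_CHILD_SIGNAL
#define RECALCITRANT_CHILD_SIGNAL SIGKILL	/* today's behavior */
#endif

static void
kill_recalcitrant_child(pid_t pid)
{
	/* a buildfarm build could use -DRECALCITRANT_CHILD_SIGNAL=SIGABRT to get cores */
	kill(pid, RECALCITRANT_CHILD_SIGNAL);
}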

regards, tom lane




Re: contrib: auth_delay module

2022-11-17 Thread Tom Lane
成之焕  writes:
> The attached patch is a contrib module to set login restrictions on users
> with too many authentication failures. The administrator can manage several
> GUC parameters to control the login restrictions, which are listed below.
> - set the wait time when password authentication fails.
> - allow the wait time to grow when logins from the same IP fail consecutively.
> - set the maximum number of authentication failures allowed for the same
> user. The system will prevent a user who accumulates too many authentication
> failures from entering the database.

I'm not yet forming an opinion on whether this is useful enough
to accept.  However, I wonder why you chose to add this functionality
to auth_delay instead of making a new, independent module.
It seems fairly unrelated to what auth_delay does, and the
newly-created requirement that the module be preloaded might
possibly break some existing use-case for auth_delay.

Also, a patch that lacks user documentation and has no code comments to
speak of seems unlikely to draw serious review.

regards, tom lane




Re: Strange failure on mamba

2022-11-17 Thread Thomas Munro
On Fri, Nov 18, 2022 at 11:08 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > I wonder why the walreceiver didn't start in
> > 008_min_recovery_point_node_3.log here:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38
>
> mamba has been showing intermittent failures in various replication
> tests since day one.  My guess is that it's slow enough to be
> particularly subject to the signal-handler race conditions that we
> know exist in walreceivers and elsewhere.  (Now, it wasn't any faster
> in its previous incarnation as a macOS critter.  But maybe modern
> NetBSD has different scheduler behavior than ancient macOS and that
> contributes somehow.  Or maybe there's some other NetBSD weirdness
> in here.)
>
> I've tried to reproduce manually, without much success :-(
>
> Like many of its other failures, there's a suggestive postmaster
> log entry at the very end:
>
> 2022-11-16 19:45:53.851 EST [2036:4] LOG:  received immediate shutdown request
> 2022-11-16 19:45:58.873 EST [2036:5] LOG:  issuing SIGKILL to recalcitrant 
> children
> 2022-11-16 19:45:58.881 EST [2036:6] LOG:  database system is shut down
>
> So some postmaster child is stuck somewhere where it's not responding
> to SIGQUIT.  While it's not unreasonable to guess that that's a
> walreceiver, there's no hard evidence of it here.  I've been wondering
> if it'd be worth patching the postmaster so that it's a bit more verbose
> about which children it had to SIGKILL.  I've also wondered about
> changing the SIGKILL to SIGABRT in hopes of reaping a core file that
> could be investigated.

I wonder if it's a runtime variant of the other problem.  We do
load_file("libpqwalreceiver", false) before unblocking signals but
maybe don't resolve the symbols until calling them, or something like
that...




Re: Don't treat IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.

2022-11-17 Thread Tom Lane
Zhenghua Lyu  writes:
>   While recently reading the code, I found that when calling DefineIndex
>   from ProcessUtilitySlow, is_alter_table will be set true if
>   this statement came from expandTableLikeClause.

Yeah.

>   Based on the above, I think we can always pass a false value
>   for is_alter_table when DefineIndex is called from
>   ProcessUtilitySlow.

Why do you think this is an improvement?  Even if it's correct,
the code savings is so negligible that I'm not sure I want to
expend brain cells on figuring out whether it's correct.  The
comment you want to remove does not suggest that it's optional
which value we should pass, so I think the burden of proof
is to show that this patch is okay not that somebody else
has to demonstrate that it isn't.

regards, tom lane




Re: allow segment size to be set to < 1GiB

2022-11-17 Thread Andrew Dunstan


On 2022-11-17 Th 10:48, Tom Lane wrote:
> Andres Freund  writes:
>> On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote:
>>> Are we going to impose some sane minimum, or leave it up to developers
>>> to discover that for themselves?
>> I don't think we should. It's actually useful to e.g. use 1 page sized
>> segments for testing, and with one exceptions the tests pass with it too. Do
>> you see a reason to impose one?
> Yeah, I think we should allow setting it to 1 block.  This switch is
> only for testing purposes (I hope the docs make that clear).
>
>   


Yeah clearly if 1 is useful there's no point in limiting it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Fix proposal for comparaison bugs in PostgreSQL::Version

2022-11-17 Thread Andrew Dunstan


On 2022-11-04 Fr 10:06, Jehan-Guillaume de Rorthais wrote:
> On Thu, 3 Nov 2022 13:11:18 -0500
> Justin Pryzby  wrote:
>
>> On Tue, Jun 28, 2022 at 06:17:40PM -0400, Andrew Dunstan wrote:
>>> Nice catch, but this looks like massive overkill. I think we can very
>>> simply fix the test in just a few lines of code, instead of a 190 line
>>> fix and a 130 line TAP test.
>>>
>>> It was never intended to be able to compare markers like rc1 vs rc2, and
>>> I don't see any need for it. If you can show me a sane use case I'll
>>> have another look, but right now it seems quite unnecessary.
>>>
>>> Here's my proposed fix.
>>>
>>> diff --git a/src/test/perl/PostgreSQL/Version.pm
>>> b/src/test/perl/PostgreSQL/Version.pm index 8f70491189..8d4dbbf694 100644
>>> --- a/src/test/perl/PostgreSQL/Version.pm  
>> Is this still an outstanding issue ?
> The issue still exists on current HEAD:
>
>   $ perl -Isrc/test/perl/ -MPostgreSQL::Version -le \
>   'print "bug" if PostgreSQL::Version->new("9.6") <= 9.0'
>   bug
>
> Regards,
>

Oops, this slipped off my radar. I'll apply a fix shortly, thanks for
the reminder.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Strange failure on mamba

2022-11-17 Thread Tom Lane
Thomas Munro  writes:
> I wonder why the walreceiver didn't start in
> 008_min_recovery_point_node_3.log here:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

mamba has been showing intermittent failures in various replication
tests since day one.  My guess is that it's slow enough to be
particularly subject to the signal-handler race conditions that we
know exist in walreceivers and elsewhere.  (Now, it wasn't any faster
in its previous incarnation as a macOS critter.  But maybe modern
NetBSD has different scheduler behavior than ancient macOS and that
contributes somehow.  Or maybe there's some other NetBSD weirdness
in here.)

I've tried to reproduce manually, without much success :-(

Like many of its other failures, there's a suggestive postmaster
log entry at the very end:

2022-11-16 19:45:53.851 EST [2036:4] LOG:  received immediate shutdown request
2022-11-16 19:45:58.873 EST [2036:5] LOG:  issuing SIGKILL to recalcitrant 
children
2022-11-16 19:45:58.881 EST [2036:6] LOG:  database system is shut down

So some postmaster child is stuck somewhere where it's not responding
to SIGQUIT.  While it's not unreasonable to guess that that's a
walreceiver, there's no hard evidence of it here.  I've been wondering
if it'd be worth patching the postmaster so that it's a bit more verbose
about which children it had to SIGKILL.  I've also wondered about
changing the SIGKILL to SIGABRT in hopes of reaping a core file that
could be investigated.

regards, tom lane




Patch: Global Unique Index

2022-11-17 Thread Cary Huang
Patch: Global Unique Index 



“Global unique index” in our definition is a unique index on a partitioned 
table that can ensure cross-partition uniqueness using a non-partition key. 
This work is inspired by this email thread, “Proposal: Global Index” started 
back in 2019 (link below). My colleague David and I took a different approach 
to implementing a feature that enforces a uniqueness constraint spanning 
multiple partitions. We achieved this mainly with application logic, without 
heavy modification to Postgres's current partitioned table/index structure. In 
other words, a global unique index and a regular partitioned index are 
essentially the same in terms of their storage structure, except that one can 
do a cross-partition uniqueness check and the other cannot.



https://www.postgresql.org/message-id/CALtqXTcurqy1PKXzP9XO%3DofLLA5wBSo77BnUnYVEZpmcA3V0ag%40mail.gmail.com



- Patch -

The attached patches were generated based on commit 
`85d8b30724c0fd117a683cc72706f71b28463a05` on master branch.



- Benefits of global unique index -

1. Ensure uniqueness spanning all partitions using a non-partition key column

2. Allow a user to create a unique index on a non-partition key column without 
the need to include the partition key (current Postgres enforces this)

3. Query performance increase when using a single unique non-partition key 
column





- Supported Features -

1. Global unique index is supported only on btree index type

2. Global unique index is useful only when created on a partitioned table.

3. Cross-partition uniqueness check with CREATE UNIQUE INDEX in serial and 
parallel mode

4. Cross-partition uniqueness check with ATTACH in serial and parallel mode

5. Cross-partition uniqueness check when INSERT and UPDATE





- Not-supported Features -

1. Global uniqueness check with sub-partitioned tables is not yet supported, as 
we do not have an immediate use case and it may involve major changes to the 
current implementation





- Global unique index syntax -

A global unique index can be created with "GLOBAL" and "UNIQUE" clauses in a 
"CREATE INDEX" statement run against a partitioned table. For example,



CREATE UNIQUE INDEX global_index ON idxpart(bid) GLOBAL;





- New Relkind: RELKIND_GLOBAL_INDEX -

When a global unique index is created on a partitioned table, its relkind is 
RELKIND_PARTITIONED_INDEX (I), the same as for a regular index. Postgres will 
then recursively create an index on each child partition, except that the 
relkind is set to RELKIND_GLOBAL_INDEX (g) instead of RELKIND_INDEX (i). This 
new relkind, along with the uniqueness flag, is needed for the cross-partition 
uniqueness check later.





- Create a global unique index -

To create a regular unique index on a partitioned table, Postgres has to 
perform a heap scan and a sort on every child partition. The uniqueness check 
happens during the sorting phase and raises an error if multiple tuples with 
the same index key are sorted together. To achieve a global uniqueness check, 
we make Postgres perform the sorting after all of the child partitions have 
been scanned, instead of in the sort-per-partition fashion. In other words, 
the sorting happens only once at the very end, over the tuples coming from all 
the partitions, and can therefore ensure global uniqueness.



In the parallel index build case, the idea is the same, except that the tuples 
are put into a shared fileset (temp files) on disk instead of in memory so 
that other workers can share the sort results. At the end of the very last 
partition build, we make Postgres take over all the temp files and perform a 
final merge sort to ensure global uniqueness.



Example:



> CREATE TABLE gidx_part(a int, b int, c text) PARTITION BY RANGE (a);

> CREATE TABLE gidx_part1 PARTITION OF gidx_part FOR VALUES FROM (0) to (10);

> CREATE TABLE gidx_part2 PARTITION OF gidx_part FOR VALUES FROM (10) to (20);

> INSERT INTO gidx_part values(5, 5, 'test');

> INSERT INTO gidx_part values(15, 5, 'test');

> CREATE UNIQUE INDEX global_unique_idx ON gidx_part(b) GLOBAL;

ERROR:  could not create unique index "gidx_part1_b_idx"

DETAIL:  Key (b)=(5) is duplicated.





- INSERT and UPDATE -

For every new tuple inserted or updated, Postgres attempts to fetch the same 
tuple from current partition to determine if a duplicate already exists. In the 
global unique index case, we make Postgres attempt to fetch the same tuple from 
other partitions as well as the current partition. If a duplicate is found, 
global uniqueness is violated and an error is raised.
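
(A rough sketch of that per-tuple check, with hypothetical helper names standing in for the patch's actual plumbing -- this paraphrases the described behavior rather than quoting the patch; the SQL example below shows the user-visible effect.)

/*
 * Sketch only: probe every other leaf partition's index for the new key.
 * index_key_exists() is a hypothetical helper.
 */
static bool
global_unique_key_exists_elsewhere(List *leaf_partitions,
								   Relation current_partition,
								   Datum *values, bool *isnull)
{
	ListCell   *lc;

	foreach(lc, leaf_partitions)
	{
		Relation	part = (Relation) lfirst(lc);

		if (part == current_partition)
			continue;			/* the ordinary local check covers this one */

		if (index_key_exists(part, values, isnull))		/* hypothetical */
			return true;		/* duplicate found in another partition */
	}

	return false;
}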



Example:



> CREATE TABLE gidx_part (a int, b int, c text) PARTITION BY RANGE (a);

> CREATE TABLE gidx_part1 partition of gidx_part FOR VALUES FROM (0) TO (10);

> CREATE TABLE gidx_part2 partition of gidx_part FOR VALUES FROM (10) TO (20);

> CREATE UNIQUE INDEX global_unique_idx ON gidx_part USING BTREE(b) GLOBAL;

> INSERT INTO gidx_part values(5, 5, 'test');

> INSERT INTO gidx_part values(15, 5, 'test');

Strange failure on mamba

2022-11-17 Thread Thomas Munro
Hi,

I wonder why the walreceiver didn't start in
008_min_recovery_point_node_3.log here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-16%2023%3A13%3A38

There was the case of commit 8acd8f86, but that involved a deadlocked
postmaster whereas this one still handled a shutdown request.




Re: allowing for control over SET ROLE

2022-11-17 Thread Robert Haas
On Tue, Nov 15, 2022 at 7:23 PM Jeff Davis  wrote:
> On Tue, 2022-11-15 at 12:07 -0500, Robert Haas wrote:
> > If anyone else wants to comment, or if either of those people want to
> > comment further, please speak up soon.
>
> Did you have some thoughts on:
>
> https://postgr.es/m/a41d606d03b629c2ef0ed274ae3b04a2c266.ca...@j-davis.com

I mean, I think what we were discussing there could be done, but it's
not the approach I like best. That's partly because that was just a
back-of-the-envelope sketch of an idea, not a real proposal for
something with a clear implementation path. But I think the bigger
reason is that, in my opinion, this proposal is more generally useful,
because it takes no position on why you wish to disallow SET ROLE. You
can just disallow it in some cases and allow it in others, and that's
fine. That proposal targets a specific use case, which may make it a
better solution to that particular problem, but it makes it unworkable
as a solution to any other problem, I believe.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Outdated url to Hunspell

2022-11-17 Thread Daniel Gustafsson
I happened to notice that the link to Hunspell in our documentation goes to the
hunspell sourceforge page last updated in 2015.  The project has since moved to
Github [0] with hunspell.sourceforge.net redirecting there, I'm not sure
exactly when but Wikipedia updated their link entry in 2016 [1] so it seems
about time we do too.

Unless objected to I'll apply the attached to master with the doc/ hunk to all
supported branches.

--
Daniel Gustafsson   https://vmware.com/

[0] https://hunspell.github.io/
[1] 
https://en.wikipedia.org/w/index.php?title=Hunspell&diff=704306462&oldid=697230645



hunspell_url.diff
Description: Binary data


Re: ubsan fails on 32bit builds

2022-11-17 Thread Thomas Munro
On Fri, Nov 18, 2022 at 9:13 AM Andres Freund  wrote:
> Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but
> somehow we ended up a bit stuck on the naming of dlist_delete variant that
> afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses.
>
> Should probably find and rebase those patches...

https://www.postgresql.org/message-id/flat/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de




Re: logical decoding and replication of sequences, take 2

2022-11-17 Thread Tomas Vondra


On 11/17/22 18:07, Andres Freund wrote:
> Hi,
> 
> On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
>> On 11/17/22 03:43, Andres Freund wrote:
>>> On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
 Well, yeah - we can either try to perform the stuff independently of the
 transactions that triggered it, or we can try making it part of some of
 the transactions. Each of those options has problems, though :-(

 The first version of the patch tried the first approach, i.e. decode the
 increments and apply that independently. But:

   (a) What would you do with increments of sequences created/reset in a
   transaction? Can't apply those outside the transaction, because it
   might be rolled back (and that state is not visible on primary).
>>>
>>> I think a reasonable approach could be to actually perform different WAL
>>> logging for that case. It'll require a bit of machinery, but could actually
>>> result in *less* WAL logging overall, because we don't need to emit a WAL
>>> record for each SEQ_LOG_VALS sequence values.
>>>
>>
>> Could you elaborate? Hard to comment without knowing more ...
>>
>> My point was that stuff like this (creating a new sequence or at least a
>> new relfilenode) means we can't apply that independently of the
>> transaction (unlike regular increments). I'm not sure how a change to
>> WAL logging would make that go away.
> 
> Different WAL logging would make it easy to handle that on the logical
> decoding level. We don't need to emit WAL records each time a
> created-in-this-toplevel-xact sequences gets incremented as they're not
> persisting anyway if the surrounding xact aborts. We already need to remember
> the filenode so it can be dropped at the end of the transaction, so we could
> emit a single record for each sequence at that point.
> 
> 
   (b) What about increments created before we have a proper snapshot?
   There may be transactions dependent on the increment. This is what
   ultimately led to revert of the patch.
>>>
>>> I don't understand this - why would we ever need to process those increments
>>> from before we have a snapshot?  Wouldn't they, by definition, be before the
>>> slot was active?
>>>
>>> To me this is the rough equivalent of logical decoding not giving the 
>>> initial
>>> state of all tables. You need some process outside of logical decoding to 
>>> get
>>> that (obviously we have some support for that via the exported data snapshot
>>> during slot creation).
>>>
>>
>> Which is what already happens during tablesync, no? We more or less copy
>> sequences as if they were tables.
> 
> I think you might have to copy sequences after tables, but I'm not sure. But
> otherwise, yea.
> 
> 
>>> I assume that part of the initial sync would have to be a new sequence
>>> synchronization step that reads all the sequence states on the publisher and
>>> ensures that the subscriber sequences are at the same point. There's a bit 
>>> of
>>> trickiness there, but it seems entirely doable. The logical replication 
>>> replay
>>> support for sequences will have to be a bit careful about not decreasing the
>>> subscriber's sequence values - the standby initially will be ahead of the
>>> increments we'll see in the WAL. But that seems inevitable given the
>>> non-transactional nature of sequences.
>>>
>>
>> See fetch_sequence_data / copy_sequence in the patch. The bit about
>> ensuring the sequence does not go away (say, using page LSN and/or LSN
>> of the increment) is not there, however isn't that pretty much what I
>> proposed doing for "reconciling" the sequence state logged at COMMIT?
> 
> Well, I think the approach of logging all sequence increments at commit is the
> wrong idea...
> 

But we're not logging all sequence increments, no?

We're logging the state for each sequence touched by the transaction,
but only once - if the transaction incremented the sequence 100
times, we'll still log it just once (at least for this particular purpose).

Yes, if transactions touch each sequence just once, then we're logging
individual increments.

The only more efficient solution would be to decode the existing WAL
(every ~32 increments), and perhaps also track which sequences were
accessed by a transaction. We could then simply stash the increments in a
global reorderbuffer hash table and apply only the last one at
commit time. This would require the transactional / non-transactional
behavior (I think), but perhaps we can make that work.

Or are you thinking about some other scheme?

> Creating a new relfilenode whenever a sequence is incremented seems like a
> complete no-go to me. That increases sequence overhead by several orders of
> magnitude and will lead to *awful* catalog bloat on the subscriber.
> 

You mean on the apply side? Yes, I agree this needs a better
approach, I've focused on the decoding side so far.

> 
>>>
 This version of the patch tries to do the opposite thing - 

Re: Reducing power consumption on idle servers

2022-11-17 Thread Robert Haas
On Thu, Nov 17, 2022 at 2:55 AM Simon Riggs
 wrote:
> No, it will have a direct effect only on people using promote_trigger_file
> who do not read and act upon the deprecation notice before upgrading
> by making a one line change to their failover scripts.

TBH, I wonder if we shouldn't just nuke promote_trigger_file.
pg_promote was added in 2018, and pg_ctl promote was added in 2011. So
even if we haven't said promote_trigger_file was deprecated, it hasn't
been the state-of-the-art way of doing this in a really long time. If
we think that there are still a lot of people using it, or if popular
tools are relying on it, then perhaps a deprecation period is
warranted anyway. But I think we should at least consider the
possibility that it's OK to just get rid of it right away.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-17 Thread Tomas Vondra



On 11/17/22 18:29, Simon Riggs wrote:
> On Thu, 17 Nov 2022 at 17:04, Simon Riggs  
> wrote:
>>
>> New version with greatly improved comments coming very soon.
> 
>>> Perhaps it would be a good idea to split up the patch.  The business
>>> about making pg_subtrans flat rather than a tree seems like a good
>>> idea in any event, although as I said it doesn't seem like we've got
>>> a fleshed-out version of that here.  We could push forward on getting
>>> that done and then separately consider the rest of it.
>>
>> Yes, I thought you might ask that so, after some thought, have found a
>> clean way to do that and have split this into two parts.
> 
> Attached.
> 
> 002 includes many comment revisions, as well as flattening the loops
> in SubTransGetTopmostTransaction and TransactionIdDidCommit/Abort
> 003 includes the idea to not-always do SubTransSetParent()
> 

I'm a bit confused by the TransactionIdsAreOnSameXactPage naming. Isn't
this really checking clog pages?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ubsan fails on 32bit builds

2022-11-17 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-17 14:20:47 -0500, Robert Haas wrote:
>> Not that I object to a targeted fix

> Should we backpatch this fix? Likely this doesn't cause active breakage
> outside of 32bit builds under ubsan, but that's not an unreasonable thing to
> want to do in the backbranches.

+1 for backpatching what you showed.

>> but it's been 10 years since
>> slist and dlist were committed, and we really ought to eliminate
>> SHM_QUEUE entirely in favor of using those.

> Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but
> somehow we ended up a bit stuck on the naming of dlist_delete variant that
> afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses.

Also +1, but of course for HEAD only.

regards, tom lane




Re: ubsan fails on 32bit builds

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 14:20:47 -0500, Robert Haas wrote:
> On Wed, Nov 16, 2022 at 8:42 PM Andres Freund  wrote:
> > Afaict the problem is that
> > proc = (PGPROC *) &(waitQueue->links);
> >
> > is a gross gross hack - this isn't actually a PGPROC, it's pointing to an
> > SHM_QUEUE, but *not* one embedded in PGPROC.  It kinda works because ->links
> > is at offset 0 in PGPROC, which means that
> > SHMQueueInsertBefore(&(proc->links), &(MyProc->links));
> > will turn &proc->links back into waitQueue->links. Which we then can enqueue
> > again.
> 
> Not that I object to a targeted fix

Should we backpatch this fix? Likely this doesn't cause active breakage
outside of 32bit builds under ubsan, but that's not an unreasonable thing to
want to do in the backbranches.


> but it's been 10 years since
> slist and dlist were committed, and we really ought to eliminate
> SHM_QUEUE entirely in favor of using those. It's basically an
> open-coded implementation of something for which we now have a
> toolkit. Not that it's impossible to make this kind of mistake with a
> toolkit, but in general open-coding the same logic in multiple places
> increases the risk of bugs.

Agreed. I had started on a set of patches for some of the SHM_QUEUE uses, but
somehow we ended up a bit stuck on the naming of dlist_delete variant that
afterwards zeroes next/prev so we can replace SHMQueueIsDetached() uses.

Should probably find and rebase those patches...
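
For concreteness, the variant in question is roughly the following (the name is exactly what has not been settled; dlist_delete() and dlist_node are the existing lib/ilist.h primitives):

#include "lib/ilist.h"

/* placeholder name -- the naming is the open question */
static inline void
dlist_delete_and_zero(dlist_node *node)
{
	dlist_delete(node);
	node->prev = NULL;
	node->next = NULL;
}

/* would stand in for SHMQueueIsDetached() */
static inline bool
dlist_node_is_detached(const dlist_node *node)
{
	return node->next == NULL;
}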

Greetings,

Andres Freund




Re: Allow single table VACUUM in transaction block

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 5:14 PM Greg Stark  wrote:
> However I'm not a fan of commands that sometimes do one thing and
> sometimes magically do something very different. I don't like the idea
> that the same vacuum command would sometimes run in-process and
> sometimes do this out of process request. And the rules for when it
> does which are fairly complex to explain -- it runs in process unless
> you're in a transaction when it runs out of process unless it's a
> temporary table ...

100% agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Robert Haas
On Thu, Nov 17, 2022 at 10:56 AM Tom Lane  wrote:
> This is a completely bad idea.  If it takes that level of analysis
> to see that msg can't be null, we should leave the test in place.
> Any future modification of either this code or what it calls could
> break the conclusion.

+1. Also, even if the check were quite obviously useless, it's cheap
insurance. It's difficult to believe that it hurts performance in any
measurable way. If anything, we would benefit from having more sanity
checks in our code, rather than fewer.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Allow single table VACUUM in transaction block

2022-11-17 Thread Justin Pryzby
On Wed, Nov 16, 2022 at 05:14:07PM -0500, Greg Stark wrote:
> I think this requesting autovacuum worker should be a distinct
> command. Or at least an explicit option to vacuum.

+1.  I was going to suggest VACUUM (NOWAIT) ..

-- 
Justin




Re: [PoC] configurable out of disk space elog level

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 7:59 AM Maxim Orlov  wrote:
> The customer, mentioned above, in this particular case, would be glad to be 
> able to have a mechanism to stop the cluster.
> Again, in this concrete case.
>
> My proposal is to add a tablespace option in order to be able to configure
> which behaviour is appropriate for a particular user. I've decided to call
> this option “on_no_space” for now. If anyone has a better name for this
> feature, please report it.

I don't think this is a good feature to add to PostgreSQL. First, it's
unclear why stopping the cluster is a desirable behavior. It doesn't
stop the user transactions from failing; it just makes them fail in
some other way. Now it is of course perfectly legitimate for a
particular user to want that particular behavior anyway, but there are
a bunch of other things that a user could equally legitimately want to
do, like page the DBA or trigger a failover or kill off sessions that
are using large temporary relations or whatever. And, equally, there
are many other operating system errors to which a user could want the
database system to respond in similar ways. For example, someone might
want any given one of those treatments when an I/O error occurs
writing to the data directory, or a read-only filesystem error, or a
permission denied error.

Having a switch for one particular kind of error (out of many that
could possibly occur) that triggers one particular coping strategy
(out of many that could possibly be used) seems far too specific a
thing to add as a core feature. And even if we had something much more
general, I'm not sure why that should go into the database rather than
being implemented outside it. After all, nothing at all prevents the
user from scanning the database logs for "out of space" errors and
shutting down the database if any are found. While you're at it, you
could make your monitoring script also check the free space on the
relevant partition using statfs() and page somebody if the utilization
goes above 95% or whatever threshold you like, which would probably
avoid service outages much more effectively than $SUBJECT.
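
(As a sketch of that external approach -- the path and threshold are illustrative, and statvfs() is used here as the portable relative of statfs():)

#include <stdio.h>
#include <sys/statvfs.h>

int
main(void)
{
	const char *path = "/var/lib/postgresql/data";	/* illustrative */
	struct statvfs fs;
	double		used;

	if (statvfs(path, &fs) != 0 || fs.f_blocks == 0)
		return 1;

	used = 1.0 - (double) fs.f_bavail / (double) fs.f_blocks;
	if (used > 0.95)
		printf("WARNING: %s is %.0f%% full\n", path, used * 100.0);

	return 0;
}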

I just can't see much real benefit in putting this logic inside the database.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ubsan fails on 32bit builds

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 8:42 PM Andres Freund  wrote:
> Afaict the problem is that
> proc = (PGPROC *) &(waitQueue->links);
>
> is a gross gross hack - this isn't actually a PGPROC, it's pointing to an
> SHM_QUEUE, but *not* one embedded in PGPROC.  It kinda works because ->links
> is at offset 0 in PGPROC, which means that
> SHMQueueInsertBefore(&(proc->links), &(MyProc->links));
> will turn &proc->links back into waitQueue->links. Which we then can enqueue
> again.

Not that I object to a targeted fix, but it's been 10 years since
slist and dlist were committed, and we really ought to eliminate
SHM_QUEUE entirely in favor of using those. It's basically an
open-coded implementation of something for which we now have a
toolkit. Not that it's impossible to make this kind of mistake with a
toolkit, but in general open-coding the same logic in multiple places
increases the risk of bugs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: when the startup process doesn't (logging startup delays)

2022-11-17 Thread Robert Haas
On Thu, Nov 17, 2022 at 2:22 AM Bharath Rupireddy
 wrote:
> Duplication is a problem that I agree with and I have an idea here -
> how about introducing a new function, say EnableStandbyMode() that
> sets StandbyMode to true and disables the startup progress timeout,
> something like the attached?

That works for me, more or less. But I think that
enable_startup_progress_timeout should be amended to either say if
(log_startup_progress_interval == 0 || StandbyMode) return; or else it
should at least Assert(!StandbyMode), so that we can't accidentally
re-enable the timer after we shut it off.
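
Roughly, reusing the shape of the existing enable_startup_progress_timeout() in startup.c (illustrative sketch; the StandbyMode test is the new part):

static void
enable_startup_progress_timeout(void)
{
	TimestampTz fin_time;

	/* feature disabled, or we've switched to standby mode: don't arm it */
	if (log_startup_progress_interval == 0 || StandbyMode)
		return;

	startup_progress_phase_start_time = GetCurrentTimestamp();
	fin_time = TimestampTzPlusMilliseconds(startup_progress_phase_start_time,
										   log_startup_progress_interval);
	enable_timeout_every(STARTUP_PROGRESS_TIMEOUT, fin_time,
						 log_startup_progress_interval);
}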

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: TAP output format in pg_regress

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 11:36:13 +0100, Daniel Gustafsson wrote:
> > On 10 Nov 2022, at 11:44, Nikolay Shaplov  wrote:
> > Did not find a quick way to include the prove TAP harness right into the
> > Makefile, so I check the dumped output, but it is not really important for
> > now, I guess.
>
> I think we'll start by adding TAP to the meson testrunner and await to see
> where the make buildsystem ends up before doing anything there.

+1

prove is of, uh, very questionable quality IME. I do not want to run
pg_regress/isolation tests through that unnecessarily. It'd not gain us much
anyway, afaict. And we don't want it as a hard dependency either.


> I'm not sure if that's the right way to go about configuring regress tests as
> TAP emitting, but it at least shows that meson is properly parsing the output
> AFAICT.

Looks correct to me.


> @@ -742,13 +823,13 @@ initialize_environment(void)
>   }
>
>   if (pghost && pgport)
> - printf(_("(using postmaster on %s, port %s)\n"), 
> pghost, pgport);
> + status(_("# (using postmaster on %s, port %s)\n"), 
> pghost, pgport);
>   if (pghost && !pgport)
> - printf(_("(using postmaster on %s, default port)\n"), 
> pghost);
> + status(_("# (using postmaster on %s, default port)\n"), 
> pghost);
>   if (!pghost && pgport)
> - printf(_("(using postmaster on Unix socket, port 
> %s)\n"), pgport);
> + status(_("# (using postmaster on Unix socket, port 
> %s)\n"), pgport);
>   if (!pghost && !pgport)
> - printf(_("(using postmaster on Unix socket, default 
> port)\n"));
> + status(_("# (using postmaster on Unix socket, default 
> port)\n"));
>   }

It doesn't seem quite right to include the '# ' bit in the translated
string. If a translator were to not include it we'd suddenly have incorrect
tap output. That's perhaps not likely, but ... It's also not really necessary
to have it in as many translated strings?

Not that I understand why we even make these strings translatable. It seems
like it'd be a bad idea to translate this kind of output.

But either way, it seems nicer to output the # inside a helper function?
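
Something like this, say (the helper name is arbitrary; pg_regress already has a variadic status() function this could sit next to):

#include <stdarg.h>
#include <stdio.h>

/* emit a TAP comment; the "# " marker never goes through translation */
static void
note(const char *fmt,...)
{
	va_list		ap;

	fputs("# ", stdout);
	va_start(ap, fmt);
	vfprintf(stdout, fmt, ap);
	va_end(ap);
	fflush(stdout);
}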


> + /*
> +  * In order for this information (or any error information) to be shown
> +  * when running pg_regress test suites under prove it needs to be 
> emitted
> +  * stderr instead of stdout.
> +  */
>   if (file_size(difffilename) > 0)
>   {
> - printf(_("The differences that caused some tests to fail can be 
> viewed in the\n"
> -  "file \"%s\".  A copy of the test summary that 
> you see\n"
> -  "above is saved in the file \"%s\".\n\n"),
> + status(_("\n# The differences that caused some tests to fail 
> can be viewed in the\n"
> +   "# file \"%s\".  A copy of the test summary 
> that you see\n"
> +   "# above is saved in the file \"%s\".\n\n"),
>  difffilename, logfilename);

The comment about needing to print to stderr is correct - but the patch
doesn't do so (anymore)?

The count of failed tests also should go to stderr (but not the report of all
tests having passed).

bail() probably also ought to print the error to stderr, so the most important
detail is immediately visible when looking at the failed test result.

Greetings,

Andres Freund




Re: HOT chain validation in verify_heapam()

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 21:33:17 +0530, Himanshu Upadhyaya wrote:
> On Tue, Nov 15, 2022 at 3:32 AM Andres Freund  wrote:
> > > Furthermore, it is
> > > possible that successor[x] = successor[x'] since the page might be
> > corrupted
> > > and we haven't checked otherwise.
> > >
> > > predecessor[y] = x means that successor[x] = y but in addition we've
> > > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > > tuples for which these conditions hold, we've issued complaints about
> > > all but one and entered the last into the predecessor array.
> >
> > As shown by the isolationtester test I just posted, this doesn't quite work
> > right now. Probably fixable.
> >
> > I don't think we can follow non-HOT ctid chains if they're older than the
> > xmin
> > horizon, including all cases of xmin being frozen. There's just nothing
> > guaranteeing that the tuples are actually "related".
> >
> I understand the problem with frozen tuples but don't understand the
> concern with non-HOT chains,
> could you please help with some explanation around it?

I think there might be cases where following non-HOT ctid-chains across tuples
within a page will trigger spurious errors, if the tuple versions are older
than the xmin horizon. But it's a bit hard to say without seeing the code with
a bunch of the other bugs fixed.

Greetings,

Andres Freund




Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-17 Thread Tom Lane
I wrote:
> I wonder whether we ought to back-patch f10f0ae42.  We could
> leave the RelationOpenSmgr macro in existence to avoid unnecessary
> breakage of extension code, but stop using it within our own code.

Concretely, about like this for v14 (didn't look at the older
branches yet).

I'm not sure whether to recommend that outside extensions switch to using
RelationGetSmgr in pre-v15 branches.  If they do, they run a risk
of compile failure should they be built against old back-branch
headers.  Once compiled, though, they'd work against any minor release
(since RelationGetSmgr is static inline, not something in the core
backend).  So maybe that'd be good enough, and keeping their code in
sync with what they need for v15 would be worth something.

regards, tom lane

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 7f8231a681..f37ca1313d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -322,8 +322,7 @@ bt_index_check_internal(Oid indrelid, bool parentcheck, bool heapallindexed,
 		bool		heapkeyspace,
 	allequalimage;
 
-		RelationOpenSmgr(indrel);
-		if (!smgrexists(indrel->rd_smgr, MAIN_FORKNUM))
+		if (!smgrexists(RelationGetSmgr(indrel), MAIN_FORKNUM))
 			ereport(ERROR,
 	(errcode(ERRCODE_INDEX_CORRUPTED),
 	 errmsg("index \"%s\" lacks a main relation fork",
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index c34a640d1c..23661d1105 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -177,9 +177,9 @@ blbuildempty(Relation index)
 	 * this even when wal_level=minimal.
 	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
-	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
+	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
 BLOOM_METAPAGE_BLKNO, metapage, true);
 
 	/*
@@ -187,7 +187,7 @@ blbuildempty(Relation index)
 	 * write did not go through shared_buffers and therefore a concurrent
 	 * checkpoint may have moved the redo pointer past our xlog record.
 	 */
-	smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+	smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
 }
 
 /*
diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b3f73ea92d..0289ea657c 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -514,15 +514,13 @@ autoprewarm_database_main(Datum main_arg)
 			old_blk->filenode != blk->filenode ||
 			old_blk->forknum != blk->forknum)
 		{
-			RelationOpenSmgr(rel);
-
 			/*
 			 * smgrexists is not safe for illegal forknum, hence check whether
 			 * the passed forknum is valid before using it in smgrexists.
 			 */
 			if (blk->forknum > InvalidForkNumber &&
 blk->forknum <= MAX_FORKNUM &&
-smgrexists(rel->rd_smgr, blk->forknum))
+smgrexists(RelationGetSmgr(rel), blk->forknum))
 nblocks = RelationGetNumberOfBlocksInFork(rel, blk->forknum);
 			else
 nblocks = 0;
diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 48d0132a0d..0043823974 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -109,8 +109,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		aclcheck_error(aclresult, get_relkind_objtype(rel->rd_rel->relkind), get_rel_name(relOid));
 
 	/* Check that the fork exists. */
-	RelationOpenSmgr(rel);
-	if (!smgrexists(rel->rd_smgr, forkNumber))
+	if (!smgrexists(RelationGetSmgr(rel), forkNumber))
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("fork \"%s\" does not exist for this relation",
@@ -178,7 +177,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		for (block = first_block; block <= last_block; ++block)
 		{
 			CHECK_FOR_INTERRUPTS();
-			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data);
+			smgrread(RelationGetSmgr(rel), forkNumber, block, blockbuffer.data);
 			++blocks_done;
 		}
 	}
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index dd0c124e62..7ccefdb566 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -391,14 +391,14 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
-	RelationOpenSmgr(rel);
-	rel->rd_smgr->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
+	/* Forcibly reset cached file size */
+	RelationGetSmgr(rel)->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
 
 	block = visibilitymap_prepare_truncate(rel, 0);
 	if (BlockNumberIsValid(block))
 	{
 		fork = VISIBILITYMAP_FORKNUM;
-		smgrtruncate(rel->rd_smgr, &fork, 1, &block);
+		smgrtruncate(RelationGetSmgr(rel), &fork, 1, &block);
 	}
 
 	if (RelationNeedsWAL(rel))
diff --git a/src/backend/access/gist/gistbuild.c 

Re: Assertion failure in SnapBuildInitialSnapshot()

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 10:44:18 +0530, Amit Kapila wrote:
> On Wed, Nov 16, 2022 at 11:56 PM Andres Freund  wrote:
> > On 2022-11-16 14:22:01 +0530, Amit Kapila wrote:
> > > On Wed, Nov 16, 2022 at 7:30 AM Andres Freund  wrote:
> > > > On 2022-11-15 16:20:00 +0530, Amit Kapila wrote:
> > > > > On Tue, Nov 15, 2022 at 8:08 AM Andres Freund  
> > > > > wrote:
> > I don't think that'd catch a catalog snapshot. But perhaps the better answer
> > for the catalog snapshot is to just invalidate it explicitly. The user 
> > doesn't
> > have control over the catalog snapshot being taken, and it's not too hard to
> > imagine the walsender code triggering one somewhere.
> >
> > So maybe we should add something like:
> >
> > InvalidateCatalogSnapshot(); /* about to overwrite MyProc->xmin */
> >
> 
> The comment "/* about to overwrite MyProc->xmin */" is unclear to me.
> We already have a check (/* so we don't overwrite the existing value
> */
> if (TransactionIdIsValid(MyProc->xmin))) in SnapBuildInitialSnapshot()
> which ensures that we don't overwrite MyProc->xmin, so the above
> comment seems contradictory to me.

The point is that catalog snapshots could easily end up setting MyProc->xmin,
even though the caller hasn't done anything wrong. So the
InvalidateCatalogSnapshot() would avoid erroring out in a number of scenarios.
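
In code form the suggestion would land near the top of SnapBuildInitialSnapshot(), roughly like this (untested sketch; the error action is paraphrased, and InvalidateCatalogSnapshot() is the existing snapmgr call):

	InvalidateCatalogSnapshot();	/* about to overwrite MyProc->xmin */

	/* so we don't overwrite the existing value */
	if (TransactionIdIsValid(MyProc->xmin))
		elog(ERROR, "cannot build an initial slot snapshot when MyProc->xmin is already set");

	MyProc->xmin = snap->xmin;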

Greetings,

Andres Freund




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-17 Thread Robert Haas
On Thu, Nov 17, 2022 at 11:24 AM Andres Freund  wrote:
> As an experiment, I added a progress report to BufferSync()'s first loop
> (i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
> increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
> remove the changecount stuff and use a single write + write barrier, it ends
> up as 250ms. Inlining brings it down a bit further, to 247ms.

OK, I'd say that's pretty good evidence that we can't totally
disregard the issue.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-17 Thread Tom Lane
Andres Freund  writes:
> I think pgstat_progress_start_command() needs the changecount stuff, as does
> pgstat_progress_update_multi_param(). But for anything updating a single
> parameter at a time it really doesn't do anything useful on a platform that
> doesn't tear 64bit writes (so it could be #ifdef
> PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).

Seems safe to restrict it to that case.
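
(For reference, a sketch of the kind of single-parameter fast path being discussed, mirroring the shape of the existing pgstat_progress_update_param(); illustrative only, not committed code:)

void
pgstat_progress_update_param(int index, int64 val)
{
	volatile PgBackendStatus *beentry = MyBEEntry;

	if (!beentry || !pgstat_track_activities)
		return;

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
	/* a single aligned 8-byte store cannot be observed torn */
	beentry->st_progress_param[index] = val;
#else
	PGSTAT_BEGIN_WRITE_ACTIVITY(beentry);
	beentry->st_progress_param[index] = val;
	PGSTAT_END_WRITE_ACTIVITY(beentry);
#endif
}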

regards, tom lane




Re: logical decoding and replication of sequences, take 2

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 12:39:49 +0100, Tomas Vondra wrote:
> On 11/17/22 03:43, Andres Freund wrote:
> > On 2022-11-17 02:41:14 +0100, Tomas Vondra wrote:
> >> Well, yeah - we can either try to perform the stuff independently of the
> >> transactions that triggered it, or we can try making it part of some of
> >> the transactions. Each of those options has problems, though :-(
> >>
> >> The first version of the patch tried the first approach, i.e. decode the
> >> increments and apply that independently. But:
> >>
> >>   (a) What would you do with increments of sequences created/reset in a
> >>   transaction? Can't apply those outside the transaction, because it
> >>   might be rolled back (and that state is not visible on primary).
> >
> > I think a reasonable approach could be to actually perform different WAL
> > logging for that case. It'll require a bit of machinery, but could actually
> > result in *less* WAL logging overall, because we don't need to emit a WAL
> > record for each SEQ_LOG_VALS sequence values.
> >
>
> Could you elaborate? Hard to comment without knowing more ...
>
> My point was that stuff like this (creating a new sequence or at least a
> new relfilenode) means we can't apply that independently of the
> transaction (unlike regular increments). I'm not sure how a change to
> WAL logging would make that go away.

Different WAL logging would make it easy to handle that on the logical
decoding level. We don't need to emit WAL records each time a
created-in-this-toplevel-xact sequences gets incremented as they're not
persisting anyway if the surrounding xact aborts. We already need to remember
the filenode so it can be dropped at the end of the transaction, so we could
emit a single record for each sequence at that point.


> >>   (b) What about increments created before we have a proper snapshot?
> >>   There may be transactions dependent on the increment. This is what
> >>   ultimately led to revert of the patch.
> >
> > I don't understand this - why would we ever need to process those increments
> > from before we have a snapshot?  Wouldn't they, by definition, be before the
> > slot was active?
> >
> > To me this is the rough equivalent of logical decoding not giving the 
> > initial
> > state of all tables. You need some process outside of logical decoding to 
> > get
> > that (obviously we have some support for that via the exported data snapshot
> > during slot creation).
> >
>
> Which is what already happens during tablesync, no? We more or less copy
> sequences as if they were tables.

I think you might have to copy sequences after tables, but I'm not sure. But
otherwise, yea.


> > I assume that part of the initial sync would have to be a new sequence
> > synchronization step that reads all the sequence states on the publisher and
> > ensures that the subscriber sequences are at the same point. There's a bit 
> > of
> > trickiness there, but it seems entirely doable. The logical replication 
> > replay
> > support for sequences will have to be a bit careful about not decreasing the
> > subscriber's sequence values - the standby initially will be ahead of the
> > increments we'll see in the WAL. But that seems inevitable given the
> > non-transactional nature of sequences.
> >
>
> See fetch_sequence_data / copy_sequence in the patch. The bit about
> ensuring the sequence does not go away (say, using page LSN and/or LSN
> of the increment) is not there, however isn't that pretty much what I
> proposed doing for "reconciling" the sequence state logged at COMMIT?

Well, I think the approach of logging all sequence increments at commit is the
wrong idea...

Creating a new relfilenode whenever a sequence is incremented seems like a
complete no-go to me. That increases sequence overhead by several orders of
magnitude and will lead to *awful* catalog bloat on the subscriber.


> >
> >> This version of the patch tries to do the opposite thing - make sure
> >> that the state after each commit matches what the transaction might have
> >> seen (for sequences it accessed). It's imperfect, because it might log a
> >> state generated "after" the sequence got accessed - it focuses on the
> >> guarantee not to generate duplicate values.
> >
> > That approach seems quite wrong to me.
> >
>
> Why? Because it might log a state for sequence as of COMMIT, when the
> transaction accessed the sequence much earlier?

Mainly because sequences aren't transactional and trying to make them will
require awful contortions.

While there are cases where we don't flush the WAL / wait for syncrep for
sequences, we do replicate their state correctly on physical replication. If
an LSN has been acknowledged as having been replicated, we won't just lose a
prior sequence increment after promotion, even if the transaction didn't [yet]
commit.

It's completely valid for an application to call nextval() in one transaction,
potentially even abort it, and then only use that sequence value in another

Re: SUBTRANS: Minimizing calls to SubTransSetParent()

2022-11-17 Thread Simon Riggs
On Wed, 16 Nov 2022 at 15:44, Tom Lane  wrote:
>
> Simon Riggs  writes:
> > On Wed, 16 Nov 2022 at 00:09, Tom Lane  wrote:
> >> No, that's not what XidInMVCCSnapshot does. If snapshot->suboverflowed
> >> is set (ie, somebody somewhere/somewhen overflowed), then it does
> >> SubTransGetTopmostTransaction and searches only the xips with the result.
> >> This behavior requires that all live subxids be correctly mapped by
> >> SubTransGetTopmostTransaction, or we'll draw false conclusions.
>
> > Your comments are correct wrt to the existing coding, but not to the
> > patch, which is coded as described and does not suffer those issues.
>
> Ah, OK.
>
> Still ... I really do not like this patch.  It introduces a number of
> extremely fragile assumptions, and I think those would come back to
> bite us someday, even if they hold now which I'm still unsure about.

Completely understand. It took me months to think this through.

> It doesn't help that you've chosen to document them only at the place
> making them and not at the place(s) likely to break them.

Yes, apologies for that, I focused on the holistic explanation in the slides.

> Also, to be blunt, this is not Ready For Committer.  It's more WIP,
> because even if the code is okay there are comments all over the system
> that you've invalidated.  (At the very least, the file header comments
> in subtrans.c and the comments in struct SnapshotData need work; I've
> not looked hard but I'm sure there are more places with comments
> bearing on these data structures.)

New version with greatly improved comments coming very soon.

> Perhaps it would be a good idea to split up the patch.  The business
> about making pg_subtrans flat rather than a tree seems like a good
> idea in any event, although as I said it doesn't seem like we've got
> a fleshed-out version of that here.  We could push forward on getting
> that done and then separately consider the rest of it.

Yes, I thought you might ask that so, after some thought, have found a
clean way to do that and have split this into two parts.

Julien has agreed to do further perf tests and is working on that now.

I will post new versions soon, earliest tomorrow.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: [BUG] pg_dump blocked

2022-11-17 Thread Gilles Darold

On 17/11/2022 at 17:59, Tom Lane wrote:

Gilles Darold  writes:

I am seeing incorrect behavior with pg_dump prior to PG version 15. With
PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173,
the problem is gone, but for older versions it persists with locks on
partitioned tables.

I didn't want to back-patch e3fcbbd62 at the time, but it's probably aged
long enough now to be safe to back-patch.  If we do anything here,
it should be to back-patch the whole thing, else we've only partially
fixed the issue.



I can handle this work.

--
Gilles Darold





Re: Standardizing how pg_waldump presents recovery conflict XID cutoffs

2022-11-17 Thread Peter Geoghegan
On Wed, Nov 16, 2022 at 4:34 PM Peter Geoghegan  wrote:
> WFM.

Attached is v3.

Plan is to commit this later on today, barring objections.

Thanks
-- 
Peter Geoghegan


v3-0001-Standardize-rmgrdesc-recovery-conflict-XID-output.patch
Description: Binary data


Re: [BUG] pg_dump blocked

2022-11-17 Thread Tom Lane
Gilles Darold  writes:
> I have an incorrect behavior with pg_dump prior PG version 15. With 
> PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173 
> the problem is gone but for older versions it persists with locks on 
> partitioned tables.

I didn't want to back-patch e3fcbbd62 at the time, but it's probably aged
long enough now to be safe to back-patch.  If we do anything here,
it should be to back-patch the whole thing, else we've only partially
fixed the issue.

regards, tom lane




[BUG] pg_dump blocked

2022-11-17 Thread Gilles Darold

Hi,


I am seeing incorrect behavior with pg_dump prior to PG version 15. With 
PostgreSQL 15, thanks to commit e3fcbbd623b9ccc16cdbda374654d91a4727d173, 
the problem is gone, but for older versions it persists with locks on 
partitioned tables.



When we try to dump a database where a table is locked, pg_dump waits 
until the lock is released; this is expected. Now if the table is 
excluded from the dump using the -T option, pg_dump should obviously not 
be affected by the lock. Unfortunately this is not the case when the table 
is partitioned, because of the calls to pg_get_partkeydef() and pg_get_expr() 
in the query generated in getTables().  Here is the use case to reproduce.



In a psql session execute:

   BEGIN;

   LOCK TABLE measurement;

then run a pg_dump command excluding the measurement partitions:

    pg_dump -d test -T "measurement*" > /dev/null

it will not end until the lock on the partition is released.

I think the problem is the same if we use a schema exclusion where the 
partitioned table is locked.



Is it possible to consider a backport fix? If yes, would adding the 
table/schema filters to the query generated in getTables() be enough, or 
are you thinking of a kind of backport of commit 
e3fcbbd623b9ccc16cdbda374654d91a4727d173?



Best regards,

--
Gilles Darold


Re: Moving forward with TDE

2022-11-17 Thread David Christensen
Hi Dilip,

Thanks for the feedback here. I will review the docs changes and add to my tree.

Best,

David




Re: [PATCH] Teach pg_waldump to extract FPIs from the WAL

2022-11-17 Thread David Christensen
On Wed, Nov 16, 2022 at 3:30 AM Bharath Rupireddy
 wrote:
> 1.
> -if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
> +if (config.filter_by_fpw && !XLogRecordHasFPI(xlogreader_state))
> These changes are not related to this feature, hence renaming those
> variables/function names must be dealt with separately. If required,
> proposing another patch can be submitted to change filter_by_fpw to
> filter_by_fpi and XLogRecordHasFPW() to XLogRecordHasFPI().

Not required; can revert the changes unrelated to this specific patch.
(I'd written the original ones for it, so didn't really think anything
of it... :-))

> 2.
> +/* We fsync our output directory only; since these files are not part
> + * of the production database we do not require the performance hit
> + * that fsyncing every FPI would entail, so are doing this as a
> + * compromise. */
> The commenting style doesn't match the standard that we follow
> elsewhere in postgres, please refer to other multi-line comments.

Will fix.

> 3.
> +fsync_fname(config.save_fpi_path, true);
> +}
> It looks like fsync_fname()/fsync() in general isn't recursive, in the
> sense that it doesn't fsync the files under the directory, but the
> directory only. So, the idea of directory fsync doesn't seem worth it.
> We either 1) get rid of fsync entirely or 2) fsync all the files after
> they are created and the directory at the end or 3) do option (2) with
> --no-sync option similar to its friends. Since option (2) is a no go,
> we can either choose option (1) or option (2). My vote at this point
> is for option (1).

Agree to remove.

> 4.
> +($walfile_name, $blocksize) = split '\|' =>
> $node->safe_psql('postgres',"SELECT pg_walfile_name(pg_switch_wal()),
> current_setting('block_size')");
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> I think there's something wrong with this, no? pg_switch_wal() can, at
> times, return end+1 of the prior segment (see below snippet) and I'm
> not sure if such a case can happen here.
>
>  * The return value is either the end+1 address of the switch record,
>  * or the end+1 address of the prior segment if we did not need to
>  * write a switch record because we are already at segment start.
>  */
> XLogRecPtr
> RequestXLogSwitch(bool mark_unimportant)

I think this approach is pretty common to get the walfile name, no?
While there might be an edge case here, since the rest of the test is
a controlled environment I'm inclined to just not worry about it; this
would require the changes prior to this to exactly fill a WAL segment
which strikes me as extremely unlikely to the point of impossible in
this specific scenario.

> 5.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> +ok(-f $walfile, "Got a WAL file");
> Is this checking if the WAL file is present or not in PGDATA/pg_wal?
> If yes, I think this isn't required as pg_switch_wal() ensures that
> the WAL is written and flushed to disk.

You are correct, probably another artifact of the earlier version.
That said, not sure I see the harm in keeping it as a sanity-check.

> 6.
> +my $walfile = $node->basedir . '/pgdata/pg_wal/' . $walfile_name;
> Isn't "pgdata" hardcoded here? I think you might need to do the following:
> $node->data_dir . '/pg_wal/' . $walfile_name;;

Can fix.

> 7.
> +# save filename for later verification
> +$files{$file}++;
>
> +# validate that we ended up with some FPIs saved
> +ok(keys %files > 0, 'verify we processed some files');
> Why do we need to store filenames in an array when we later just check
> the size of the array? Can't we use a boolean (file_found) or an int
> variable (file_count) to verify that we found the file.

Another artifact; we were comparing the files output between two
separate lists of arbitrary numbers of pages being written out and
verifying the raw/fixup versions had the same lists.

> 8.
> +$node->safe_psql('postgres', <<EOF);
> +SELECT 'init' FROM
> pg_create_physical_replication_slot('regress_pg_waldump_slot', true,
> false);
> +CREATE TABLE test_table AS SELECT generate_series(1,100) a;
> +CHECKPOINT; -- required to force FPI for next writes
> +UPDATE test_table SET a = a + 1;
> +EOF
> The EOF with append_conf() is being used in 4 files and elsewhere in
> the TAP test files (more than 100?) qq[] or quotes is being used. I
> have no strong opinion here, I'll leave it to the other reviewers or
> committer.

I'm inclined to leave it just for (personal) readability, but can
change if there's a strong consensus against.

> > > 11.
> > > +# verify filename formats matches w/--save-fpi
> > > +for my $fullpath (glob "$tmp_folder/raw/*")
> > > Do we need to look for the exact match of the file that gets created
> > > in the save-fpi path? While checking for this is great, it makes the
> > > test code non-portable (may not work on Windows or other platforms,
> > > no?) and complex? This way, you can get rid of 

Re: allow segment size to be set to < 1GiB

2022-11-17 Thread Andres Freund
On 2022-11-17 10:48:52 -0500, Tom Lane wrote:
> Yeah, I think we should allow setting it to 1 block.  This switch is
> only for testing purposes (I hope the docs make that clear).

"This option is only for developers, to test segment related code."




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 09:03:32 -0500, Robert Haas wrote:
> On Wed, Nov 16, 2022 at 2:52 PM Andres Freund  wrote:
> > I don't think we'd want every buffer write or whatnot go through the
> > changecount mechanism, on some non-x86 platforms that could be noticeable. 
> > But
> > if we didn't stage the stats updates locally I think we could make most of 
> > the
> > stats changes without that overhead.  For updates that just increment a 
> > single
> > counter there's simply no benefit in the changecount mechanism afaict.
>
> You might be right, but I'm not sure whether it's worth stressing
> about. The progress reporting mechanism uses the st_changecount
> mechanism, too, and as far as I know nobody's complained about that
> having too much overhead. Maybe they have, though, and I've just
> missed it.

I've seen it in profiles, although not as the major contributor. Most places
do a reasonable amount of work between calls though.

As an experiment, I added a progress report to BufferSync()'s first loop
(i.e. where it checks all buffers). On a 128GB shared_buffers cluster that
increases the time for a do-nothing checkpoint from ~235ms to ~280ms. If I
remove the changecount stuff and use a single write + write barrier, it ends
up as 250ms. Inlining brings it down a bit further, to 247ms.

Obviously this is a very extreme case - we only do very little work between
the progress report calls. But it does seem to show that the overhead is not
entirely negligible.


I think pgstat_progress_start_command() needs the changecount stuff, as does
pgstat_progress_update_multi_param(). But for anything updating a single
parameter at a time it really doesn't do anything useful on a platform that
doesn't tear 64bit writes (so it could be #ifdef
PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY).
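
A minimal sketch of what such a fast path could look like inside
backend_progress.c (the function name and exact placement are illustrative,
not from an actual patch):

    /*
     * Hypothetical variant of pgstat_progress_update_param() that skips the
     * st_changecount protocol where an aligned 64-bit store cannot be torn.
     * Readers may see a slightly stale value, but never a torn one.
     */
    static inline void
    pgstat_progress_update_param_unlocked(int index, int64 val)
    {
        volatile PgBackendStatus *beentry = MyBEEntry;

        if (!beentry || !pgstat_track_activities)
            return;

    #ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
        beentry->st_progress_param[index] = val;
        pg_write_barrier();
    #else
        /* fall back to the existing changecount-protected path */
        pgstat_progress_update_param(index, val);
    #endif
    }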


Out of further curiosity I wanted to test the impact when the loop doesn't
even do a LockBufHdr() and added an unlocked pre-check. 109ms without
progress. 138ms with. 114ms with the simplified
pgstat_progress_update_param(). 108ms after inlining the simplified
pgstat_progress_update_param().

Greetings,

Andres Freund




Re: CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-17 Thread Tom Lane
Spyridon Dimitrios Agathos  writes:
> while testing the developer settings of PSQL (14.5) I came across this
> issue:

> postgres=# CREATE UNLOGGED TABLE stats (
> postgres(# pg_hash BIGINT NOT NULL,
> postgres(# category TEXT NOT NULL,
> postgres(# PRIMARY KEY (pg_hash, category)
> postgres(# );
> server closed the connection unexpectedly

Hmm ... confirmed in the v14 branch, but v15 and HEAD are fine,
evidently as a result of commit f10f0ae42 having replaced this
unprotected use of index->rd_smgr.

I wonder whether we ought to back-patch f10f0ae42.  We could
leave the RelationOpenSmgr macro in existence to avoid unnecessary
breakage of extension code, but stop using it within our own code.
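
For readers following along, the shape of the accessor that commit introduced
is roughly this (sketched from the commit; the in-tree version lives in
utils/rel.h and may differ in detail):

    /*
     * Re-open the storage manager handle if a relcache flush (e.g. under
     * debug_discard_caches=1) cleared rd_smgr, so callers never dereference
     * a stale pointer the way btbuildempty() did in the report above.
     */
    static inline SMgrRelation
    RelationGetSmgr(Relation rel)
    {
        if (unlikely(rel->rd_smgr == NULL))
            smgrsetowner(&(rel->rd_smgr),
                         smgropen(rel->rd_node, rel->rd_backend));
        return rel->rd_smgr;
    }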

regards, tom lane




Re: logical decoding and replication of sequences, take 2

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 8:41 PM Tomas Vondra
 wrote:
> There's a couple of caveats, though:
>
> 1) Maybe we should focus more on "actually observed" state instead of
> "observable". Who cares if the sequence moved forward in a transaction
> that was ultimately rolled back? No committed transaction should have
> observer those values - in a way, the last "valid" state of the sequence
> is the last value generated in a transaction that ultimately committed.

When I say "observable" I mean from a separate transaction, not one
that is making changes to things.

I said "observable" rather than "actually observed" because we neither
know nor care whether someone actually ran a SELECT statement at any
given moment in time, just what they would have seen if they did.

> 2) I think what matters more is that we never generate duplicate value.
> That is, if you generate a value from a sequence, commit a transaction
> and replicate it, then the logical standby should not generate the same
> value from the sequence. This guarantee seems necessary for "failover"
> to logical standby.

I think that matters, but I don't think it's sufficient. We need to
preserve the order in which things appear to happen, and which changes
are and are not atomic, not just the final result.

> Well, yeah - we can either try to perform the stuff independently of the
> transactions that triggered it, or we can try making it part of some of
> the transactions. Each of those options has problems, though :-(
>
> The first version of the patch tried the first approach, i.e. decode the
> increments and apply that independently. But:
>
>   (a) What would you do with increments of sequences created/reset in a
>   transaction? Can't apply those outside the transaction, because it
>   might be rolled back (and that state is not visible on primary).

If the state isn't going to be visible until the transaction commits,
it has to be replicated as part of the transaction. If I create a
sequence and then nextval it a bunch of times, I can't replicate that
by first creating the sequence, and then later, as a separate
operation, replicating the nextvals. If I do that, then there's an
intermediate state visible on the replica that was never visible on
the origin server. That's broken.
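
To make that concrete, a sketch (not taken from the patch or the thread):

    -- On the origin, everything below becomes visible atomically at COMMIT:
    BEGIN;
    CREATE SEQUENCE s;
    SELECT nextval('s');   -- 1
    SELECT nextval('s');   -- 2
    COMMIT;

    -- A replay that first creates the sequence and only later applies the
    -- nextval-driven state would expose an intermediate state (the sequence
    -- existing but not yet advanced) that no snapshot on the origin could
    -- ever have seen.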

>   (b) What about increments created before we have a proper snapshot?
>   There may be transactions dependent on the increment. This is what
>   ultimately led to revert of the patch.

Whatever problem exists here is with the implementation, not the
concept. If you copy the initial state as it exists at some moment in
time to a replica, and then replicate all the changes that happen
afterward to that replica without messing up the order, the replica
WILL be in sync with the origin server. The things that happen before
you copy the initial state do not and cannot matter.

But what you're describing sounds like the changes aren't really
replicated in visibility order, and then it is easy to see how a
problem like this can happen. Because now, an operation that actually
became visible just before or just after the initial copy was taken
might be thought to belong on the other side of that boundary, and
then everything will break. And it sounds like that is what you are
describing.

> This version of the patch tries to do the opposite thing - make sure
> that the state after each commit matches what the transaction might have
> seen (for sequences it accessed). It's imperfect, because it might log a
> state generated "after" the sequence got accessed - it focuses on the
> guarantee not to generate duplicate values.

Like Andres, I just can't imagine this being correct. It feels like
it's trying to paper over the failure to do the replication properly
during the transaction by overwriting state at the end.

> Yes, this would mean we accept we may end up with something like this:
>
> 1: T1 logs sequence state S1
> 2: someone increments sequence
> 3: T2 logs sequence state S2
> 4: T2 commits
> 5: T1 commits
>
> which "inverts" the apply order of S1 vs. S2, because we first apply S2
> and then the "old" S1. But as long as we're smart enough to "discard"
> applying S1, I think that's acceptable - because it guarantees we'll not
> generate duplicate values (with values in the committed transaction).
>
> I'd also argue it does not actually generate invalid state, because once
> we commit either transaction, S2 is what's visible.

I agree that it's OK if the sequence increment gets merged into the
commit that immediately follows. However, I disagree with the idea of
discarding the second update on the grounds that it would make the
sequence go backward and we know that can't be right. That algorithm
works in the really specific case where the only operations are
increments. As soon as anyone does anything else to the sequence, such
an algorithm can no longer work. Nor can it work for objects that are
not sequences. The alternative strategy of replicating each change

Re: HOT chain validation in verify_heapam()

2022-11-17 Thread Himanshu Upadhyaya
On Tue, Nov 15, 2022 at 3:32 AM Andres Freund  wrote:

>
> > Furthermore, it is
> > possible that successor[x] = successor[x'] since the page might be
> corrupted
> > and we haven't checked otherwise.
> >
> > predecessor[y] = x means that successor[x] = y but in addition we've
> > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > tuples for which these conditions hold, we've issued complaints about
> > all but one and entered the last into the predecessor array.
>
> As shown by the isolationtester test I just posted, this doesn't quite work
> right now. Probably fixable.
>
> I don't think we can follow non-HOT ctid chains if they're older than the
> xmin
> horizon, including all cases of xmin being frozen. There's just nothing
> guaranteeing that the tuples are actually "related".
>
I understand the problem with frozen tuples, but I don't understand the
concern with non-HOT chains; could you please help with some explanation
around it?


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: Moving forward with TDE

2022-11-17 Thread David Christensen
Hi Jacob,

Thanks, I've added this patch in my tree [1]. (For now, just adding
fixes and the like atop the original separate patches, but will
eventually get things winnowed down into probably the same 12 parts
the originals were reviewed in.)

Best,

David

[1] https://github.com/pgguru/postgres/tree/tde




Re: HOT chain validation in verify_heapam()

2022-11-17 Thread Himanshu Upadhyaya
On Wed, Nov 16, 2022 at 12:41 PM Himanshu Upadhyaya <
upadhyaya.himan...@gmail.com> wrote:

>
>
>> > + }
>> > +
>> > + /* Loop over offsets and validate the data in the
>> predecessor array. */
>> > + for (OffsetNumber currentoffnum = FirstOffsetNumber;
>> currentoffnum <= maxoff;
>> > +  currentoffnum = OffsetNumberNext(currentoffnum))
>> > + {
>> > + HeapTupleHeader pred_htup;
>> > + HeapTupleHeader curr_htup;
>> > + TransactionId pred_xmin;
>> > + TransactionId curr_xmin;
>> > + ItemId  pred_lp;
>> > + ItemId  curr_lp;
>> > +
>> > + ctx.offnum = predecessor[currentoffnum];
>> > + ctx.attnum = -1;
>> > +
>> > + if (ctx.offnum == 0)
>> > + {
>> > + /*
>> > +  * Either the root of the chain or an
>> xmin-aborted tuple from
>> > +  * an abandoned portion of the HOT chain.
>> > +  */
>>
>> Hm - couldn't we check that the tuple could conceivably be at the root of
>> a
>> chain? I.e. isn't HEAP_HOT_UPDATED? Or alternatively has an aborted xmin?
>>
>>
>  I don't see a way to check if tuple is at the root of HOT chain because
> predecessor array will always be having either xmin from non-abandoned
> transaction or it will be zero. We can't differentiate root or tuple
> inserted via abandoned transaction.
>
I was wrong here. I think this can be done, and I will be making these changes
in my next patch.


-- 
Regards,
Himanshu Upadhyaya
EnterpriseDB: http://www.enterprisedb.com


Re: redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Tom Lane
Japin Li  writes:
> On Thu, 17 Nov 2022 at 23:06, Japin Li  wrote:
>> I think we cannot remove the check. For example, if objtype is
>> OBJECT_OPFAMILY and schema_does_not_exist_skipping() returns true, then msg
>> stays NULL; if we remove this check, a segfault might occur in ereport().

> Sorry, I didn't look into schema_does_not_exist_skipping() before. After
> looking into schema_does_not_exist_skipping() and the other helpers, all
> paths that leave the switch branch have a non-NULL msg, so we can remove
> this check safely.

This is a completely bad idea.  If it takes that level of analysis
to see that msg can't be null, we should leave the test in place.
Any future modification of either this code or what it calls could
break the conclusion.

regards, tom lane




Re: allow segment size to be set to < 1GiB

2022-11-17 Thread Tom Lane
Andres Freund  writes:
> On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote:
>> Are we going to impose some sane minimum, or leave it up to developers
>> to discover that for themselves?

> I don't think we should. It's actually useful to e.g. use 1 page sized
> segments for testing, and with one exception the tests pass with it too. Do
> you see a reason to impose one?

Yeah, I think we should allow setting it to 1 block.  This switch is
only for testing purposes (I hope the docs make that clear).

regards, tom lane




Re: [PoC] configurable out of disk space elog level

2022-11-17 Thread Andres Freund
On 2022-11-17 14:40:02 +0300, Maxim Orlov wrote:
> On Wed, 16 Nov 2022 at 20:41, Andres Freund  wrote:
> > that aside, we can't do catalog accesses in all kinds of environments that
> > this currently is active in - most importantly it's affecting the startup
> > process. We don't do catalog accesses in the startup process, and even if
> > we
> > were to do so, we couldn't unconditionally because the catalog might not
> > even
> > be consistent at this point (nor is it guaranteed that the wal_level even
> > allows to access catalogs during recovery).
> >
Yep, that is why I use this in get_tablespace_elevel:
> +   /*
> +* Use GUC level only in normal mode.
> +*/
> +   if (!IsNormalProcessingMode())
> +   return ERROR;
> 
> Anyway, I appreciate the opinion, thank you!

The startup process is in normal processing mode.




Re: allow segment size to be set to < 1GiB

2022-11-17 Thread Andres Freund
Hi,

On 2022-11-17 09:58:48 -0500, Andrew Dunstan wrote:
> On 2022-11-09 We 15:25, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-09 14:44:42 -0500, Tom Lane wrote:
> >> Andres Freund  writes:
> >>> A second question: Both autoconf and meson print the segment size as GB 
> >>> right
> >>> now. Obviously that'll print out a size of 0 for a segsize < 1GB.
> >>> The easiest way would be to just display the number of blocks, but 
> >>> that's
> >>> not particularly nice.
> >> Well, it would be fine if you'd written --with-segsize-blocks, wouldn't
> >> it?  Can we make the printout format depend on which switch was used?
> > Not sure why I didn't think of that...
> >
> > Updated patch attached.
> >
> > I made one autoconf and one meson CI task use a small block size, but just 
> > to
> > ensure it work on both. I'd probably leave it set on one, so we keep the
> > coverage for cfbot?
> >
> 
> Are we going to impose some sane minimum, or leave it up to developers
> to discover that for themselves?

I don't think we should. It's actually useful to e.g. use 1 page sized
segments for testing, and with one exception the tests pass with it too. Do
you see a reason to impose one?

Greetings,

Andres Freund




Re: Introduce "log_connection_stages" setting.

2022-11-17 Thread Justin Pryzby
On Tue, Nov 08, 2022 at 07:30:38PM +0100, Sergey Dudoladov wrote:
> +Logs reception of a connection. At this point a connection 
> has been received, but no further work has been done:

receipt

> +Logs the original identity that an authentication method 
> employs to identify a user. In most cases, the identity string equals the 
> PostgreSQL username,

s/equals/matches

> +/* check_hook: validate new log_connection_messages value */
> +bool
> +check_log_connection_messages(char **newval, void **extra, GucSource source)
> +{
> + char*rawname;
> + List*namelist;
> + ListCell*l;
> + char*log_connection_messages = *newval;
> + bool*myextra;
> +
> + /*
> +  * Set up the "extra" struct actually used by 
> assign_log_connection_messages.
> +  */
> + myextra = (bool *) guc_malloc(LOG, 4 * sizeof(bool));

This function hardcodes each of the 4 connections:

> + if (pg_strcasecmp(stage, "received") == 0)
> + myextra[0] = true;

It'd be better to use #defines or enums for these.

> --- a/src/backend/tcop/postgres.c
> +++ b/src/backend/tcop/postgres.c
> @@ -84,8 +84,11 @@ const char *debug_query_string; /* client-supplied query 
> string */
>  /* Note: whereToSendOutput is initialized for the bootstrap/standalone case 
> */
>  CommandDest whereToSendOutput = DestDebug;
>  
> -/* flag for logging end of session */
> -bool Log_disconnections = false;
> +/* flags for logging information about session state */
> +bool Log_disconnected = false;
> +bool Log_authenticated = false;
> +bool Log_authorized = false;
> +bool Log_received = false;

I think this ought to be an integer with flag bits, rather than 4
booleans (I don't know, but there might be more later?).  Then, the
implementation follows the user-facing GUC and also follows
log_destination.
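
A sketch of that suggestion (all names here are made up for illustration, not
taken from the patch):

    /* hypothetical bit values for the stages named by the GUC */
    #define LOG_CONNECTION_RECEIVED       (1 << 0)
    #define LOG_CONNECTION_AUTHENTICATED  (1 << 1)
    #define LOG_CONNECTION_AUTHORIZED     (1 << 2)
    #define LOG_CONNECTION_DISCONNECTED   (1 << 3)

    /* one flag word replaces the four booleans */
    int         Log_connection_messages = 0;

    /* example check at a call site */
    if (Log_connection_messages & LOG_CONNECTION_RECEIVED)
        ereport(LOG, (errmsg("connection received")));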

-- 
Justin




Re: redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Japin Li


On Thu, 17 Nov 2022 at 23:06, Japin Li  wrote:
> On Thu, 17 Nov 2022 at 20:12, Ted Yu  wrote:
>> Hi,
>> I was looking at commit aca992040951c7665f1701cd25d48808eda7a809
>>
>> I think the check of msg after the switch statement is not necessary. The
>> variable msg is used afterward.
>> If there is (potential) missing case in switch statement, the compiler
>> would warn.
>>
>> How about removing the check ?
>>
>
> I think we cannot remove the check. For example, if objtype is
> OBJECT_OPFAMILY and schema_does_not_exist_skipping() returns true, then msg
> stays NULL; if we remove this check, a segfault might occur in ereport().
>
> case OBJECT_OPFAMILY:
> {
> List   *opfname = list_copy_tail(castNode(List, object), 
> 1);
>
> if (!schema_does_not_exist_skipping(opfname, &msg, &name))
> {
> msg = gettext_noop("operator family \"%s\" does not exist 
> for access method \"%s\", skipping");
> name = NameListToString(opfname);
> args = strVal(linitial(castNode(List, object)));
> }
> }
> break;

Sorry, I didn't look into schema_does_not_exist_skipping() before. After
looking into schema_does_not_exist_skipping() and the other helpers, all paths
that leave the switch branch have a non-NULL msg, so we can remove this check
safely.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: libpq compression (part 2)

2022-11-17 Thread Peter Eisentraut

On 17.11.22 01:27, Andrey Borodin wrote:

Also I've found one more TODO item for the patch. Currently in
fe-connect.c the patch does a buffer reset:
conn->inStart = conn->inCursor = conn->inEnd = 0;
This effectively consumes bytes up to the current cursor. However, packet
inspection continues. The patch works because in most cases the
following code will issue a re-read of the message. Coincidentally.
/* Get the type of request. */
if (pqGetInt((int *) &msgLength, 4, conn))
{
/* We'll come back when there are more data */
return PGRES_POLLING_READING;
}


Note that the above code was just changed in dce92e59b1.  I don't know 
how that affects this patch set.






Re: redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Japin Li


On Thu, 17 Nov 2022 at 20:12, Ted Yu  wrote:
> Hi,
> I was looking at commit aca992040951c7665f1701cd25d48808eda7a809
>
> I think the check of msg after the switch statement is not necessary. The
> variable msg is used afterward.
> If there is (potential) missing case in switch statement, the compiler
> would warn.
>
> How about removing the check ?
>

I think we cannot remove the check. For example, if objtype is OBJECT_OPFAMILY
and schema_does_not_exist_skipping() returns true, then msg stays NULL; if we
remove this check, a segfault might occur in ereport().

case OBJECT_OPFAMILY:
{
List   *opfname = list_copy_tail(castNode(List, object), 1);

if (!schema_does_not_exist_skipping(opfname, &msg, &name))
{
msg = gettext_noop("operator family \"%s\" does not exist 
for access method \"%s\", skipping");
name = NameListToString(opfname);
args = strVal(linitial(castNode(List, object)));
}
}
break;

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




contrib: auth_delay module

2022-11-17 Thread 成之焕

The attached patch is a contrib module to set login restrictions on users with 
too many authentication failures. The administrator can manage several GUC 
parameters to control the login restrictions, which are listed below.
- set the wait time when password authentication fails.
- allow the wait time to grow when users from the same IP consecutively fail 
to log on.
- set the maximum number of authentication failures allowed for the same user. 
The system will prevent a user who gets too many authentication failures from 
entering the database.


I hope this will be useful to future development.
Thanks.
-


zhch...@ceresdata.com


pgsql-v12.8-auth-delay.1.patch
Description: Binary data


Re: libpq support for NegotiateProtocolVersion

2022-11-17 Thread Peter Eisentraut

On 16.11.22 19:35, Jacob Champion wrote:

On Tue, Nov 15, 2022 at 2:19 AM Peter Eisentraut
 wrote:

I think for the current code, the following would be an appropriate
adjustment:

diff --git a/src/interfaces/libpq/fe-connect.c
b/src/interfaces/libpq/fe-connect.c
index 746e9b4f1efc..d15fb96572d9 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3412,8 +3412,7 @@ PQconnectPoll(PGconn *conn)
  /* Get the type of request. */
  if (pqGetInt((int *) &msgLength, 4, conn))
  {
-   /* We'll come back when there are more data */
-   return PGRES_POLLING_READING;
+   goto error_return;
  }
  msgLength -= 4;

And then the handling of the 'v' message in my patch would also be
adjusted like that.


Yes -- though that particular example may be dead code, since we
should have already checked that there are at least four more bytes in
the buffer.


I have committed this change and the adjusted original patch.  Thanks.





Re: allow segment size to be set to < 1GiB

2022-11-17 Thread Andrew Dunstan


On 2022-11-09 We 15:25, Andres Freund wrote:
> Hi,
>
> On 2022-11-09 14:44:42 -0500, Tom Lane wrote:
>> Andres Freund  writes:
>>> A second question: Both autoconf and meson print the segment size as GB 
>>> right
>>> now. Obviously that'll print out a size of 0 for a segsize < 1GB.
>>> The easiest way would be to just display the number of blocks, but that's
>>> not particularly nice.
>> Well, it would be fine if you'd written --with-segsize-blocks, wouldn't
>> it?  Can we make the printout format depend on which switch was used?
> Not sure why I didn't think of that...
>
> Updated patch attached.
>
> I made one autoconf and one meson CI task use a small block size, but just to
> ensure it works on both. I'd probably leave it set on one, so we keep the
> coverage for cfbot?
>

Are we going to impose some sane minimum, or leave it up to developers
to discover that for themselves?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Hash index build performance tweak from sorting

2022-11-17 Thread Tomas Vondra
Hi,

I did some simple benchmark with v2 and v3, using the attached script,
which essentially just builds hash index on random data, with different
data types and maintenance_work_mem values. And what I see is this
(median of 10 runs):

machine   data type  m_w_mmasterv2v3

 i5  bigint  128MB  9652  9402  9669
  32MB  9545  9291  9535
   4MB  9599  9371  9741
int  128MB  9666  9475  9676
  32MB  9530  9347  9528
   4MB  9595  9394  9624
   text  128MB  9755  9596  9897
  32MB  9711  9547  9846
   4MB  9808  9744 10024
   xeon  bigint  128MB 10790 10555 10812
  32MB 10690 10373 10579
   4MB 10682 10351 10650
int  128MB 11258 10550 10712
  32MB 10963 10272 10410
   4MB 11152 10366 10589
   text  128MB 10935 10694 10930
  32MB 10822 10672 10861
   4MB 10835 10684 10895

Or, relative to master:

machine data type  memory  v2   v3
--
 i5bigint   128MB  97.40%  100.17%
 32MB  97.34%   99.90%
  4MB  97.62%  101.48%
  int   128MB  98.03%  100.11%
 32MB  98.08%   99.98%
  4MB  97.91%  100.31%
 text   128MB  98.37%  101.46%
 32MB  98.32%  101.40%
  4MB  99.35%  102.20%
   xeonbigint   128MB  97.82%  100.20%
 32MB  97.03%   98.95%
  4MB  96.89%   99.70%
  int   128MB  93.71%   95.15%
 32MB  93.70%   94.95%
  4MB  92.95%   94.95%
 text   128MB  97.80%   99.96%
 32MB  98.62%  100.36%
  4MB  98.61%  100.55%

So to me it seems v2 performs demonstrably better, v3 is consistently
slower - not only compared to v2, but often also to master.

Attached is the script I used and the raw results - this includes also
results for logged tables - the improvement is smaller, but the
conclusions are otherwise similar.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[attachment: raw benchmark results (per-run timings), truncated in the archive]

Re: locale -a missing on Alpine Linux?

2022-11-17 Thread Tom Lane
Andrew Dunstan  writes:
> malleefowl is a docker instance (mostly docker images is what Alpine is
> used for). It would be extremely easy to recreate the image and add in
> musl-locales, but maybe we should just leave it as it is to test the
> case where the command isn't available.

Agreed, leave it as-is.

regards, tom lane




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 2:52 PM Andres Freund  wrote:
> I don't think we'd want every buffer write or whatnot go through the
> changecount mechanism, on some non-x86 platforms that could be noticeable. But
> if we didn't stage the stats updates locally I think we could make most of the
> stats changes without that overhead.  For updates that just increment a single
> counter there's simply no benefit in the changecount mechanism afaict.

You might be right, but I'm not sure whether it's worth stressing
about. The progress reporting mechanism uses the st_changecount
mechanism, too, and as far as I know nobody's complained about that
having too much overhead. Maybe they have, though, and I've just
missed it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Supporting TAP tests with MSVC and Windows

2022-11-17 Thread Andrew Dunstan


On 2022-11-17 Th 01:18, Noah Misch wrote:
> On Mon, Aug 22, 2022 at 09:44:42AM -0400, Andrew Dunstan wrote:
>> On 2022-08-21 Su 20:40, Noah Misch wrote:
>>> This (commit 13d856e of 2015-07-29) added the following:
>>>
>>> --- a/src/test/perl/TestLib.pm
>>> +++ b/src/test/perl/TestLib.pm
>>> @@ -242,7 +288,17 @@ sub command_exit_is
>>> print("# Running: " . join(" ", @{$cmd}) ."\n");
>>> my $h = start $cmd;
>>> $h->finish();
>>> -   is($h->result(0), $expected, $test_name);
>>> +
>>> +   # On Windows, the exit status of the process is returned directly as the
>>> +   # process's exit code, while on Unix, it's returned in the high bits
>>> +   # of the exit code (see WEXITSTATUS macro in the standard <sys/wait.h>
>>> +   # header file). IPC::Run's result function always returns exit code >> 
>>> 8,
>>> +   # assuming the Unix convention, which will always return 0 on Windows as
>>> +   # long as the process was not terminated by an exception. To work around
>>> +   # that, use $h->full_result on Windows instead.
>>> +   my $result = ($Config{osname} eq "MSWin32") ?
>>> +   ($h->full_results)[0] : $h->result(0);
>>> +   is($result, $expected, $test_name);
>>>  }
>>>
>>> That behavior came up again in the context of a newer IPC::Run test case.  
>>> I'm
>>> considering changing the IPC::Run behavior such that the above would have 
>>> been
>>> unnecessary.  However, if I do, the above code would want to adapt to handle
>>> pre-change and post-change IPC::Run versions.  If you have an opinion on
>>> whether or how IPC::Run should change, I welcome comments on
>>> https://github.com/toddr/IPC-Run/issues/161.
>> Assuming it changes, we'll have to have a version test here. I don't
>> think we can have a flag day where we suddenly require IPC::Run's
>> bleeding edge on Windows. So changing it is a good thing, but it won't
>> help us much.
> IPC::Run git now has the change, and we may release soon.  Here's what I plan
> to push to make PostgreSQL cope.


LGTM


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: locale -a missing on Alpine Linux?

2022-11-17 Thread Andrew Dunstan


On 2022-11-16 We 14:25, Tom Lane wrote:
> Christoph Moench-Tegeder  writes:
>> ## Peter Eisentraut (peter.eisentr...@enterprisedb.com):
>>> First of all, is this a standard installation of this OS, or is perhaps 
>>> something incomplete, broken, or unusual about the current OS installation?
>> Alpine uses musl libc, on which you need package musl-locales to get
>> a /usr/bin/locale.
>> https://pkgs.alpinelinux.org/package/edge/community/x86/musl-locales
> Ah.  And that also shows that if you didn't install that package,
> you don't have any locales either, except presumably C/POSIX.
>
> So probably we should treat failure of the locale command as okay
> and just press on with no non-built-in locales.


malleefowl is a docker instance (mostly docker images is what Alpine is
used for). It would be extremely easy to recreate the image and add in
musl-locales, but maybe we should just leave it as it is to test the
case where the command isn't available.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: HOT chain validation in verify_heapam()

2022-11-17 Thread Robert Haas
On Wed, Nov 16, 2022 at 10:57 PM Himanshu Upadhyaya
 wrote:
> I think if pred_xmin is aborted and curr_xmin is in progress we should 
> consider it as a corruption case but vice versa is not true.

Yeah, you're right. I'm being stupid about subtransactions again.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)

2022-11-17 Thread Bharath Rupireddy
On Thu, Nov 17, 2022 at 12:49 AM Robert Haas  wrote:
>
> > I also think that a new pg_stat_checkpoint view is needed
> > because, right now, the PgStat_CheckpointerStats stats are exposed via
> > the pg_stat_bgwriter view, having a separate view for checkpoint stats
> > is good here.
>
> Yep.

On Wed, Nov 16, 2022 at 11:44 PM Andres Freund  wrote:
>
> > I also think that a new pg_stat_checkpoint view is needed
> > because, right now, the PgStat_CheckpointerStats stats are exposed via
> > the pg_stat_bgwriter view, having a separate view for checkpoint stats
> > is good here.
>
> I agree that we should do that, but largely independent of the architectural
> question at hand.

Thanks. I quickly prepared a patch introducing pg_stat_checkpointer
view and posted it here -
https://www.postgresql.org/message-id/CALj2ACVxX2ii%3D66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg%40mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Don't treat IndexStmt like AlterTable when DefineIndex is called from ProcessUtilitySlow.

2022-11-17 Thread 正华吕
Hi,

  Recently, while reading the code, I found that when DefineIndex is called
  from ProcessUtilitySlow, is_alter_table will be set to true if
  the statement came from expandTableLikeClause.

  I checked the code of DefineIndex; there are only two places that use
  is_alter_table:
1. the function index_check_primary_key
2. print a debug log on what the statement is

  For 1, since we are doing create table xxx (like yyy including
  indexes), we are sure that the relationHasPrimaryKey check in
  index_check_primary_key will be satisfied, because we are just
  creating the new table.

  For 2, I do not think it will mislead the user if we print it as
  CreateStmt.

  Based on the above, I think we can always pass a false value
  for is_alter_table when DefineIndex is called from
  ProcessUtilitySlow.

   Here I attach a patch. Any ideas?
   Thanks a lot.

Best,
Zhenghua Lyu


0001-Don-t-treate-IndexStmt-like-AlterTable-when-DefineIn.patch
Description: Binary data


Introduce a new view for checkpointer related stats

2022-11-17 Thread Bharath Rupireddy
Hi,

pg_stat_bgwriter view currently reports checkpointer stats as well. It
is that way because historically checkpointer was part of bgwriter
until the commits 806a2ae and bf405ba, that went into PG 9.2,
separated them out. I think it is time for us to separate checkpointer
stats to its own view. I discussed it in another thread [1] and it
seems like there's an unequivocal agreement for the proposal.

I'm attaching a patch introducing a new pg_stat_checkpointer view;
with this change the pg_stat_bgwriter view only focuses on bgwriter-related
stats. The patch makes mostly mechanical changes. I'll add it
to CF in a bit.

Thoughts?

[1]
https://www.postgresql.org/message-id/flat/20221116181433.y2hq2pirtbqmmndt%40awork3.anarazel.de#b873a4bd7d8d7ec70750a7047db33f56
https://www.postgresql.org/message-id/CA%2BTgmoYCu6RpuJ3cZz0e7cZSfaVb%3Dcr6iVcgGMGd5dxX0MYNRA%40mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 707b3f0e0e7cf55b48a09709b070341d23ad03b8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 17 Nov 2022 12:20:13 +
Subject: [PATCH v1] Introduce a new view for checkpointer related stats

pg_stat_bgwriter view currently reports checkpointer stats as well.
It is that way because historically checkpointer was part of
bgwriter until the commits 806a2ae and bf405ba, that went into
PG 9.2, separated them out.

It is time for us to separate checkpointer stats to its own view
called pg_stat_checkpointer.

Bump catalog version.
---
 doc/src/sgml/monitoring.sgml  | 108 +-
 src/backend/catalog/system_views.sql  |  18 +--
 .../utils/activity/pgstat_checkpointer.c  |   1 +
 src/backend/utils/adt/pgstatfuncs.c   |  19 +--
 src/include/catalog/pg_proc.dat   |  22 ++--
 src/include/pgstat.h  |   1 +
 src/test/recovery/t/029_stats_restart.pl  |   6 +-
 src/test/regress/expected/rules.out   |  15 +--
 src/test/regress/expected/stats.out   |  21 +++-
 src/test/regress/sql/stats.sql|  12 +-
 10 files changed, 154 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d514..52c3118727 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -448,6 +448,15 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
  
 
+ 
+  pg_stat_checkpointerpg_stat_checkpointer
+  One row only, showing statistics about the
+   checkpointer process's activity. See
+   
+   pg_stat_checkpointer for details.
+ 
+ 
+
  
   pg_stat_walpg_stat_wal
   One row only, showing statistics about WAL activity. See
@@ -3647,97 +3656,138 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 
  
   
-   checkpoints_timed bigint
+   buffers_clean bigint
   
   
-   Number of scheduled checkpoints that have been performed
+   Number of buffers written by the background writer
   
  
 
  
   
-   checkpoints_req bigint
+   maxwritten_clean bigint
   
   
-   Number of requested checkpoints that have been performed
+   Number of times the background writer stopped a cleaning
+   scan because it had written too many buffers
   
  
 
  
   
-   checkpoint_write_time double precision
+   buffers_alloc bigint
   
   
-   Total amount of time that has been spent in the portion of
-   checkpoint processing where files are written to disk, in milliseconds
+   Number of buffers allocated
   
  
 
  
   
-   checkpoint_sync_time double precision
+   stats_reset timestamp with time zone
   
   
-   Total amount of time that has been spent in the portion of
-   checkpoint processing where files are synchronized to disk, in
-   milliseconds
+   Time at which these statistics were last reset
   
  
+
+   
+  
+
+ 
+
+ 
+  pg_stat_checkpointer
+
+  
+   pg_stat_checkpointer
+  
 
+  
+   The pg_stat_checkpointer view will always have a
+   single row, containing global data for the cluster.
+  
+
+  
+   pg_stat_checkpointer View
+   
+
  
   
-   buffers_checkpoint bigint
+   Column Type
   
   
-   Number of buffers written during checkpoints
+   Description
+  
+ 
+
+
+
+ 
+  
+   checkpoints_timed bigint
+  
+  
+   Number of scheduled checkpoints that have been performed
   
  
 
  
   
-   buffers_clean bigint
+   checkpoints_req bigint
   
   
-   Number of buffers written by the background writer
+   Number of requested checkpoints that have been performed
   
  
 
  
   
-   maxwritten_clean bigint
+   checkpoint_write_time double 

Re: old_snapshot: add test for coverage

2022-11-17 Thread Daniel Gustafsson
> On 8 Aug 2022, at 07:37, Tom Lane  wrote:
> Dong Wook Lee  writes:

>> I wrote a test of the old_snapshot extension for coverage.
> 
> Hmm, does this really provide any meaningful coverage?  The test
> sure looks like it's not doing much.

Looking at this I agree; this test doesn't provide enough coverage to be of
value, and the LIMIT 0 might even hide bugs under a positive test result.  I think we
should mark this entry RwF.

--
Daniel Gustafsson   https://vmware.com/





CREATE UNLOGGED TABLE seq faults when debug_discard_caches=1

2022-11-17 Thread Spyridon Dimitrios Agathos
Hi Hackers,

while testing the developer settings of PSQL (14.5) I came across this
issue:

postgres=# CREATE UNLOGGED TABLE stats (
postgres(# pg_hash BIGINT NOT NULL,
postgres(# category TEXT NOT NULL,
postgres(# PRIMARY KEY (pg_hash, category)
postgres(# );
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Checking the stack trace I found this:
Program received signal SIGSEGV, Segmentation fault.
0x00ab6662 in smgrwrite (reln=0x0, forknum=INIT_FORKNUM,
blocknum=0, buffer=0x2b5eec0 "", skipFsync=true)
at
/opt/postgresql-src/debug-build/../src/backend/storage/smgr/smgr.c:526
526 smgrsw[reln->smgr_which].smgr_write(reln, forknum, blocknum,
(gdb) bt
#0  0x00ab6662 in smgrwrite (reln=0x0, forknum=INIT_FORKNUM,
blocknum=0, buffer=0x2b5eec0 "", skipFsync=true) at
/opt/postgresql-src/debug-build/../src/backend/storage/smgr/smgr.c:526
#1  0x0056991b in btbuildempty (index=0x7fe60ac9be60) at
/opt/postgresql-src/debug-build/../src/backend/access/nbtree/nbtree.c:166
#2  0x00623ad9 in index_build (heapRelation=0x7fe60ac9c078,
indexRelation=0x7fe60ac9be60, indexInfo=0x2b4c330, isreindex=false,
parallel=true) at
/opt/postgresql-src/debug-build/../src/backend/catalog/index.c:3028
#3  0x00621886 in index_create (heapRelation=0x7fe60ac9c078,
indexRelationName=0x2b4c448 "stats_pkey", indexRelationId=16954,
parentIndexRelid=0, parentConstraintId=0, relFileNode=0,
indexInfo=0x2b4c330, indexColNames=0x2b4bee8, accessMethodObjectId=403,
tableSpaceId=0, collationObjectId=0x2b4c560, classObjectId=0x2b4c580,
coloptions=0x2b4c5a0, reloptions=0,
flags=3, constr_flags=0, allow_system_table_mods=false,
is_internal=false, constraintId=0x7ffef5cc4a7c) at
/opt/postgresql-src/debug-build/../src/backend/catalog/index.c:1232
#4  0x0074af6e in DefineIndex (relationId=16949, stmt=0x2b527a0,
indexRelationId=0, parentIndexId=0, parentConstraintId=0,
is_alter_table=false, check_rights=true, check_not_in_use=true,
skip_build=false, quiet=false) at
/opt/postgresql-src/debug-build/../src/backend/commands/indexcmds.c:1164
#5  0x00ac8d78 in ProcessUtilitySlow (pstate=0x2b49230,
pstmt=0x2b48fe8, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\n
 pg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY
(pg_hash, category)\n);", context=PROCESS_UTILITY_SUBCOMMAND, params=0x0,
queryEnv=0x0, dest=0xe9ceb0 , qc=0x0)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1535
#6  0x00ac6637 in standard_ProcessUtility (pstmt=0x2b48fe8,
queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT
NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash,
category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_SUBCOMMAND,
params=0x0, queryEnv=0x0, dest=0xe9ceb0 , qc=0x0)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1066
#7  0x00ac548b in ProcessUtility (pstmt=0x2b48fe8,
queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT
NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash,
category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_SUBCOMMAND,
params=0x0, queryEnv=0x0, dest=0xe9ceb0 , qc=0x0)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:527
#8  0x00ac7e5e in ProcessUtilitySlow (pstate=0x2b52b10,
pstmt=0x2a72d28, queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\n
 pg_hash BIGINT NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY
(pg_hash, category)\n);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1244
#9  0x00ac6637 in standard_ProcessUtility (pstmt=0x2a72d28,
queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT
NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash,
category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:1066
#10 0x00ac548b in ProcessUtility (pstmt=0x2a72d28,
queryString=0x2a71650 "CREATE UNLOGGED TABLE stats (\npg_hash BIGINT
NOT NULL,\ncategory TEXT NOT NULL,\nPRIMARY KEY (pg_hash,
category)\n);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x2a72df8, qc=0x7ffef5cc6c10)
at /opt/postgresql-src/debug-build/../src/backend/tcop/utility.c:527
#11 0x00ac4aad in PortalRunUtility (portal=0x2b06bf0,
pstmt=0x2a72d28, isTopLevel=true, setHoldSnapshot=false, dest=0x2a72df8,
qc=0x7ffef5cc6c10) at
/opt/postgresql-src/debug-build/../src/backend/tcop/pquery.c:1155
#12 0x00ac3b57 in PortalRunMulti (portal=0x2b06bf0,
isTopLevel=true, setHoldSnapshot=false, 

Re: Typo in SH_LOOKUP and SH_LOOKUP_HASH comments

2022-11-17 Thread Daniel Gustafsson
> On 17 Nov 2022, at 10:56, vignesh C  wrote:

> When I was reviewing one of the patches, I found a typo in the
> comments of SH_LOOKUP and SH_LOOKUP_HASH. I felt "lookup up" should
> have been "lookup".
> Attached a patch to modify it.

I agree with that, applied to HEAD. Thanks!

--
Daniel Gustafsson   https://vmware.com/





redundant check of msg in does_not_exist_skipping

2022-11-17 Thread Ted Yu
Hi,
I was looking at commit aca992040951c7665f1701cd25d48808eda7a809

I think the check of msg after the switch statement is not necessary. The
variable msg is used afterward.
If there is (potential) missing case in switch statement, the compiler
would warn.

How about removing the check ?

Thanks

diff --git a/src/backend/commands/dropcmds.c
b/src/backend/commands/dropcmds.c
index db906f530e..55996940eb 100644
--- a/src/backend/commands/dropcmds.c
+++ b/src/backend/commands/dropcmds.c
@@ -518,9 +518,6 @@ does_not_exist_skipping(ObjectType objtype, Node
*object)

 /* no default, to let compiler warn about missing case */
 }
-if (!msg)
-elog(ERROR, "unrecognized object type: %d", (int) objtype);
-
 if (!args)
 ereport(NOTICE, (errmsg(msg, name)));
 else


Re: Lockless queue of waiters in LWLock

2022-11-17 Thread Pavel Borisov
Hi, hackers!

I've noticed that alignment requirements for using
pg_atomic_init_u64() for LWlock state (there's an assertion that it's
aligned at 8 bytes) do not correspond to the code in SimpleLruInit()
on 32-bit arch when MAXIMUM_ALIGNOF = 4.
Fixed this in v4 patch (PFA).

Regards,
Pavel.


v4-0001-Lockless-queue-of-LWLock-waiters.patch
Description: Binary data


Re: [PoC] Reducing planning time when tables have many partitions

2022-11-17 Thread Thom Brown
On Thu, 17 Nov 2022 at 11:20, Thom Brown  wrote:
>
> On Thu, 17 Nov 2022 at 09:31, Alvaro Herrera  wrote:
> >
> > On 2022-Nov-16, Thom Brown wrote:
> >
> > > Once the issue Tom identified has been resolved, I'd like to test
> > > drive newer patches.
> >
> > What issue?  If you mean the one from the thread "Reducing
> > duplicativeness of EquivalenceClass-derived clauses", that patch is
> > already applied (commit a5fc46414deb), and Yuya Watari's v8 series
> > applies fine to current master.
>
> Ah, I see..  I'll test the v8 patches.

No issues with applying.  Created 1024 partitions, each of which is
partitioned into 64 partitions.
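
For context, one way to build a comparable two-level layout (hash
partitioning assumed here; the actual test schema may well differ):

    CREATE TABLE t (a int, b int) PARTITION BY HASH (a);
    DO $$
    BEGIN
      FOR i IN 0..1023 LOOP
        EXECUTE format('CREATE TABLE t_%s PARTITION OF t
                        FOR VALUES WITH (MODULUS 1024, REMAINDER %s)
                        PARTITION BY HASH (b)', i, i);
        FOR j IN 0..63 LOOP
          EXECUTE format('CREATE TABLE t_%s_%s PARTITION OF t_%s
                          FOR VALUES WITH (MODULUS 64, REMAINDER %s)',
                         i, j, i, j);
        END LOOP;
      END LOOP;
    END $$;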

I'm getting a generic planning time of 1415ms.  Is that considered
reasonable in this situation?  Bear in mind that the planning time
prior to this patch was 282311ms, so pretty much a 200x speedup.

Thom



