Re: typos (and more)

2021-09-29 Thread Michael Paquier
On Mon, Sep 27, 2021 at 09:27:56PM -0500, Justin Pryzby wrote:
> That's an "expanded" version of 0008.

Okay, thanks.

> It doesn't include 0012, which is primarily about fixing incorrect references
> to "index expressions" that should refer to stats expressions.  Naturally 0012
> also uses the phrase "statistics objects", and fixes one nearby reference
> that's not itself about indexes, which could arguably be in 0008 instead..

Merging both made the most sense to me after reviewing the whole area
of the code dedicated to stats.  This has been applied after taking
care of some issues with the indentation, with a few extra tweaks.
--
Michael




[PATCH] Added TRANSFORM FOR for COMMENT tab completion

2021-09-29 Thread katouknl

Hi,

I created a patch for COMMENT tab completion.
It was missing TRANSFORM FOR where it's supposed to be.

Best wishes,
Ken Kato
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 5cd5838668..d18b26a4a5 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -494,7 +494,6 @@ static const SchemaQuery Query_for_list_of_partitioned_indexes = {
 	.result = "pg_catalog.quote_ident(c.relname)",
 };
 
-
 /* All relations */
 static const SchemaQuery Query_for_list_of_relations = {
 	.catname = "pg_catalog.pg_class c",
@@ -628,7 +627,6 @@ static const SchemaQuery Query_for_list_of_collations = {
 	.result = "pg_catalog.quote_ident(c.collname)",
 };
 
-
 /*
  * Queries to get lists of names of various kinds of things, possibly
  * restricted to names matching a partially entered name.  In these queries,
@@ -2408,13 +2406,19 @@ psql_completion(const char *text, int start, int end)
 	  "COLUMN", "AGGREGATE", "FUNCTION",
 	  "PROCEDURE", "ROUTINE",
 	  "OPERATOR", "TRIGGER", "CONSTRAINT", "DOMAIN",
-	  "LARGE OBJECT", "TABLESPACE", "TEXT SEARCH", "ROLE");
+	  "LARGE OBJECT", "TABLESPACE", "TEXT SEARCH", "TRANSFORM FOR", "ROLE");
 	else if (Matches("COMMENT", "ON", "ACCESS", "METHOD"))
 		COMPLETE_WITH_QUERY(Query_for_list_of_access_methods);
 	else if (Matches("COMMENT", "ON", "FOREIGN"))
 		COMPLETE_WITH("DATA WRAPPER", "TABLE");
 	else if (Matches("COMMENT", "ON", "TEXT", "SEARCH"))
 		COMPLETE_WITH("CONFIGURATION", "DICTIONARY", "PARSER", "TEMPLATE");
+	else if (Matches("COMMENT", "ON", "TRANSFORM", "FOR"))
+		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_datatypes, NULL);
+	else if (Matches("COMMENT", "ON", "TRANSFORM", "FOR", MatchAny))
+		COMPLETE_WITH("LANGUAGE");
+	else if (Matches("COMMENT", "ON", "TRANSFORM", "FOR", MatchAny, "LANGUAGE"))
+		COMPLETE_WITH_QUERY(Query_for_list_of_languages);	
 	else if (Matches("COMMENT", "ON", "CONSTRAINT"))
 		COMPLETE_WITH_QUERY(Query_for_all_table_constraints);
 	else if (Matches("COMMENT", "ON", "CONSTRAINT", MatchAny))
@@ -2430,7 +2434,8 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_QUERY(Query_for_list_of_event_triggers);
 	else if (Matches("COMMENT", "ON", MatchAny, MatchAnyExcept("IS")) ||
 			 Matches("COMMENT", "ON", MatchAny, MatchAny, MatchAnyExcept("IS")) ||
-			 Matches("COMMENT", "ON", MatchAny, MatchAny, MatchAny, MatchAnyExcept("IS")))
+			 Matches("COMMENT", "ON", MatchAny, MatchAny, MatchAny, MatchAnyExcept("IS")) ||
+			 Matches("COMMENT", "ON", MatchAny, MatchAny, MatchAny, MatchAny, MatchAnyExcept("IS")))
 		COMPLETE_WITH("IS");
 
 /* COPY */


Re: Added schema level support for publication.

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 11:59 AM houzj.f...@fujitsu.com
 wrote:
>
> On Wed, Sep 29, 2021 12:34 PM Amit Kapila 
> > On Wed, Sep 29, 2021 at 9:07 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Tues, Sep 28, 2021 10:46 PM vignesh C  wrote:
> > > > Attached v34 patch has the changes for the same.
> > >
> > > 3)
> > > +   /*
> > > +    * Check if setting the relation to a different schema will result in
> > > +    * the publication having schema and same schema's table in the
> > > +    * publication.
> > > +    */
> > > +   if (stmt->objectType == OBJECT_TABLE)
> > > +   {
> > > +       ListCell   *lc;
> > > +       List       *schemaPubids = GetSchemaPublications(nspOid);
> > > +       foreach(lc, schemaPubids)
> > > +       {
> > > +           Oid pubid = lfirst_oid(lc);
> > > +           if (list_member_oid(GetPublicationRelations(pubid, PUBLICATION_PART_ALL),
> > > +                               relid))
> > > +               ereport(ERROR,
> > >
> > > How about we check this case like the following ?
> > >
> > > List   *schemaPubids = GetSchemaPublications(nspOid);
> > > List   *relPubids = GetRelationPublications(RelationGetRelid(rel));
> > > if (list_intersection(schemaPubids, relPubids))
> > > ereport(ERROR, ...
> > >
> >
> > Won't this will allow changing one of the partitions for which only 
> > partitioned
> > table is part of the target schema?
>
> I think it still disallow changing partition's schema to the published one.
> I tested with the following SQLs.
> -
> create schema sch1;
> create schema sch2;
> create schema sch3;
>
> create table sch1.tbl1 (a int) partition by range ( a );
> create table sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to (101);
> create table sch3.tbl1_part2 partition of sch1.tbl1 for values from (101) to (200);
> create publication pub for ALL TABLES IN schema sch1, TABLE sch2.tbl1_part1;
> alter table sch2.tbl1_part1 set schema sch1;
> ---* It will report an error here *
> -
>

Run all the steps before "create publication" and then try the below. These
will give an error with the proposed patch, but if I change it to what
you are proposing then it won't give an error.
create publication pub for ALL TABLES IN schema sch2, Table sch1.tbl1;
alter table sch3.tbl1_part2 set schema sch2;

But now, thinking about it again, I am not sure if we really want to
give an error in this case. What do you think? Also, if we use the
list_intersection trick, then how will we tell the publication due to
which this problem has occurred, or do you think we should leave that
as an exercise for the user?

-- 
With Regards,
Amit Kapila.




Re: stat() vs ERROR_DELETE_PENDING, round N + 1

2021-09-29 Thread Juan José Santamaría Flecha
On Tue, Sep 28, 2021 at 2:50 AM Thomas Munro  wrote:

> On Thu, Sep 23, 2021 at 9:13 PM Juan José Santamaría Flecha
>  wrote:
> > When GetTempFileName() finds a duplicated file name and the file is
> pending for deletion, it fails with an "ERROR_ACCESS_DENIED" error code.
> That may describe the situation better than "ERROR_FILE_EXISTS".
>
> Thanks for looking.  Why do you think that's better?  I assume that's
> just the usual NT->Win32 error conversion at work.
>
When a function returns an error caused by accessing a file
in DELETE_PENDING you should expect an EACCES. Nonetheless, if we can
emulate a POSIX behaviour by mapping it to EEXIST, that works for me. I
also consider that having the logic for DELETE_PENDING in a single function
is an improvement.
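
(As a minimal sketch, assuming a wrapper around the failing Win32 call, the
translation being discussed would look roughly like the following; this is
illustrative only, not the actual src/port code.)

switch (GetLastError())
{
	case ERROR_ACCESS_DENIED:	/* what DELETE_PENDING often surfaces as */
		errno = EACCES;
		break;
	case ERROR_FILE_EXISTS:
	case ERROR_ALREADY_EXISTS:
		errno = EEXIST;
		break;
	default:
		errno = EINVAL;
		break;
}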

Regards,

Juan José Santamaría Flecha


Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 11:35 AM osumi.takami...@fujitsu.com
 wrote:
>
> Hi,
>
>
> Thank you, Amit-san and Sawada-san for the discussion.
> On Tuesday, September 28, 2021 7:05 PM Amit Kapila  
> wrote:
> > > Another idea could be to have a separate view, say
> > > pg_stat_subscription_xact but I'm not sure it's a better idea.
> > >
> >
> > Yeah, that is another idea but I am afraid that having three different
> > views for subscription stats will be too much. I think it would be
> > better if we can display these additional stats via the existing view
> > pg_stat_subscription or the new view pg_stat_subscription_errors (or
> > whatever name we want to give it).
> pg_stat_subscription_errors specializes in showing an error record.
> So, it would be awkward to combine it with other normal xact stats.
>
>
> > > > > Then, if, we proceed in this direction, the place to implement
> > > > > those stats would be on the LogicalRepWorker struct, instead ?
> > > > >
> > > >
> > > > Or, we can make existing stats persistent and then add these stats
> > > > on top of it. Sawada-San, do you have any thoughts on this matter?
> > >
> > > I think that making existing stats including received_lsn and
> > > last_msg_receipt_time persistent by using stats collector could cause
> > > massive reporting messages. We can report these messages with a
> > > certain interval to reduce the amount of messages but we will end up
> > > seeing old stats on the view.
> > >
> >
> > Can't we keep the current and new stats both in-memory and persist on disk?
> > So, the persistent stats data will be used to fill the in-memory counters 
> > after
> > restarting of workers, otherwise, we will always refer to in-memory values.
> I felt this isn't impossible.
> When we have to update the values of the xact stats is
> the end of message apply for COMMIT, COMMIT PREPARED, STREAM_ABORT and etc
> or the time when an error happens during apply. Then, if we want,
> we can update xact stats values at such moments accordingly.
> I'm thinking that we will have a hash table whose key is a pair of subid + 
> relid
> and entry is a proposed stats structure and update the entry,
> depending on the above timings.
>

Are you thinking of a separate hash table than the one we are going to
create for Sawada-San's patch related to error stats? Isn't it
possible to have the stats in the same hash table and the same file?
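
For concreteness, the per-relation key being discussed upthread would look
roughly like the struct below (the names are hypothetical, not taken from any
posted patch):

typedef struct SubWorkerStatsKey
{
	Oid		subid;		/* subscription OID */
	Oid		relid;		/* relation OID; InvalidOid for the apply worker */
} SubWorkerStatsKey;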

> Here, one thing a bit unclear to me is
> whether we should move existing stats of pg_stat_subscription
> (such as last_lsn and reply_lsn) to the hash entry or not.
>

I think we should move them to the hash entry. I think that is an
improvement over what we have now, because currently those stats get
lost after a restart.

-- 
With Regards,
Amit Kapila.




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Ranier Vilela
On Wed, Sep 29, 2021 at 06:55, Drouvot, Bertrand wrote:

> Hi,
>
> On 9/28/21 6:50 PM, Tom Lane wrote:
> > "Drouvot, Bertrand"  writes:
> >> Does it make sense (as it is currently) to set the ActiveSnapshot to
> >> NULL and not ensuring the same is done for ActivePortal->portalSnapshot?
> > I think that patch is just a kluge :-(
> right, sounded too simple to be "true".
> >
> > After tracing through this I've figured out what is happening, and
> > why you need to put the exception block inside a loop to make it
> > happen.  The first iteration goes fine, and we end up with an empty
> > ActiveSnapshot stack (and no portalSnapshots) because that's how
> > the COMMIT leaves it.  In the second iteration, we try to
> > re-establish the portal snapshot, but at the point where
> > EnsurePortalSnapshotExists is called (from the first INSERT
> > command) we are already inside a subtransaction thanks to the
> > plpgsql exception block.  This means that the stacked ActiveSnapshot
> > has as_level = 2, although the Portal owning it belongs to the
> > outer transaction.  So at the next exception, AtSubAbort_Snapshot
> > zaps the stacked ActiveSnapshot, but the Portal stays alive and
> > now it has a dangling portalSnapshot pointer.
>
> Thanks for the explanation!
>
> > Basically there seem to be two ways to fix this, both requiring
> > API additions to snapmgr.c (hence, cc'ing Alvaro for opinion):
> >
> > 1. Provide a variant of PushActiveSnapshot that allows the caller
> > to specify the as_level to use, and then have
> > EnsurePortalSnapshotExists, as well as other places that create
> > Portal-associated snapshots, use this with as_level equal to the
> > Portal's createSubid.  The main hazard I see here is that this could
> > theoretically allow the ActiveSnapshot stack to end up with
> > out-of-order as_level values, causing issues.  For the moment we
> > could probably just add assertions to check that the passed as_level
> > is >= next level, or maybe even that this variant is only called with
> > empty ActiveSnapshot stack.
> I think we may get a non empty ActiveSnapshot stack with prepared
> statements, so tempted to do the assertion on the levels.
> >
> > 2. Provide a way for AtSubAbort_Portals to detect whether a
> > portalSnapshot pointer points to a snap that's going to be
> > deleted by AtSubAbort_Snapshot, and then just have it clear any
> > portalSnapshots that are about to become dangling.  (This'd amount
> > to accepting the possibility that portalSnapshot is of a different
> > subxact level from the portal, and dealing with the situation.)
> >
> > I initially thought #2 would be the way to go, but it turns out
> > to be a bit messy since what we have is a Snapshot pointer not an
> > ActiveSnapshotElt pointer.  We'd have to do something like search the
> > ActiveSnapshot stack looking for pointer equality to the caller's
> > Snapshot pointer, which seems fragile --- do we assume as_snap is
> > unique for any other purpose?
> >
> > That being the case, I'm now leaning to #1.  Thoughts?
>
> I'm also inclined to #1.
>
I have a stupid question: why duplicate PushActiveSnapshot?
Wouldn't one function be better?

PushActiveSnapshot(Snapshot snap, int as_level);

Sample calls:
PushActiveSnapshot(GetTransactionSnapshot(), GetCurrentTransactionNestLevel());
PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);

regards,
Ranier Vilela


Re: POC: Cleaning up orphaned files using undo logs

2021-09-29 Thread Amit Kapila
On Tue, Sep 28, 2021 at 7:36 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Mon, Sep 20, 2021 at 10:55 AM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Thu, Sep 9, 2021 at 8:33 PM Antonin Houska  wrote:
> > >
> > > > > The problem of the temporary undo log is that it's loaded into local 
> > > > > buffers
> > > > > and that backend can exit w/o flushing local buffers to disk, and 
> > > > > thus we are
> > > > > not guaranteed to find enough information when trying to discard the 
> > > > > undo log
> > > > > the backend wrote. I'm thinking about the following solutions:
> > > > >
> > > > > 1. Let the backend manage temporary undo log on its own (even the slot
> > > > >metadata would stay outside the shared memory, and in particular 
> > > > > the
> > > > >insertion pointer could start from 1 for each session) and remove 
> > > > > the
> > > > >segment files at the same moment the temporary relations are 
> > > > > removed.
> > > > >
> > > > >However, by moving the temporary undo slots away from the shared 
> > > > > memory,
> > > > >computation of oldestFullXidHavingUndo (see the PROC_HDR 
> > > > > structure) would
> > > > >be affected. It might seem that a transaction which only writes 
> > > > > undo log
> > > > >for temporary relations does not need to affect 
> > > > > oldestFullXidHavingUndo,
> > > > >but it needs to be analyzed thoroughly. Since 
> > > > > oldestFullXidHavingUndo
> > > > >prevents transactions to be truncated from the CLOG too early, I 
> > > > > wonder if
> > > > >the following is possible (This scenario is only applicable to the 
> > > > > zheap
> > > > >storage engine [1], which is not included in this patch, but 
> > > > > should already
> > > > >be considered.):
> > > > >
> > > > >A transaction creates a temporary table, does some (many) changes 
> > > > > and then
> > > > >gets rolled back. The undo records are being applied and it takes 
> > > > > some
> > > > >time. Since XID of the transaction did not affect 
> > > > > oldestFullXidHavingUndo,
> > > > >the XID can disappear from the CLOG due to truncation.
> > > > >
> > > >
> > > > By above do you mean to say that in zheap code, we don't consider XIDs
> > > > that operate on temp table/undo for oldestFullXidHavingUndo?
> > >
> > > I was referring to the code
> > >
> > > /* We can't process temporary undo logs. */
> > > if (log->meta.persistence == UNDO_TEMP)
> > > continue;
> > >
> > > in undodiscard.c:UndoDiscard().
> > >
> >
> > Here, I think it will just skip undo of temporary undo logs and
> > oldestFullXidHavingUndo should be advanced after skipping it.
>
> Right, it'll be advanced, but the transaction XID (if the transaction wrote
> only to temporary relations) might still be needed.
>
> > > >
> > > > > However zundo.c in
> > > > >[1] indicates that the transaction status *is* checked during undo
> > > > >execution, so we might have a problem.
> > > > >
> > > >
> > > > It would be easier to follow if you can tell which exact code are you
> > > > referring here?
> > >
> > > In meant the call of TransactionIdDidCommit() in
> > > zundo.c:zheap_exec_pending_rollback().
> > >
> >
> > IIRC, this should be called for temp tables after they have exited as
> > this is only to apply the pending undo actions if any, and in case of
> > temporary undo after session exit, we shouldn't need it.
>
> I see (had to play with debugger a bit). Currently this works because the
> temporary relations are dropped by AbortTransaction() ->
> smgrDoPendingDeletes(), before the undo execution starts. The situation will
> change as soon as the file removal will also be handled by the undo subsystem,
> however I'm still not sure how to hit the TransactionIdDidCommit() call for
> the XID already truncated from CLOG.
>
> I'm starting to admit that there's no issue here: temporary undo is always
> applied immediately in foreground, and thus the zheap_exec_pending_rollback()
> function never needs to examine XID which no longer exists in the CLOG.
>
> > I am not able to understand what exact problem you are facing for temp
> > tables after the session exit. Can you please explain it a bit more?
>
> The problem is that the temporary undo buffers are loaded into backend-local
> buffers. Thus there's no guarantee that we'll find a consistent information in
> the undo file even if the backend exited cleanly (local buffers are not
> flushed at backend exit and there's no WAL for them). However, we need to read
> the undo file to find out if (part of) it can be discarded.
>
> I'm trying to find out whether we can ignore the temporary undo when trying to
> advance oldestFullXidHavingUndo or not.
>

It seems this is the crucial point. In the code you pointed to, we
ignore the temporary undo while advancing oldestFullXidHavingUndo, but
if you find any case where that is not true then we need to di

Re: Trap errors from streaming child in pg_basebackup to exit early

2021-09-29 Thread Daniel Gustafsson
> On 28 Sep 2021, at 15:48, Magnus Hagander  wrote:

> Might be worth noting in one of the comments the difference in
> behaviour if the backend process/thread *crashes* -- that is, on Unix
> it will be detected via the signal handler and on Windows the whole
> process including the main thread will die, so there is nothing to
> detect.

Good point, done.

> Other places in the code just refers to the background process as "the
> background process".  The term "WAL receiver" is new from this patch.
> While I agree it's in many ways a better term, I think (1) we should
> try to be consistent, and (2) maybe use a different term than "WAL
> receiver" specifically because we have a backend component with that
> name.

Looking at the user-facing messaging we have before this patch, there is a bit
of variability:

On UNIX:

  pg_log_error("could not create pipe for background process: %m");
  pg_log_error("could not create background process: %m");
  pg_log_info("could not send command to background pipe: %m");
  pg_log_error("could not wait for child process: %m");

On Windows:

  pg_log_error("could not create background thread: %m");
  pg_log_error("could not get child thread exit status: %m");
  pg_log_error("could not wait for child thread: %m");
  pg_log_error("child thread exited with error %u", ..);

On Both:

  pg_log_info("starting background WAL receiver");
  pg_log_info("waiting for background process to finish streaming ...");

So there is one mention of a background WAL receiver already in there, but it's
pretty inconsistent as to what we call it.  For now I've changed the messaging
in this patch to say "background process", leaving the task of making this all
consistent to a follow-up patch.

The attached fixes the above as well as the typo mentioned off-list, and is
rebased on top of today's HEAD.

--
Daniel Gustafsson   https://vmware.com/



v4-0001-Quick-exit-on-log-stream-child-exit-in-pg_basebac.patch
Description: Binary data


Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Masahiko Sawada
On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila  wrote:
>
> On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada  
> wrote:
> >
> > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila  wrote:
> > >
> > > >
> > > > Then, if, we proceed in this direction,
> > > > the place to implement those stats
> > > > would be on the LogicalRepWorker struct, instead ?
> > > >
> > >
> > > Or, we can make existing stats persistent and then add these stats on
> > > top of it. Sawada-San, do you have any thoughts on this matter?
> >
> > I think that making existing stats including received_lsn and
> > last_msg_receipt_time persistent by using stats collector could cause
> > massive reporting messages. We can report these messages with a
> > certain interval to reduce the amount of messages but we will end up
> > seeing old stats on the view.
> >
>
> Can't we keep the current and new stats both in-memory and persist on
> disk? So, the persistent stats data will be used to fill the in-memory
> counters after restarting of workers, otherwise, we will always refer
> to in-memory values.

Interesting. Probably we can have apply workers and table sync workers
send their statistics to the stats collector at exit (before the stats
collector shuts down)? And the startup process will restore them at
restart?

>
> > Another idea could be to have a separate
> > view, say pg_stat_subscription_xact but I'm not sure it's a better
> > idea.
> >
>
> Yeah, that is another idea but I am afraid that having three different
> views for subscription stats will be too much.

Yeah, I have the same feeling.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Ranier Vilela
On Wed, Sep 29, 2021 at 08:12, Drouvot, Bertrand wrote:

> Hi,
> On 9/29/21 12:59 PM, Ranier Vilela wrote:
>
>
> On Wed, Sep 29, 2021 at 06:55, Drouvot, Bertrand <
> bdrou...@amazon.com> wrote:
>
>> I'm also inclined to #1.
>>
> I have a stupid question, why duplicate PushActiveSnapshot?
> Wouldn't one function be better?
>
> PushActiveSnapshot(Snapshot snap, int as_level);
>
> Sample calls:
> PushActiveSnapshot(GetTransactionSnapshot(),
> GetCurrentTransactionNestLevel());
> PushActiveSnapshot(queryDesc->snapshot, GetCurrentTransactionNestLevel());
> PushActiveSnapshot(GetTransactionSnapshot(), portal->createSubid);
>
> I would say because that could "break" existing extensions for example.
>
> Adding a new function prevents "updating" existing extensions making use
> of PushActiveSnapshot().
>
A valid argument, of course.
But extensions should also adapt to the core code.
Duplicating functions is very bad for maintenance and bloats the code
unnecessarily, IMHO.

regards,
Ranier Vilela


Re: Some thoughts about the TAP tests' wait_for_catchup()

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 3:47 AM Tom Lane  wrote:
>
> I noticed that some test scripts, instead of using wait_for_catchup
> to wait for replication catchup, use ad-hoc code like
>
> my $primary_lsn =
>   $primary->safe_psql('postgres', 'select pg_current_wal_lsn()');
> $standby->poll_query_until('postgres',
> qq{SELECT '$primary_lsn'::pg_lsn <= pg_last_wal_replay_lsn()})
>   or die "standby never caught up";
>
> This does not look much like what wait_for_catchup does, which
> typically ends up issuing queries like
>
> SELECT pg_current_wal_lsn() <= replay_lsn AND state = 'streaming'
> FROM pg_catalog.pg_stat_replication WHERE application_name = 'standby';
>
> It seems to me that for most purposes wait_for_catchup's approach is
> strictly worse, for two reasons:
>
> 1. It continually recomputes the primary's pg_current_wal_lsn().
> Thus, if anything is happening on the primary (e.g. autovacuum),
> we're moving the goalposts as to how much the standby is required
> to replay before we continue.  That slows down the tests, makes
> them less reproducible, and could help to mask actual bugs of the
> form this-hasn't-been-done-when-it-should-have.
>
> 2. It's querying the primary's view of the standby's state, which
> introduces a reporting delay.  This has exactly the same three
> problems as the other point.
>

I can't comment on all the use cases of wait_for_catchup() but I think
there are some use cases in logical replication where we need the
publisher to use wait_for_catchup after setting up the replication to
ensure that the walsender is started and in a proper state by checking its
state (which should be 'streaming'). That also implicitly checks whether
the walreceiver has responded to initial ping requests by sending the
replay location. The typical use is as below, where after setting up
initial replication we wait for the publisher to catch up and then check
whether the initial table sync is finished. There is no point in checking the
second until the first statement has completed.

subscription/t/001_rep_changes.pl
...
$node_publisher->wait_for_catchup('tap_sub');

# Also wait for initial table sync to finish
my $synced_query =
  "SELECT count(1) = 0 FROM pg_subscription_rel WHERE srsubstate NOT
IN ('r', 's');";
$node_subscriber->poll_query_until('postgres', $synced_query)
  or die "Timed out while waiting for subscriber to synchronize data";

I am not sure whether, in such tests, checking solely the subscriber's
wal_replay_lsn would be sufficient. So, I think there are cases,
especially in physical replication tests, where we can avoid using
wait_for_catchup, but I am not sure if we can completely eliminate it.

-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada  wrote:
>
> On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila  wrote:
> >
> > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila  
> > > wrote:
> > > >
> > > > >
> > > > > Then, if, we proceed in this direction,
> > > > > the place to implement those stats
> > > > > would be on the LogicalRepWorker struct, instead ?
> > > > >
> > > >
> > > > Or, we can make existing stats persistent and then add these stats on
> > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > >
> > > I think that making existing stats including received_lsn and
> > > last_msg_receipt_time persistent by using stats collector could cause
> > > massive reporting messages. We can report these messages with a
> > > certain interval to reduce the amount of messages but we will end up
> > > seeing old stats on the view.
> > >
> >
> > Can't we keep the current and new stats both in-memory and persist on
> > disk? So, the persistent stats data will be used to fill the in-memory
> > counters after restarting of workers, otherwise, we will always refer
> > to in-memory values.
>
> Interesting. Probably we can have apply workers and table sync workers
> send their statistics to the stats collector at exit (before the stats
> collector shutting down)? And the startup process will restore them at
> restart?
>

I think we do need to send them at exit, but we should probably also send
them at some regular interval to avoid losing all the stats
in a crash-restart situation.

-- 
With Regards,
Amit Kapila.




Re: On login trigger: take three

2021-09-29 Thread Teodor Sigaev

Hi!

Nice feature, but, sorry, I see some design problems in the suggested feature. AFAIK,
there are two use cases for this feature:

1 Permitting / prohibiting login for some users
2 Just logging the fact of a user's login

The suggested patch implements prohibition of login only by having the login trigger
fail, and it has at least two issues:
1 In the case of prohibiting login, there is no clean way to store information
about the unsuccessful login. Ok, it could be solved with the dblink module, but that
seems too scary.
2 For logging purposes, a failing trigger makes it impossible to log in. That could
be worked around by catching the error in the trigger function, but it's not so
obvious for a DBA.


Some other issues:
3 It's easy to create a security hole; see the attachment, where a non-privileged user
can close access to the database even for a superuser.
4 The feature is not applicable for logging unsuccessful logins with a wrong
username/password or ones not matched by pg_hba.conf. Oracle can handle such
cases. But I don't think this issue is a stopper for the patch.


Maybe, to solve these issues, we could introduce a return value for the trigger and/or
add something like IGNORE ERROR to the CREATE EVENT TRIGGER command.


On 15.09.2021 16:32, Greg Nancarrow wrote:

On Wed, Sep 8, 2021 at 10:56 PM Daniel Gustafsson  wrote:


I took a look at this, and while I like the proposed feature I think the patch
has a bit more work required.



Thanks for reviewing the patch.
I am not the original patch author (who no longer seems active) but
I've been contributing a bit and keeping the patch alive because I
think it's a worthwhile feature.



+
+-- 2) Initialize some user session data
+   CREATE TEMP TABLE session_storage (x float, y integer);
The example in the documentation use a mix of indentations, neither of which is
the 2-space indentation used elsewhere in the docs.



Fixed, using 2-space indentation.
(to be honest, the indentation seems inconsistent elsewhere in the
docs e.g. I'm seeing a nearby case of 2-space on 1st indent, 6-space
on 2nd indent)



+   /* Get the command tag. */
+   tag = parsetree ? CreateCommandTag(parsetree) : CMDTAG_CONNECT;
This is hardcoding knowledge that currently only this trigger wont have a
parsetree, and setting the tag accordingly.  This should instead check the
event and set CMDTAG_UNKNOWN if it isn't the expected one.



I updated that, but maybe not exactly how you expected?



+   /* database has on-login triggers */
+   booldathaslogontriggers;
This patch uses three different names for the same thing: client connection,
logontrigger and login trigger.  Since this trigger only fires after successful
authentication it’s not accurate to name it client connection, as that implies
it running on connections rather than logins.  The nomenclature used in the
tree is "login" so the patch should be adjusted everywhere to use that.



Fixed.



+/* Hook for plugins to get control at start of session */
+client_connection_hook_type client_connection_hook = EventTriggerOnConnect;
I don't see the reason for adding core functionality by hooks.  Having a hook
might be a useful thing on its own (to be discussed in a new thread, not hidden
here), but core functionality should not depend on it IMO.



Fair enough, I removed the hook.



+   {"enable_client_connection_trigger", PGC_SU_BACKEND, DEVELOPER_OPTIONS,
+   gettext_noop("Enables the client_connection event trigger."),
+   gettext_noop("In case of errors in the ON client_connection EVENT 
TRIGGER procedure, "
  ..and..
+   /*
+* Try to ignore error for superuser to make it possible to login even 
in case of errors
+* during trigger execution
+*/
+   if (!is_superuser)
+   PG_RE_THROW();
This patch adds two ways for superusers to bypass this event trigger in case of
it being faulty, but for every other event trigger we've documented to restart
in single-user mode and fixing it there.  Why does this need to be different?
This clearly has a bigger chance of being a footgun but I don't see that as a
reason to add a GUC and a bypass that other footguns lack.



OK, I removed those bypasses. We'll see what others think.



+   elog(NOTICE, "client_connection trigger failed with message: %s", 
error->message);
Calling elog() in a PG_CATCH() block isn't allowed is it?



I believe it is allowed (e.g. there's a case in libpq), but I removed
this anyway as part of the superuser bypass.



+   /* Runtlist is empty: clear dathaslogontriggers flag
+*/
s/Runtlist/Runlist/ and also commenting style.



Fixed.



The logic for resetting the pg_database flag when firing event trigger in case
the trigger has gone away is messy and a problem waiting to happen IMO.  For
normal triggers we don't bother with that on the grounds of it being racy, and
instead perform relhastrigger removal during VACUUM.  Another approach is using
a counter as propose upthread, since updating that can 

RE: Added schema level support for publication.

2021-09-29 Thread houzj.f...@fujitsu.com
On Wed, Sep 29, 2021 5:14 PM Amit Kapila  wrote:
> On Wed, Sep 29, 2021 at 11:59 AM houzj.f...@fujitsu.com
>  wrote:
> >
> > On Wed, Sep 29, 2021 12:34 PM Amit Kapila 
> > > On Wed, Sep 29, 2021 at 9:07 AM houzj.f...@fujitsu.com
> > >  wrote:
> > > >
> > > > On Tues, Sep 28, 2021 10:46 PM vignesh C 
> wrote:
> > > > > Attached v34 patch has the changes for the same.
> > > > How about we check this case like the following ?
> > > >
> > > > List   *schemaPubids = GetSchemaPublications(nspOid);
> > > > List   *relPubids = GetRelationPublications(RelationGetRelid(rel));
> > > > if (list_intersection(schemaPubids, relPubids))
> > > > ereport(ERROR, ...
> > > >
> > >
> > > Won't this will allow changing one of the partitions for which only
> > > partitioned table is part of the target schema?
> >
> > I think it still disallow changing partition's schema to the published one.
> > I tested with the following SQLs.
> > -
> > create schema sch1;
> > create schema sch2;
> > create schema sch3;
> >
> > create table sch1.tbl1 (a int) partition by range ( a ); create table
> > sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to (101);
> > create table sch3.tbl1_part2 partition of sch1.tbl1 for values from
> > (101) to (200); create publication pub for ALL TABLES IN schema sch1,
> > TABLE sch2.tbl1_part1; alter table sch2.tbl1_part1 set schema sch1;
> > ---* It will report an error here *
> > -
> >
> 
> Use all steps before "create publication" and then try below. These will give 
> an
> error with the patch proposed but if I change it to what you are proposing 
> then
> it won't give an error.
> create publication pub for ALL TABLES IN schema sch2, Table sch1.tbl1; alter
> table sch3.tbl1_part2 set schema sch2;
> 
> But now again thinking about it, I am not sure if we really want to give 
> error in
> this case. What do you think?

Personally, I think we can allow the above case.

Because if the user specifies the partitioned table in the publication like above,
they cannot drop the partition separately. And the partitioned table is the
actual one in pg_publication_rel. So, I think allowing this case won't
make people feel confused.

Besides, the current patch already allows a similar case in CREATE/ALTER
PUBLICATION. In the SQL "create publication pub for ALL TABLES IN
schema sch2, Table sch1.tbl1;", one of the partitions, 'sch2.tbl1_part1', is
from schema 'sch2', which is published. It might be better to make the behavior
consistent.

> Also, if we use list_intersection trick, then how will
> we tell the publication due to which this problem has occurred, or do you 
> think
> we should leave that as an exercise for the user?

I thought list_intersection would return a list of puboids in which each puboid
exists in both input lists. We can choose the first puboid and output it in the
error message, which seems the same as the current patch.

But I noticed list_intersection doesn't support T_OidList, so we might need to
search for the puboid manually. Like:

foreach(cell, relPubids)
{
	if (list_member_oid(schemaPubids, lfirst_oid(cell)))
		ereport(ERROR,
				errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				errmsg("cannot move table \"%s\" to schema \"%s\"",
					   RelationGetRelationName(rel), stmt->newschema),
				errdetail("Altering table will result in having schema \"%s\" and same schema's table \"%s\" in the publication \"%s\" which is not supported.",
						  stmt->newschema,
						  RelationGetRelationName(rel),
						  get_publication_name(lfirst_oid(cell), false)));
}

Best regards,
Hou zj


Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-28, Tom Lane wrote:

> 1. Provide a variant of PushActiveSnapshot that allows the caller
> to specify the as_level to use, and then have
> EnsurePortalSnapshotExists, as well as other places that create
> Portal-associated snapshots, use this with as_level equal to the
> Portal's createSubid.  The main hazard I see here is that this could
> theoretically allow the ActiveSnapshot stack to end up with
> out-of-order as_level values, causing issues.  For the moment we
> could probably just add assertions to check that the passed as_level
> is >= next level, or maybe even that this variant is only called with
> empty ActiveSnapshot stack.

I don't see anything wrong with this idea offhand.

I didn't try to create scenarios with out-of-order as_level active
snapshots, but with all the creativity out there in the world, and
considering that it's possible to write procedures in C, I think that
asserting that the order is maintained is warranted.

Now if we do meet a case with out-of-order levels, what do we do?  I
suppose we'd need to update AtSubCommit_Snapshot and AtSubAbort_Snapshot
to cope with that (but by all means let's wait until we have a test case
where that happens ...)

> 2. Provide a way for AtSubAbort_Portals to detect whether a
> portalSnapshot pointer points to a snap that's going to be
> deleted by AtSubAbort_Snapshot, and then just have it clear any
> portalSnapshots that are about to become dangling.  (This'd amount
> to accepting the possibility that portalSnapshot is of a different
> subxact level from the portal, and dealing with the situation.)
> 
> I initially thought #2 would be the way to go, but it turns out
> to be a bit messy since what we have is a Snapshot pointer not an
> ActiveSnapshotElt pointer.  We'd have to do something like search the
> ActiveSnapshot stack looking for pointer equality to the caller's
> Snapshot pointer, which seems fragile --- do we assume as_snap is
> unique for any other purpose?

I don't remember what patch it was, but I remember contemplating
sometime during the past year the possibility of snapshots being used
twice in the active stack.  Now maybe this is not possible in practice
because in most cases we create a copy, but I couldn't swear that that's
always the case.  I wouldn't rely on that.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Las cosas son buenas o malas segun las hace nuestra opinión" (Lisias)




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-29, Ranier Vilela wrote:

> On Wed, Sep 29, 2021 at 08:12, Drouvot, Bertrand wrote:

> > Adding a new function prevents "updating" existing extensions making use
> > of PushActiveSnapshot().
> >
> Valid argument of course.
> But the extensions should also fit the core code.
> Duplicating functions is very bad for maintenance and bloats the code
> unnecessarily, IMHO.

Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
updated in the patch.  Given that six sevenths of the calls continue to
use the existing function and that it is less verbose than the new one,
that seems sufficient argument to keep it.
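
(To illustrate, the new variant can be layered underneath so that existing
callers keep working unchanged; the function name below is only an assumption,
not something settled in this thread.)

extern void PushActiveSnapshotWithLevel(Snapshot snap, int snap_level);

void
PushActiveSnapshot(Snapshot snap)
{
	/* existing entry point stays as a thin wrapper (sketch only) */
	PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}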

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."(Robert Davidson)
   http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php




Re: pgbench bug candidate: negative "initial connection time"

2021-09-29 Thread Fujii Masao




On 2021/09/24 11:26, Fujii Masao wrote:



On 2021/09/24 7:26, Yugo NAGATA wrote:

That makes sense. Failures of setup connection or initial connection doesn't
seem 'static problems'. I rewrote this description to explain exit status 1
indicates also internal errors and early errors.

   Exit status 1 indicates static problems such as invalid command-line options
   or internal errors which are supposed to never occur.  Early errors that occur
   when starting benchmark such as initial connection failures also exit with
   status 1.


LGTM. Barring any objection, I will commit the patch.


I extracted two changes from the patch and pushed (also back-patched) them.

What remains are the changes to the handling of initial connection and
logfile open failures. I agree to push them at least for master.
But I'm not sure if they should be back-patched. Without these changes,
even when those failures happen, pgbench proceeds with the benchmark and
reports the result. But with the changes, pgbench exits immediately in
that case. I'm not sure if there are people who expect the current behavior,
but if there are, maybe we should not change it, at least in the stable branches.
Thoughts?

BTW, when the logfile fails to be opened, pgbench gets stuck due to commit
aeb57af8e6. So even if we decide not to back-patch those changes,
we should improve the handling of logfile open failure, to fix that issue.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Column Filtering in Logical Replication

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-28, Amit Kapila wrote:

> But it is not allowed to create schema or table with the name
> CURRENT_SCHEMA, so not sure if we need to do anything special for it.

Oh?  You certainly can.

alvherre=# create schema "CURRENT_SCHEMA";
CREATE SCHEMA
alvherre=# \dn
        List of schemas
      Name      |       Owner
----------------+-------------------
 CURRENT_SCHEMA | alvherre
 public         | pg_database_owner
 temp           | alvherre
(3 rows)

alvherre=# create table "CURRENT_SCHEMA"."CURRENT_SCHEMA" ("bother amit for a while" int);
CREATE TABLE
alvherre=# \d "CURRENT_SCHEMA".*
             Table "CURRENT_SCHEMA.CURRENT_SCHEMA"
         Column          |  Type   | Collation | Nullable | Default
-------------------------+---------+-----------+----------+---------
 bother amit for a while | integer |           |          |


> Now, during post-processing, the PUBLICATIONOBJ_CONTINUATION will be
> distinguished as CURRENT_SCHEMA because both rangeVar and name will be
> NULL. Do you have other ideas to deal with it?

That sounds plausible.  There's no need for a name-free object of any other
kind AFAICS, so there should be no conflict.  If we ever do find a
conflict, we can add another struct member to disambiguate.

Thanks

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"Doing what he did amounts to sticking his fingers under the hood of the
implementation; if he gets his fingers burnt, it's his problem."  (Tom Lane)




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Nitin Jadhav
> It is common to mention what the default is as part of the
> documentation of a GUC. I don't see why this one should be an
> exception, especially since not mentioning it reduces the length of
> the documentation by exactly one word.
>
> I don't mind the sentence you added at the end to clarify the default
> units, but the way you've rewritten the first sentence makes it, in my
> opinion, much less clear.

Ok. I have kept your documentation as it is and added the sentence at
the end to clarify the default units.

> v9 was posted on August 3rd. I told you that it wasn't working on
> September 23rd. You posted a new version that still does not work on
> September 27th. I think you should have tested each version of your
> patch before posting it, and especially after any major refactorings.
> And if for whatever reason you didn't, then certainly after I told you
> on September 23rd that it didn't work, I think you should have fixed
> it before posting a new version.

Sorry. There was a misunderstanding about this. For the patch
shared on September 27th, I had tested the value '0' and observed
that no progress messages were getting logged, probably because the time at
which 'enable_timeout_at' gets called is already past the 'next_timeout'
value. This behaviour is completely dependent on the system. I have now added
an extra condition to disable the feature for the '0' setting.
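
(Roughly, the extra condition amounts to something like the following sketch;
it is not the exact patch code.)

/* treat an interval of 0 as "feature disabled": never arm the timer */
if (log_startup_progress_interval == 0)
	return;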

> I think this comment can be worded better.  It says it "decides", but it
> doesn't actually decide on any action to take -- it just reports whether
> the timer expired or not, to allow its caller to make the decision.  In
> such situations we just say something like "Report whether startup
> progress has caused a timeout, return true and rearm the timer if it
> did, or just return false otherwise"; and we don't indicate what the
> value is going to be used *for*.  Then the caller can use the boolean
> return value to make a decision, such as whether something is going to
> be logged.  This function can be oblivious to details such as this:
>
> here we can just say "No timeout has occurred" and make no inference
> about what's going to happen or not happen.

Modified the comment.

> Also, for functions that do things like this we typically use English
> sentence structure with the function name starting with the verb --
> perhaps has_startup_progress_timeout_expired().  Sometimes we are lax
> about this if we have some sort of poor-man's modularisation by using a
> common prefix for several functions doing related things, which perhaps
> could be "startup_progress_*" in your case, but your other functions are
> already not doing that (such as begin_startup_progress_phase) so it's
> not clear why you would not use the most natural name for this one.

I agree that has_startup_progress_timeout_expired() is better than the
existing function name. So I changed the function name accordingly.

> Please make sure to add ereport_startup_progress() as a translation
> trigger in src/backend/nls.mk.

I have added ereport_startup_progress() under the section
GETTEXT_TRIGGERS and GETTEXT_FLAGS in src/backend/nls.mk. Also
verified the messages in src/backend/po/postgres.pot file.

Kindly let me know if I have missed anything.

Thanks & Regards,
Nitin Jadhav

On Tue, Sep 28, 2021 at 8:29 PM Robert Haas  wrote:
>
> On Tue, Sep 28, 2021 at 8:06 AM Nitin Jadhav
>  wrote:
> > I thought mentioning the unit in milliseconds and the example in
> > seconds would confuse the user, so I changed the example to
> > milliseconds.The message behind the above description looks good to me
> > however I feel some sentences can be explained in less words. The
> > information related to the units is missing and I feel it should be
> > mentioned in the document. The example says, if the setting has the
> > default value of 10 seconds, then it explains the behaviour. I feel it
> > may not be the default value, it can be any value set by the user. So
> > mentioning 'default' in the example does not look good to me. I feel
> > we just have to mention "if this setting is set to the value of 10
> > seconds". Below is the modified version of the above information.
>
> It is common to mention what the default is as part of the
> documentation of a GUC. I don't see why this one should be an
> exception, especially since not mentioning it reduces the length of
> the documentation by exactly one word.
>
> I don't mind the sentence you added at the end to clarify the default
> units, but the way you've rewritten the first sentence makes it, in my
> opinion, much less clear.
>
> > I had added additional code to check the value of the
> > 'log_startup_progress_interval' variable and  disable the feature in
> > case of -1 in the earlier versions of the patch (Specifically
> > v9.patch). There was a review comment for v9 patch and it resulted in
> > major refactoring of the patch.
> ...
> > The answer for the question of "I don't understand why you posted the
> > previous version of the pa

Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-24, Alvaro Herrera wrote:

> Here's the set for all branches, which I think are really final, in case
> somebody wants to play and reproduce their respective problem scenarios.
> Nathan already confirmed that his reproducer no longer shows a problem,
> and this version shouldn't affect that.

Pushed.  Watching for buildfarm fireworks now.

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-24, Hannu Krosing wrote:

Hi Hannu

> In the first email you write
> 
> > As mentioned in the course of thread [1], we're missing a fix for
> streaming replication to avoid sending records that the primary hasn't
> fully flushed yet.  This patch is a first attempt at fixing that problem
> by retreating the LSN reported as FlushPtr whenever a segment is
> registered, based on the understanding that if no registration exists
> then the LogwrtResult.Flush pointer can be taken at face value; but if a
> registration exists, then we have to stream only till the start LSN of
> that registered entry.
> 
> So did we end up holding back the wal_sender to not send anything that
> is not confirmed as flushed on master

No.  We eventually realized that that approach was a dead end, so I
abandoned the whole thing and attacked the problem differently.  So your
other questions don't apply.  I tried to make the commit message explain
both the problem and the solution in as much detail as possible; please
have a look at that and let me know if something is unclear.

Thanks

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)




Re: pgsql: Document XLOG_INCLUDE_XID a little better

2021-09-29 Thread Robert Haas
On Tue, Sep 21, 2021 at 6:48 PM Alvaro Herrera  wrote:
> Document XLOG_INCLUDE_XID a little better
>
> I noticed that commit 0bead9af484c left this flag undocumented in
> XLogSetRecordFlags, which led me to discover that the flag doesn't
> actually do what the one comment on it said it does.  Improve the
> situation by adding some more comments.
>
> Backpatch to 14, where the aforementioned commit appears.

I'm not sure that saying something is a "hack" is really all that
useful as documentation.

But more to the point, I think this hack is ugly and needs to be
replaced with something less hacky.

Amit?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Some thoughts about the TAP tests' wait_for_catchup()

2021-09-29 Thread Tom Lane
Amit Kapila  writes:
> On Wed, Sep 29, 2021 at 3:47 AM Tom Lane  wrote:
>> It seems to me that for most purposes wait_for_catchup's approach is
>> strictly worse, for two reasons:
>> 1. It continually recomputes the primary's pg_current_wal_lsn().
>> 2. It's querying the primary's view of the standby's state, which
>> introduces a reporting delay.

> I can't comment on all the use cases of wait_for_catchup() but I think
> there are some use cases in logical replication where we need the
> publisher to use wait_for_catchup after setting up the replication to
> ensure that wal sender is started and in-proper state by checking its
> state (which should be 'streaming'). That also implicitly checks if
> the wal receiver has responded to initial ping requests by sending
> replay location.

Yeah, for logical replication we can't look at the subscriber's WAL
positions because they could be totally different.  What I'm on
about is the tests that use physical replication.  I think examining
the standby's state directly is better in that case, for the reasons
I cited.

I guess the question of interest is whether it's sufficient to test
the walreceiver feedback mechanisms in the context of logical
replication, or whether we feel that the physical-replication code
path is enough different that there should be a specific test for
that combination too.

regards, tom lane




Re: Patch to allow pg_filedump to support reading of pg_filenode.map

2021-09-29 Thread Christoph Berg
Re: Richard Yen
> Ah, thanks for the tip.  That's right -- I can't assume the user's input is
> a valid file.  Updated patch here.

Hi Richard,

sorry for the very late response here.

Thanks for the patch which I just merged, and thanks Justin for the
reviews!

Christoph




Re: Document spaces in .pgpass need to be escaped

2021-09-29 Thread Tom Lane
James Coleman  writes:
> A coworker has a space in a Postgres password and noticed .pgpass
> didn't work; escaping it fixed the issue. That requirement wasn't
> documented (despite other escaping requirements being documented), so
> I've attached a patch to add that comment.

I looked at passwordFromFile() and I don't see any indication that
it treats spaces specially.  Nor does a quick test here confirm
this report.  So I'm pretty certain that this proposed doc change
is wrong.  Perhaps there's some other issue to investigate, though?

regards, tom lane




Re: Patch to allow pg_filedump to support reading of pg_filenode.map

2021-09-29 Thread Richard Yen


> On Sep 29, 2021, at 9:01 AM, Christoph Berg  wrote:
> 
> Re: Richard Yen
>> Ah, thanks for the tip.  That's right -- I can't assume the user's input is
>> a valid file.  Updated patch here.
> 
> Hi Richard,
> 
> sorry for the very late response here.
> 
> Thanks for the patch which I just merged, and thanks Justin for the
> reviews!

Thank you! Was a great coding exercise :)

-Richard 




Re: Gather performance analysis

2021-09-29 Thread Robert Haas
On Tue, Sep 28, 2021 at 2:49 AM Dilip Kumar  wrote:
>              16k       64k      256k     1024k
> Head     1232.779    804.24  1134.723   901.257
> Patch    1371.493  1277.705   862.598   783.481
>
> So what I have noticed is that in most of the cases on head as well as
> with the patch, increasing the queue size make it faster, but with
> head suddenly for this particular combination of rows, column and
> thread the execution time is very low for 64k queue size and then
> again the execution time increased with 256k queue size and then
> follow the pattern.  So this particular dip in the execution time on
> the head looks a bit suspicious to me.  I mean how could we justify
> this sudden big dip in execution time w.r.t the other pattern.

Oh, interesting. So there's not really a performance regression here
so much as that one particular case ran exceptionally fast on the
unpatched code.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Alvaro Herrera
So, I've wondered about this part all along:

> +/*
> + * Calculates the timestamp at which the next timer should expire and enables
> + * the timer accordingly.
> + */
> +static void
> +reset_startup_progress_timeout(TimestampTz now)
> +{
> + TimestampTz next_timeout;
> +
> + next_timeout = TimestampTzPlusMilliseconds(scheduled_startup_progress_timeout,
> +                                            log_startup_progress_interval);
> + if (next_timeout < now)
> + {
> + /*
> +  * Either the timeout was processed so late that we missed an
> +  * entire cycle or system clock was set backwards.
> +  */
> + next_timeout = TimestampTzPlusMilliseconds(now, log_startup_progress_interval);
> + }
> +
> + enable_timeout_at(STARTUP_PROGRESS_TIMEOUT, next_timeout);

Why is it that we set the next timeout to fire not at "now + interval"
but at "when-it-should-have-fired-but-didn't + interval"?  As a user, if
I request a message to be logged every N milliseconds, and one
of them is a little bit delayed, then (assuming I set it to 10s) I still
expect the next one to occur at now+10s.  I don't expect the next at
"now+5s" if one is delayed 5s.

In other words, I think this function should just be
  enable_timeout_after(STARTUP_PROGRESS_TIMEOUT, log_startup_progress_interval);

This means you can remove the scheduled_startup_progress_timeout
variable.
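
(For illustration, the whole function could then reduce to something like this
sketch:)

static void
reset_startup_progress_timeout(void)
{
	enable_timeout_after(STARTUP_PROGRESS_TIMEOUT, log_startup_progress_interval);
}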

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Robert Haas
On Wed, Sep 29, 2021 at 10:08 AM Nitin Jadhav
 wrote:
> Sorry. There was a misunderstanding about this and for the patch
> shared on September 27th, I had tested for the value '0' and observed
> that no progress messages were getting logged, probably the time at
> which 'enable_timeout_at' is getting called is past the 'next_timeout'
> value. This behaviour is completely dependent on the system. Now added
> an extra condition to disable the feature in case of '0' setting.

Oh, interesting. I failed to consider that the behavior might vary
from one system to another.

I just noticed something else which I realize is the indirect result
of my own suggestion but which doesn't actually look all that nice.
You've now got a call to RegisterTimeout(STARTUP_PROGRESS_TIMEOUT,
...) in InitPostgres, guarded by ! IsPostmasterEnvironment, and then
another one in StartupProcessMain(). I think that's so that the
feature works in single-user mode, but that means that technically,
we're not reporting on the progress of the startup process. We're
reporting progress on the startup operations that are normally
performed by the startup process. But that means that the
documentation isn't quite accurate (because it mentions the startup
process specifically) and that the placement of the code in startup.c
is suspect (because that's specifically for the startup process) and
that basically every instance of the word "process" in the patch is
technically a little bit wrong. I'm not sure if that's a big enough
problem to be worth worrying about or exactly what we ought to do
about it, but it doesn't seem fantastic. At a minimum, I think we
should probably try to eliminate as many references to the startup
process as we can, and talk about startup progress or startup
operations or something like that.

+ * Start timestamp of the operation that occur during startup process.

This is not correct grammar - it would need to be "operations that
occur" or "operation that occurs". But neither seems particularly
clear about what the variable actually does. How about "Time at which
the most recent startup operation started"?

+ * Indicates the timestamp at which the timer was supposed to expire.

"Indicates" can be deleted, but also I think it would be better to
state the purpose of the timer i.e. "Timestamp at which the next
startup progress message should be logged."

+ enable_timeout_at(STARTUP_PROGRESS_TIMEOUT, next_timeout);
+ scheduled_startup_progress_timeout = next_timeout;
+ startup_progress_timer_expired = false;

I think you should set startup_progress_timer_expired to false before
calling enable_timeout_at. Otherwise there's a race condition. It's
unlikely that the timer could expire and the interrupt handler fire
before we reach startup_progress_timer_expired = false, but it seems
like there's no reason to take a chance.
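
(A sketch of the suggested ordering, using the patch's variable names; only
the relative order of the statements is the point here:)

/* Clear the flag before arming the timer so an early firing cannot be lost. */
startup_progress_timer_expired = false;
enable_timeout_at(STARTUP_PROGRESS_TIMEOUT, next_timeout);
scheduled_startup_progress_timeout = next_timeout;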

+ * Calculates the timestamp at which the next timer should expire and enables

So in some places you have verbs with an "s" on the end, like here,
and in other places without, like in the next example. In "telegraph
style" comments like this, this implicit subject is "it" or "this,"
but you don't write that. However you write the rest of the sentence
as if it were there. So this should say "calculate" and "enable"
rather than "calculates" and "enables".

+ * Schedule a wakeup call for logging the progress of startup process.

This isn't really an accurate description, I think. It's not
scheduling anything, and I don't know what a "wakeup call" is anyway.
"Set a flag indicating that it's time to log a progress report" might
be better.

+ * Sets the start timestamp of the current operation and also enables the

Set. enable.

+ * timeout for logging the progress of startup process.

I think you could delete "for logging the progress of startup process"
here; that seems clear enough, and this reads a bit awkwardly. If you
want to keep something like this I would write "...enable the timeout
so that the progress of the startup progress will be logged."

+ * the timer if it did, otheriwse return false.

otherwise

+ * Begin the startup progress phase to report the progress of syncing
+ * data directory (syncfs).

All of the comments that start with "Begin the startup progress phase"
should instead start with "Begin startup progress phase".

I think this could be condensed down to "Prepare to report progress
syncing the data directory via syncfs", and likewise "Prepare to
report progress of the pre-fsync phase", "Prepare to report progress
resetting unlogged relations," etc.

+ gettext_noop("Sets the time interval between each progress update "
+ "of the startup process."),

This is actually inaccurate. It's sort of the same problem I was
worried about with respect to the documentation: it's NOT the interval
between progress updates, because it applies separately to each
operation. We need to say something that makes that clear, or at least
that doesn't overtly give the opposite impression. It's hard to do that
briefly.

Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Robert Haas
On Wed, Sep 29, 2021 at 1:36 PM Alvaro Herrera  wrote:
> Why is it that we set the next timeout to fire not at "now + interval"
> but at "when-it-should-have-fired-but-didn't + interval"?  As a user, if
> I request a message to be logged every N milliseconds, and one
> of them is a little bit delayed, then (assuming I set it to 10s) I still
> expect the next one to occur at now+10s.  I don't expect the next at
> "now+5s" if one is delayed 5s.

Well, this was my suggestion, because if you don't do this, you get
drift, which I think looks weird. Like the timestamps will be:

13:41:05.012456
13:41:15.072484
13:41:25.149632

...and it gets further and further off as it goes on.

I guess my expectation is different from yours: I expect that if I ask
for a message every 10 seconds, the time between messages is going to
be 10s, at least on average, not 10s + however much latency the system
has.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Justin Pryzby
On Wed, Sep 29, 2021 at 02:36:14PM -0300, Alvaro Herrera wrote:
> Why is it that we set the next timeout to fire not at "now + interval"
> but at "when-it-should-have-fired-but-didn't + interval"?  As a user, if
> I request a message to be logged every N milliseconds, and one
> of them is a little bit delayed, then (assuming I set it to 10s) I still
> expect the next one to occur at now+10s.  I don't expect the next at
> "now+5s" if one is delayed 5s.
> 
> In other words, I think this function should just be
>   enable_timeout_after(STARTUP_PROGRESS_TIMEOUT, 
> log_startup_progress_interval);
> 
> This means you can remove the scheduled_startup_progress_timeout
> variable.

Robert requested the current behavior here.
https://www.postgresql.org/message-id/CA%2BTgmoYkS1ZeWdSMFMBecMNxWonHk6J5eoX4FEQrpKtvEbXqGQ%40mail.gmail.com

It's confusing (at least) to get these kind of requests to change the behavior
back and forth.

-- 
Justin




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-29, Robert Haas wrote:

> Well, this was my suggestion, because if you don't do this, you get
> drift, which I think looks weird. Like the timestamps will be:
> 
> 13:41:05.012456
> 13:41:15.072484
> 13:41:25.149632
> 
> ...and it gets further and further off as it goes on.'

Right ... I actually *expect* this drift to occur.  Maybe people
generally don't like this, it just seems natural to me.  Are there other
opinions on this aspect?

-- 
Álvaro Herrera   39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Nadie está tan esclavizado como el que se cree libre no siéndolo" (Goethe)




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-29, Justin Pryzby wrote:

> Robert requested the current behavior here.
> https://www.postgresql.org/message-id/CA%2BTgmoYkS1ZeWdSMFMBecMNxWonHk6J5eoX4FEQrpKtvEbXqGQ%40mail.gmail.com
> 
> It's confusing (at least) to get these kind of requests to change the behavior
> back and forth.

Well, I did scan the thread to see if this had been discussed, and I
overlooked that message.  But there was no reply to that message, so
it's not clear whether this was just Robert's opinion or consensus; in
fact we now have exactly two votes on it (mine and Robert's).

I think one person casting an opinion on one aspect does not set that
aspect in stone.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Sep-29, Ranier Vilela wrote:
>> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand 
>> escreveu:
>> Duplicating functions is very bad for maintenance and bloats the code
>> unnecessarily, IMHO.

> Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
> updated in the patch.  Given that six sevenths of the calls continue to
> use the existing function and that it is less verbose than the new one,
> that seems sufficient argument to keep it.

Seeing that we have to back-patch this, changing the ABI of
PushActiveSnapshot seems like a complete non-starter.

The idea I'd had to avoid code duplication was to make
PushActiveSnapshot a wrapper for the extended function:

void
PushActiveSnapshot(Snapshot snap)
{
    PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
}

This would add one function call to the common code path, but there
are enough function calls in PushActiveSnapshot that I don't think
that's a big concern.

Another point is that this'd also add the as_level ordering assertion
to the common code path, but on the whole I think that's good not bad.

BTW, this is not great code:

+   if (ActiveSnapshot != NULL && ActiveSnapshot->as_next != NULL)
+   Assert(as_level >= ActiveSnapshot->as_next->as_level);

You want it all wrapped in the Assert, so that there's not any code
left in a non-assert build (which the compiler may or may not optimize
away, perhaps after complaining about a side-effect-free statement).

Actually, it's plain wrong, because you should be looking at the
top as_level not the next one.  So more like

Assert(ActiveSnapshot == NULL ||
   snap_level >= ActiveSnapshot->as_level);

regards, tom lane




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Sep-29, Robert Haas wrote:
>> Well, this was my suggestion, because if you don't do this, you get
>> drift, which I think looks weird. Like the timestamps will be:
>> 
>> 13:41:05.012456
>> 13:41:15.072484
>> 13:41:25.149632
>> 
>> ...and it gets further and further off as it goes on.'

> Right ... I actually *expect* this drift to occur.  Maybe people
> generally don't like this, it just seems natural to me.  Are there other
> opinions on this aspect?

FWIW, I agree with Robert that it's nicer if the timeout doesn't drift.
There's a limit to how much complexity I'm willing to tolerate for that,
but it doesn't seem like this exceeds it.

The real comment I'd have here, though, is that writing one-off
code for this purpose is bad.  If we have a need for a repetitive
timeout, it'd be better to add the feature to timeout.c explicitly.
That would probably also remove the need for extra copies of the
timeout time.
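
(Purely as an illustration of what such a facility might look like -- the
name and signature below are invented for this sketch; nothing like this
exists in timeout.c today:)

/*
 * Hypothetical: arm a timeout that re-arms itself every delay_ms until
 * disabled, with timeout.c tracking the next scheduled expiry internally.
 */
extern void enable_timeout_every(TimeoutId id, TimestampTz fin_time, int delay_ms);

/* A caller could then simply do: */
enable_timeout_every(STARTUP_PROGRESS_TIMEOUT,
                     TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                                 log_startup_progress_interval),
                     log_startup_progress_interval);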

regards, tom lane




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Ranier Vilela
Em qua., 29 de set. de 2021 às 15:01, Tom Lane  escreveu:

> Alvaro Herrera  writes:
> > On 2021-Sep-29, Ranier Vilela wrote:
> >> Em qua., 29 de set. de 2021 às 08:12, Drouvot, Bertrand <
> bdrou...@amazon.com>
> >> escreveu:
> >> Duplicating functions is very bad for maintenance and bloats the code
> >> unnecessarily, IMHO.
>
> > Well, there are 42 calls of PushActiveSnapshot currently, and only 6 are
> > updated in the patch.  Given that six sevenths of the calls continue to
> > use the existing function and that it is less verbose than the new one,
> > that seems sufficient argument to keep it.
>
> Seeing that we have to back-patch this, changing the ABI of
> PushActiveSnapshot seems like a complete non-starter.
>
> The idea I'd had to avoid code duplication was to make
> PushActiveSnapshot a wrapper for the extended function:
>
> void
> PushActiveSnapshot(Snapshot snap)
> {
> PushActiveSnapshotWithLevel(snap, GetCurrentTransactionNestLevel());
> }

Much better.

regards,
Ranier Vilela


Enabling deduplication with system catalog indexes

2021-09-29 Thread Peter Geoghegan
System catalog indexes do not support deduplication as a matter of
policy. I chose to do things that way during the Postgres 13
development cycle due to the restriction on using storage parameters
with system catalog indexes. At the time I felt that *forcing* the use
of deduplication with system catalog indexes might expose users to
problems. But this is something that seems worth revisiting now. (I
haven't actually investigated what it would take to make system
catalogs support the 'deduplicate_items' parameter, but that may not
matter now.)

I would like to enable deduplication within system catalog indexes for
Postgres 15. Leaving it disabled forever seems kind of arbitrary at
best. In general enabling deduplication (or not disabling it) has only
a fixed, small downside in the worst case. It has a huge upside in
favorable cases. Deduplication is part of our high level strategy for
avoiding nbtree index bloat from version churn (non-HOT updates with
several indexes that are never "logically modified"). It effectively
cooperates with and enhances the new enhancements to index deletion in
Postgres 14. Plus these recent index deletion enhancements more or
less eliminated a theoretical downside of deduplication: now it
doesn't really matter that posting list tuples only have a single
LP_DEAD bit (if it ever did). This is because we can now do granular
posting list TID deletion, provided the deletion process visits the
same heap block in passing.

I can find no evidence that even one single user found it useful to
disable deduplication while using Postgres 13 in production (by
searching for "deduplicate_items" on Google). While I myself said that
there might be a regression of up to 2% of throughput back in early
2020, that was under highly unrealistic conditions, that could never
apply to system catalogs -- I was being conservative. Most system
catalog indexes are unique indexes, where there is no possible
overhead from deduplication unless we already know for sure that the
index is subject to some kind of version churn (and so have high
confidence that deduplication will be at least somewhat effective at
buying time for VACUUM). The non-unique system catalog indexes seem
pretty likely to benefit from deduplication in the usual obvious way
(not so much because of versioning and bloat). The two pg_depend
non-unique indexes tend to have a fair number of duplicates.

-- 
Peter Geoghegan




Re: Avoiding smgrimmedsync() during nbtree index builds

2021-09-29 Thread Melanie Plageman
On Mon, May 3, 2021 at 5:24 PM Melanie Plageman
 wrote:
> On Thu, Jan 21, 2021 at 5:51 PM Andres Freund  wrote:
> > On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > > On 21/01/2021 22:36, Andres Freund wrote:
> > > >
> > > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > > requests, the potential problems that I can see are:
> > > >
> > > > 1) redo point falls between the log_newpage() and the
> > > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > > 2) redo point falls between write() and register_dirty_segment()
> > > >
> > > > But both of these are fine in the context of initially filling a newly
> > > > created relfilenode, as far as I can tell? Otherwise the current
> > > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> > >
> > > Hmm. If the redo point falls between write() and the
> > > register_dirty_segment(), and the checkpointer finishes the whole 
> > > checkpoint
> > > before register_dirty_segment(), you are not safe. That can't happen with
> > > write from the buffer manager, because the checkpointer would block 
> > > waiting
> > > for the flush of the buffer to finish.
> >
> > Hm, right.
> >
> > The easiest way to address that race would be to just record the redo
> > pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> > checkpoint started since the start of the index build.
> >
> > Another approach would be to utilize PGPROC.delayChkpt, but I would
> > rather not unnecessarily expand the use of that.
> >
> > It's kind of interesting - in my aio branch I moved the
> > register_dirty_segment() to before the actual asynchronous write (due to
> > availability of the necessary data), which ought to be safe because of
> > the buffer interlocking. But that doesn't work here, or for other places
> > doing writes without going through s_b.  It'd be great if we could come
> > up with a general solution, but I don't immediately see anything great.
> >
> > The best I can come up with is adding helper functions to wrap some of
> > the complexity for "unbuffered" writes of doing an immedsync iff the
> > redo pointer changed. Something very roughly like
> >
> > typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} 
> > UnbufferedWriteState;
> > void unbuffered_prep(UnbufferedWriteState* state);
> > void unbuffered_write(UnbufferedWriteState* state, ...);
> > void unbuffered_extend(UnbufferedWriteState* state, ...);
> > void unbuffered_finish(UnbufferedWriteState* state);
> >
> > which wouldn't just do the dance to avoid the immedsync() if possible,
> > but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> > if we get that [1]).
> >
>
> Regarding the implementation, I think having an API to do these
> "unbuffered" or "direct" writes outside of shared buffers is a good
> idea. In this specific case, the proposed API would not change the code
> much. I would just wrap the small diffs I added to the beginning and end
> of _bt_load() in the API calls for unbuffered_prep() and
> unbuffered_finish() and then tuck away the second half of
> _bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
> would do so after ensuring the correctness of the logic in this patch.
> Then I will work on a patch which implements the unbuffered_write() API
> and demonstrates its utility with at least a few of the most compelling
> use cases in the code.
>

I've taken a pass at writing the API for "direct" or "unbuffered" writes
and extends. It introduces the suggested functions: unbuffered_prep(),
unbuffered_finish(), unbuffered_write(), and unbuffered_extend().

This is a rough cut -- corrections welcome and encouraged!

unbuffered_prep() saves the xlog redo pointer at the time it is called.
Then, if the redo pointer hasn't changed by the time unbuffered_finish()
is called, the backend can avoid calling smgrimmedsync(). Note that this
only works if intervening calls to smgrwrite() and smgrextend() pass
skipFsync=False.
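
(To make the intended call pattern concrete, a rough usage sketch; the
unbuffered_* names come from the proposal quoted above, while the argument
lists and everything else here are assumptions rather than the actual patch:)

static void
example_unbuffered_fill(SMgrRelation smgr, Page *pages, BlockNumber nblocks)
{
    UnbufferedWriteState wstate;
    BlockNumber blkno;

    unbuffered_prep(&wstate);       /* remembers the current redo pointer */

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        /* the argument list is a guess; the proposal leaves it as "..." */
        unbuffered_extend(&wstate, smgr, MAIN_FORKNUM, blkno, pages[blkno]);
    }

    /* does smgrimmedsync() only if a checkpoint started since unbuffered_prep() */
    unbuffered_finish(&wstate);
}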

unbuffered_write() and unbuffered_extend() might be able to be used even
if unbuffered_prep() and unbuffered_finish() are not used -- for example
hash indexes do something I don't entirely understand in which they call
smgrextend() directly when allocating buckets but then initialize the
new bucket pages using the bufmgr machinery.

I also intend to move accounting of pages written and extended into the
unbuffered_write() and unbuffered_extend() functions using the functions
I propose in [1] to populate a new view, pg_stat_buffers. Then this
"unbuffered" IO would be counted in stats.

I picked the name "direct" for the directory in /src/backend/storage
because I thought that these functions are analogous to direct IO in
Linux -- in that they are doing IO without going through Postgres bufmgr
-- unPGbuffered, basically. Other suggestions were "raw" and "relIO".
Raw seemed confusing since raw device IO is pretty far from what is
happeni

Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Robert Haas
On Wed, Sep 29, 2021 at 1:52 PM Alvaro Herrera  wrote:
> I think one person casting an opinion on one aspect does not set that
> aspect in stone.

Of course not. I was just explaining that how the patch ended up like it did.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




jsonb crash

2021-09-29 Thread Jaime Casanova
Hi,

I found a crash (segmentation fault) on jsonb.
This is the best I could do to reduce the query:

"""
select  
  75 as c1
from 
  public.pagg_tab_ml as ref_0,
  lateral (select  
ref_0.a as c5 
  from generate_series(1, 300) as sample_0
  fetch first 78 rows only
  ) as subq_0
where case when (subq_0.c5 < 2) 
   then cast(null as jsonb) 
   else cast(null as jsonb) 
  end ? ref_0.c
"""

And because it needs pagg_tab_ml it should be run on a regression database.
This affects at least 14 and 15.

Attached is the backtrace.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
#0  0x556895e8b402 in getJsonbOffset (jc=0x5568971c2dd4, index=) at jsonb_util.c:146
offset = 0
i = 3158063
#1  0x556895e8ba44 in JsonbIteratorNext (it=it@entry=0x7fff141bac68, 
val=val@entry=0x7fff141bac40, skipNested=skipNested@entry=false) at 
jsonb_util.c:926
__func__ = "JsonbIteratorNext"
#2  0x556895e89f47 in jsonb_hash (fcinfo=0x7fff141baca0) at jsonb_op.c:274
jb = 0x5568971c2dd0
it = 0x5568971c5958
v = {type = jbvObject, val = {numeric = 0x556800303030, boolean = 48, 
string = {len = 3158064, val = 0x7fff141bac70 ""}, array = {nElems = 3158064, 
elems = 0x7fff141bac70, rawScalar = 136}, object = {
  nPairs = 3158064, pairs = 0x7fff141bac70}, binary = {len = 
3158064, data = 0x7fff141bac70}, datetime = {value = 93905168117808, typid = 
337357936, typmod = 32767, tz = -1782147448}}}
r = 
hash = 0
__func__ = "jsonb_hash"
#3  0x556895f56bba in FunctionCall1Coll 
(flinfo=flinfo@entry=0x5568971c2d38, collation=, arg1=) at fmgr.c:1138
fcinfodata = {fcinfo = {flinfo = 0x5568971c2d38, context = 0x0, 
resultinfo = 0x0, fncollation = 100, isnull = false, nargs = 1, args = 
0x7fff141bacc0}, 
  fcinfo_data = "8-\034\227hU", '\000' , 
"d\000\000\000\000\000\001\000x\277A\213\321\177\000\000\000'\033\227hU\000"}
fcinfo = 0x7fff141baca0
result = 
__func__ = "FunctionCall1Coll"
#4  0x556895c8830e in MemoizeHash_hash (tb=tb@entry=0x5568971c9490, 
key=key@entry=0x0) at nodeMemoize.c:175
hkey = 
i = 0
mstate = 
pslot = 0x5568971c2ca0
hashkey = 0
numkeys = 2
hashfunctions = 0x5568971c2d38
collations = 0x5568971a10e0
#5  0x556895c88b4f in memoize_insert (found=0x7fff141badbf, key=0x0, 
tb=0x5568971c9490) at ../../../src/include/lib/simplehash.h:758
hash = 
hash = 
#6  cache_lookup (mstate=mstate@entry=0x55689718c488, 
found=found@entry=0x7fff141badbf) at nodeMemoize.c:423
key = 
entry = 
oldcontext = 
#7  0x556895c89893 in ExecMemoize (pstate=0x55689718c488) at 
nodeMemoize.c:609
entry = 
outerslot = 
found = false
node = 0x55689718c488
outerNode = 
slot = 
__func__ = "ExecMemoize"
#8  0x556895c62dc5 in ExecProcNodeFirst (node=0x55689718c488) at 
execProcnode.c:463
No locals.
#9  0x556895c918f1 in ExecProcNode (node=0x55689718c488) at 
../../../src/include/executor/executor.h:257
No locals.
#10 ExecNestLoop (pstate=0x55689718aba8) at nodeNestloop.c:160
node = 0x55689718aba8
nl = 0x5568971a1158
innerPlan = 0x55689718c488
outerPlan = 0x55689718adb0
outerTupleSlot = 
innerTupleSlot = 
joinqual = 0x0
otherqual = 0x0
econtext = 0x55689718acc0
lc = 
#11 0x556895c62dc5 in ExecProcNodeFirst (node=0x55689718aba8) at 
execProcnode.c:463
No locals.
#12 0x556895c5af50 in ExecProcNode (node=0x55689718aba8) at 
../../../src/include/executor/executor.h:257
No locals.
#13 ExecutePlan (estate=estate@entry=0x55689718a8a0, planstate=0x55689718aba8, 
use_parallel_mode=, operation=operation@entry=CMD_SELECT, 
sendTuples=sendTuples@entry=true, 
numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, 
dest=0x5568971a2938, execute_once=true) at execMain.c:1551
slot = 
current_tuple_count = 0
#14 0x556895c5bbf1 in standard_ExecutorRun (queryDesc=0x5568971772f0, 
direction=ForwardScanDirection, count=0, execute_once=) at 
execMain.c:361
estate = 0x55689718a8a0
operation = CMD_SELECT
dest = 0x5568971a2938
sendTuples = true
oldcontext = 0x5568971771d0
__func__ = "standard_ExecutorRun"
#15 0x556895c5bcba in ExecutorRun 
(queryDesc=queryDesc@entry=0x5568971772f0, 
direction=direction@entry=ForwardScanDirection, count=count@entry=0, 
execute_once=) at execMain.c:305
No locals.
#16 0x556895e17fe9 in PortalRunSelect (portal=portal@entry=0x5568971138c0, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x5568971a2938) at pquery.c:919
queryDesc = 0x5568971772f0
direction = 
nprocessed = 
__func__ = "PortalRunSelect"
#17 0x556895e19a27 in Po

Re: jsonb crash

2021-09-29 Thread Ranier Vilela
Em qua., 29 de set. de 2021 às 15:55, Jaime Casanova <
jcasa...@systemguards.com.ec> escreveu:

> Hi,
>
> I found a crash (segmentation fault) on jsonb.
> This is the best I could do to reduce the query:
>
> """
> select
>   75 as c1
> from
>   public.pagg_tab_ml as ref_0,
>   lateral (select
> ref_0.a as c5
>   from generate_series(1, 300) as sample_0
>   fetch first 78 rows only
>   ) as subq_0
> where case when (subq_0.c5 < 2)
>then cast(null as jsonb)
>else cast(null as jsonb)
>   end ? ref_0.c
> """
>
> And because it needs pagg_tab_ml it should be run a regression database.
> This affects at least 14 and 15.
>
> Attached is the backtrace.
>
Yeah, Coverity has a report about this at function:

JsonbValue *
pushJsonbValue(JsonbParseState **pstate, JsonbIteratorToken seq,
   JsonbValue *jbval)

1. CID undefined: Dereference after null check (FORWARD_NULL)
return pushJsonbValueScalar(pstate, seq, jbval);

2. CID undefined (#1 of 1): Dereference after null check (FORWARD_NULL)
16. var_deref_model:
Passing pstate to pushJsonbValueScalar, which dereferences null *pstate

res = pushJsonbValueScalar(pstate, tok,
                           tok < WJB_BEGIN_ARRAY ||
                           (tok == WJB_BEGIN_ARRAY &&
                            v.val.array.rawScalar) ? &v : NULL);

regards,
Ranier Vilela


Re: Empty string in lexeme for tsvector

2021-09-29 Thread Tom Lane
Jean-Christophe Arnu  writes:
> [ empty_string_in_tsvector_v4.patch ]

I looked through this patch a bit.  I don't agree with adding
these new error conditions to tsvector_setweight_by_filter and
tsvector_delete_arr.  Those don't prevent bad lexemes from being
added to tsvectors, so AFAICS they can have no effect other than
breaking existing applications.  In fact, tsvector_delete_arr is
one thing you could use to fix existing bad tsvectors, so making
it throw an error seems actually counterproductive.

(By the same token, I think there's a good argument for
tsvector_delete_arr to just ignore nulls, not throw an error.
That's a somewhat orthogonal issue, though.)

What I'm wondering about more than that is whether array_to_tsvector
is the only place that can inject an empty lexeme ... don't we have
anything else that can add lexemes without going through the parser?
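
(For context, a minimal sketch of the sort of check being discussed for
array_to_tsvector; the variable names follow the existing code there, but
this is illustrative rather than the proposed patch itself:)

/* Reject empty lexemes while scanning the input array elements. */
for (i = 0; i < nitems; i++)
{
    if (nulls[i])
        ereport(ERROR,
                (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                 errmsg("lexeme array may not contain nulls")));
    if (VARSIZE(dlexemes[i]) - VARHDRSZ == 0)
        ereport(ERROR,
                (errcode(ERRCODE_ZERO_LENGTH_CHARACTER_STRING),
                 errmsg("lexeme array may not contain empty strings")));
}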

regards, tom lane




Re: jsonb crash

2021-09-29 Thread Tom Lane
Jaime Casanova  writes:
> I found a crash (segmentation fault) on jsonb.
> This is the best I could do to reduce the query:

> """
> select  
>   75 as c1
> from 
>   public.pagg_tab_ml as ref_0,
>   lateral (select  
> ref_0.a as c5 
>   from generate_series(1, 300) as sample_0
>   fetch first 78 rows only
>   ) as subq_0
> where case when (subq_0.c5 < 2) 
>then cast(null as jsonb) 
>  else cast(null as jsonb) 
>   end ? ref_0.c
> """

I think this must be a memoize bug.  AFAICS, nowhere in this query
can we be processing a non-null JSONB value, so what are we doing
in jsonb_hash?  Something down-stack must have lost the information
that the Datum is actually null.

regards, tom lane




Re: [BUG] failed assertion in EnsurePortalSnapshotExists()

2021-09-29 Thread Tom Lane
"Drouvot, Bertrand"  writes:
> But make check is now failing on join_hash.sql, I have been able to 
> repro with:

Oh, duh, should have thought a bit harder.  createSubid is a sequential
subtransaction number; it's not the same as the as_level nesting level.

Probably the most effective way to handle this is to add a subtransaction
nesting-level field to struct Portal, so we can pass that.  I don't recall
that xact.c provides any easy way to extract the nesting level of a
subtransaction that's not the most closely nested one.
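
(A rough sketch of the idea; the createLevel field is invented for
illustration, and PushActiveSnapshotWithLevel is the function proposed
earlier in this thread -- this is not the committed fix:)

typedef struct PortalData
{
    /* ... existing fields ... */
    SubTransactionId createSubid;   /* existing: creating subxact's ID */
    int              createLevel;   /* hypothetical: its nesting level */
    /* ... */
} PortalData;

/* at portal creation time */
portal->createSubid = GetCurrentSubTransactionId();
portal->createLevel = GetCurrentTransactionNestLevel();

/* later, when pushing the portal's snapshot */
PushActiveSnapshotWithLevel(snapshot, portal->createLevel);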

regards, tom lane




Re: jsonb crash

2021-09-29 Thread Tom Lane
I wrote:
> I think this must be a memoize bug.  AFAICS, nowhere in this query
> can we be processing a non-null JSONB value, so what are we doing
> in jsonb_hash?  Something down-stack must have lost the information
> that the Datum is actually null.

After further inspection, "what are we doing in jsonb_hash?" is
indeed a relevant question, but it seems like it's a type mismatch
not a nullness issue.  EXPLAIN VERBOSE shows

   ->  Memoize  (cost=0.01..1.96 rows=1 width=4)
 Output: subq_0.c5
 Cache Key: ref_0.c, ref_0.a
 ->  Subquery Scan on subq_0  (cost=0.00..1.95 rows=1 width=4)
   Output: subq_0.c5
   Filter: (CASE WHEN (subq_0.c5 < 2) THEN NULL::jsonb ELSE 
NULL::jsonb END ? ref_0.c)
   ->  Limit  (cost=0.00..0.78 rows=78 width=4)
 Output: (ref_0.a)
 ->  Function Scan on pg_catalog.generate_series sample_0  
(cost=0.00..3.00 rows=300 width=4)
   Output: ref_0.a
   Function Call: generate_series(1, 300)

so unless the "Cache Key" output is a complete lie, the cache key
types we should be concerned with are text and integer.  The Datum
that's being passed to jsonb_hash looks suspiciously like it is a
text value '', too, which matches the "c" value from the first
row of pagg_tab_ml.  I now think some part of Memoize is looking in
completely the wrong place to discover the cache key datatypes.

regards, tom lane




Re: jsonb crash

2021-09-29 Thread Andrew Dunstan


On 9/29/21 4:00 PM, Tom Lane wrote:
> Jaime Casanova  writes:
>> I found a crash (segmentation fault) on jsonb.
>> This is the best I could do to reduce the query:
>> """
>> select  
>>   75 as c1
>> from 
>>   public.pagg_tab_ml as ref_0,
>>   lateral (select  
>> ref_0.a as c5 
>>   from generate_series(1, 300) as sample_0
>>   fetch first 78 rows only
>>   ) as subq_0
>> where case when (subq_0.c5 < 2) 
>>then cast(null as jsonb) 
>> else cast(null as jsonb) 
>>   end ? ref_0.c
>> """
> I think this must be a memoize bug.  AFAICS, nowhere in this query
> can we be processing a non-null JSONB value, so what are we doing
> in jsonb_hash?  Something down-stack must have lost the information
> that the Datum is actually null.


Yeah, confirmed that this is not failing in release 13.


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane
Alvaro Herrera  writes:
> Pushed.  Watching for buildfarm fireworks now.

jacana didn't like it:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53

The relevant info seems to be

# Running: pg_basebackup -D 
/home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
 -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: 
FATAL:  no pg_hba.conf entry for replication connection from host "127.0.0.1", 
user "pgrunner", no encryption
Bail out!  system pg_basebackup failed

which looks like a pretty straightforward bogus-connection-configuration
problem, except why wouldn't other BF members show it?

regards, tom lane




Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-09-29 Thread Melanie Plageman
On Mon, Sep 27, 2021 at 2:58 PM Melanie Plageman
 wrote:
>
> On Fri, Sep 24, 2021 at 5:58 PM Melanie Plageman
>  wrote:
> >
> > On Thu, Sep 23, 2021 at 5:05 PM Melanie Plageman
> >  wrote:
> > The only remaining TODOs are described in the commit message.
> > most critical one is that the reset message doesn't work.
>
> v10 is attached with updated comments and some limited code refactoring

v11 has fixed the oversize message issue by sending a reset message for
each backend type. Now, we will call GetCurrentTimestamp
BACKEND_NUM_TYPES times, so maybe I should add some kind of flag to the
reset message that indicates the first message so that all the "do once"
things can be done at that point.

I've also fixed a few style/cosmetic issues and updated the commit
message with a link to the thread [1] where I proposed smgrwrite() and
smgrextend() wrappers (which is where I propose to call
pgstat_incremement_buffer_access_type() for unbuffered writes and
extends).

- Melanie

[1] 
https://www.postgresql.org/message-id/CAAKRu_aw72w70X1P%3Dba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g%40mail.gmail.com
From b2023adb80f2920081a6f35c19e0276d38ae3a15 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 29 Sep 2021 15:44:51 -0400
Subject: [PATCH v11 3/3] Remove superfluous bgwriter stats

Remove stats from pg_stat_bgwriter which are now more clearly expressed
in pg_stat_buffers.

TODO:
- make pg_stat_checkpointer view and move relevant stats into it
- add additional stats to pg_stat_bgwriter
---
 doc/src/sgml/monitoring.sgml  | 47 ---
 src/backend/catalog/system_views.sql  |  6 +---
 src/backend/postmaster/checkpointer.c | 26 ---
 src/backend/postmaster/pgstat.c   |  5 ---
 src/backend/storage/buffer/bufmgr.c   |  6 
 src/backend/utils/adt/pgstatfuncs.c   | 30 -
 src/include/catalog/pg_proc.dat   | 22 -
 src/include/pgstat.h  | 10 --
 src/test/regress/expected/rules.out   |  5 ---
 9 files changed, 1 insertion(+), 156 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 75753c3339..5852c45246 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3416,24 +3416,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
- 
-  
-   buffers_checkpoint bigint
-  
-  
-   Number of buffers written during checkpoints
-  
- 
-
- 
-  
-   buffers_clean bigint
-  
-  
-   Number of buffers written by the background writer
-  
- 
-
  
   
maxwritten_clean bigint
@@ -3444,35 +3426,6 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
   
  
 
- 
-  
-   buffers_backend bigint
-  
-  
-   Number of buffers written directly by a backend
-  
- 
-
- 
-  
-   buffers_backend_fsync bigint
-  
-  
-   Number of times a backend had to execute its own
-   fsync call (normally the background writer handles those
-   even when the backend does its own write)
-  
- 
-
- 
-  
-   buffers_alloc bigint
-  
-  
-   Number of buffers allocated
-  
- 
-
  
   
stats_reset timestamp with time zone
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 30280d520b..c45c261f4b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1058,18 +1058,14 @@ CREATE VIEW pg_stat_archiver AS
 s.stats_reset
 FROM pg_stat_get_archiver() s;
 
+-- TODO: make separate pg_stat_checkpointer view
 CREATE VIEW pg_stat_bgwriter AS
 SELECT
 pg_stat_get_bgwriter_timed_checkpoints() AS checkpoints_timed,
 pg_stat_get_bgwriter_requested_checkpoints() AS checkpoints_req,
 pg_stat_get_checkpoint_write_time() AS checkpoint_write_time,
 pg_stat_get_checkpoint_sync_time() AS checkpoint_sync_time,
-pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
-pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
 pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
-pg_stat_get_buf_written_backend() AS buffers_backend,
-pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
-pg_stat_get_buf_alloc() AS buffers_alloc,
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
 CREATE VIEW pg_stat_buffers AS
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c0c4122fd5..829f52cc8f 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -90,17 +90,9 @@
  * requesting backends since the last checkpoint start.  The flags are
  * chosen so that OR'ing is the correct way to combine multiple requests.
  *
- * num_backend_writes is used to count the number of

Re: jsonb crash

2021-09-29 Thread David Rowley
On Thu, 30 Sept 2021 at 09:24, Tom Lane  wrote:
> After further inspection, "what are we doing in jsonb_hash?" is
> indeed a relevant question, but it seems like it's a type mismatch
> not a nullness issue.  EXPLAIN VERBOSE shows

I think you're right here. It should be hashing text.  That seems to
be going wrong in check_memoizable() because it assumes it's always
fine to use the left side's type of the OpExpr to figure out the hash
function to use.

Maybe we can cache the left and the right type's hash function and use
the correct one in paraminfo_get_equal_hashops().

David




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Robert Haas
On Wed, Sep 29, 2021 at 2:06 PM Tom Lane  wrote:
> The real comment I'd have here, though, is that writing one-off
> code for this purpose is bad.  If we have a need for a repetitive
> timeout, it'd be better to add the feature to timeout.c explicitly.
> That would probably also remove the need for extra copies of the
> timeout time.

I'm not sure that really helps very much, honestly.  I mean it would
be useful in this particular case, but there are other cases where we
have logic like this already, and this wouldn't do anything about
those. For example, consider autoprewarm_main(). Like this code, that
code thinks (perhaps just because I'm the one who reviewed it) that
the next time should be measured from the last time ... but an
enhancement to the timeout machinery wouldn't help it at all. I
suspect there are other cases like this elsewhere, because this is
what I personally tend to think is the right behavior and I feel like
it comes up in patch reviews from time to time, but I'm not finding
any at the moment. Even if I'm right that they exist, I'm not sure
they look much like each other or can easily reuse any code.

And then again on the other hand, BackgroundWriterMain() thinks that
the next time should be measured from the time we got around to doing
it, not the scheduled time. I guess we don't really have any
consistent practice here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: DELETE CASCADE

2021-09-29 Thread David Christensen


Tom Lane  writes:

> [ a couple of random thoughts after quickly scanning this thread ... ]
>
> David Christensen  writes:
>> I assume this would look something like:
>> ALTER TABLE foo ALTER CONSTRAINT my_fkey ON UPDATE CASCADE ON DELETE RESTRICT
>> with omitted referential_action implying preserving the existing one.
>
> I seem to remember somebody working on exactly that previously, though
> it's evidently not gotten committed.  In any case, we already have
>
>   ALTER TABLE ... ALTER CONSTRAINT constraint_name
>   [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY 
> IMMEDIATE ]
>
> which has to modify pg_trigger rows, so it's hard to see why it'd
> be at all difficult to implement this using code similar to that
> (maybe even sharing much of the code).

Sure; this was my assumption as well.

> Returning to the original thought of a DML statement option to temporarily
> override the referential_action, I wonder why only temporarily-set-CASCADE
> was considered.  It seems to me like there might also be use-cases for
> temporarily selecting the SET NULL or SET DEFAULT actions.

Agreed; DELETE CASCADE had been the originating use case, but there is no
reason to limit it to just that action. (I'd later expanded it to add
RESTRICT as well, but barring implementation issues, there is probably no
reason to exclude any of the referential actions.)

> Another angle is that if we consider the deferrability properties as
> precedent, there already is a way to override an FK constraint's
> deferrability for the duration of a transaction: see SET CONSTRAINTS.
> So I wonder if maybe the way to treat this is to invent something like
>
>   SET CONSTRAINTS my_fk_constraint [,...] ON DELETE referential_action
>
> which would override the constraints' action for the remainder of the
> transaction.  (Permission needs TBD, but probably the same as you
> would need to create a new FK constraint on the relevant table.)
>
> In comparison to the original proposal, this'd force you to be explicit
> about which constraint(s) you intend to override, but TBH I think that's
> a good thing.

I can see the argument for this in terms of being cautious/explicit about
what gets removed; however, the utility in this particular form was related
to being able to *avoid* having to manually figure out the relationship
chains and the specific constraints involved.  Might there be some sort of
middle ground here?

> One big practical problem, which we've never addressed in the context of
> SET CONSTRAINTS but maybe it's time to tackle, is that the SQL spec
> defines the syntax like that because it thinks constraint names are
> unique per-schema; thus a possibly-schema-qualified name is sufficient
> ID.  Of course we say that constraint names are only unique per-table,
> so SET CONSTRAINTS has always had this issue of not being very carefully
> targeted.  I think we could do something like extending the syntax
> to be
>
>   SET CONSTRAINTS conname [ON tablename] [,...] new_properties

This part seems reasonable.  I need to look at how the existing SET
CONSTRAINTS is implemented; it would be interesting to see how the
per-table/session settings are managed, as that would be instructive for
additional transient state like this.

> Anyway, just food for thought --- I'm not necessarily set on any
> of this.

Sure thing; I appreciate even a little bit of your attention here.

Best,

David

-- 




Re: prevent immature WAL streaming

2021-09-29 Thread Andrew Dunstan


On 9/29/21 4:33 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Pushed.  Watching for buildfarm fireworks now.
> jacana didn't like it:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2021-09-29%2018%3A25%3A53
>
> The relevant info seems to be
>
> # Running: pg_basebackup -D 
> /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
>  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 failed: 
> FATAL:  no pg_hba.conf entry for replication connection from host 
> "127.0.0.1", user "pgrunner", no encryption
> Bail out!  system pg_basebackup failed
>
> which looks like a pretty straightforward bogus-connection-configuration
> problem, except why wouldn't other BF members show it?
>
>   


This:

# Second test: a standby that receives WAL via archive/restore commands.
$node = PostgresNode->new('primary2');
$node->init(
    has_archiving => 1,
    extra => ['--wal-segsize=1']);


doesn't have "allows_streaming => 1".


cheers


andrew


-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: jsonb crash

2021-09-29 Thread Tom Lane
David Rowley  writes:
> On Thu, 30 Sept 2021 at 09:24, Tom Lane  wrote:
>> After further inspection, "what are we doing in jsonb_hash?" is
>> indeed a relevant question, but it seems like it's a type mismatch
>> not a nullness issue.  EXPLAIN VERBOSE shows

> I think you're right here. It should be hashing text.  That seems to
> be going wrong in check_memoizable() because it assumes it's always
> fine to use the left side's type of the OpExpr to figure out the hash
> function to use.

> Maybe we can cache the left and the right type's hash function and use
> the correct one in paraminfo_get_equal_hashops().

Um ... it seems to have correctly identified the cache key expressions,
so why isn't it just doing exprType on those?  The jsonb_exists operator
seems entirely irrelevant here.

regards, tom lane




Re: when the startup process doesn't (logging startup delays)

2021-09-29 Thread Tom Lane
Robert Haas  writes:
> On Wed, Sep 29, 2021 at 2:06 PM Tom Lane  wrote:
>> The real comment I'd have here, though, is that writing one-off
>> code for this purpose is bad.  If we have a need for a repetitive
>> timeout, it'd be better to add the feature to timeout.c explicitly.
>> That would probably also remove the need for extra copies of the
>> timeout time.

> I'm not sure that really helps very much, honestly.

I didn't claim there are any other places that could use the feature
*today*.  But once we've got one, it seems like there could be more
tomorrow.  In any case, I dislike keeping timeout state data outside
timeout.c, because it's so likely to get out-of-sync that way.

regards, tom lane




Re: jsonb crash

2021-09-29 Thread David Rowley
On Thu, 30 Sept 2021 at 10:09, Tom Lane  wrote:
>
> David Rowley  writes:
> > Maybe we can cache the left and the right type's hash function and use
> > the correct one in paraminfo_get_equal_hashops().
>
> Um ... it seems to have correctly identified the cache key expressions,
> so why isn't it just doing exprType on those?  The jsonb_exists operator
> seems entirely irrelevant here.

This is down to the caching stuff I added to RestrictInfo to minimise
the amount of work done during the join search. I cached the hash
equal function in RestrictInfo so I didn't have to check what that was
each time we consider a join.  The problem is that I did a bad job of
taking inspiration from check_hashjoinable() which just looks at the
left type.
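
(For reference, a sketch of looking the hash and equality support up from
each cache key expression's own type rather than from the clause's left
input type; this is illustrative only, and the discussion below refines it
further:)

ListCell   *lc;

foreach(lc, cache_key_exprs)        /* hypothetical list of cache key Exprs */
{
    Oid         keytype = exprType((Node *) lfirst(lc));
    TypeCacheEntry *typentry;

    typentry = lookup_type_cache(keytype,
                                 TYPECACHE_HASH_PROC | TYPECACHE_EQ_OPR);
    if (!OidIsValid(typentry->hash_proc) || !OidIsValid(typentry->eq_opr))
        return false;               /* this key's type cannot be hashed/compared */
}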

David




Re: jsonb crash

2021-09-29 Thread Tom Lane
David Rowley  writes:
> On Thu, 30 Sept 2021 at 10:09, Tom Lane  wrote:
>> Um ... it seems to have correctly identified the cache key expressions,
>> so why isn't it just doing exprType on those?  The jsonb_exists operator
>> seems entirely irrelevant here.

> This is down to the caching stuff I added to RestrictInfo to minimise
> the amount of work done during the join search. I cached the hash
> equal function in RestrictInfo so I didn't have to check what that was
> each time we consider a join.  The problem is, that I did a bad job of
> taking inspiration from check_hashjoinable() which just looks at the
> left type.

I'm still confused.  AFAICS, the top-level operator of the qual clause has
exactly nada to do with the cache keys, as this example makes plain.

regards, tom lane




Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-29, Andrew Dunstan wrote:

> > The relevant info seems to be
> >
> > # Running: pg_basebackup -D 
> > /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
> >  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
> > pg_basebackup: error: connection to server at "127.0.0.1", port 59502 
> > failed: FATAL:  no pg_hba.conf entry for replication connection from host 
> > "127.0.0.1", user "pgrunner", no encryption
> > Bail out!  system pg_basebackup failed
> >
> > which looks like a pretty straightforward bogus-connection-configuration
> > problem, except why wouldn't other BF members show it?
> 
> This:
> 
> # Second test: a standby that receives WAL via archive/restore commands.
> $node = PostgresNode->new('primary2');
> $node->init(
>     has_archiving => 1,
>     extra => ['--wal-segsize=1']);
> 
> doesn't have "allows_streaming => 1".

Hmm, but I omitted allows_streaming on purpose -- I only wanted
archiving, not streaming.  I understand that your point is that
set_replication_conf is not called unless allows_streaming is set.

So, do we take the stance that we have no right to expect pg_basebackup
to work if we didn't pass allows_streaming => 1?  If so, the fix is to
add it.  But my preferred fix would be to call set_replication_conf if
either allows_streaming or has_archiving are given.


Another easy fix would be to call $primary2->set_replication_conf in the
test file, but then you'd complain that that's supposed to be an
internal method :-)

(This reminds me that I had to add something that seemed like it should
have been unnecessary: wal_level=replica should become set if I request
archiving, right?  Otherwise the WAL archive is useless.  I also had to
add max_wal_senders=2 so that pg_basebackup would work, but I'm on the
fence about setting that automatically if has_archiving is given.)

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/
"At least to kernel hackers, who really are human, despite occasional
rumors to the contrary" (LWN.net)




Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane
Andrew Dunstan  writes:
> On 9/29/21 4:33 PM, Tom Lane wrote:
>> which looks like a pretty straightforward bogus-connection-configuration
>> problem, except why wouldn't other BF members show it?

> This:
> ...
> doesn't have "allows_streaming => 1".

Oh, and that only breaks things on Windows, cf set_replication_conf.

... although I wonder why the fact that sub init otherwise sets
wal_level = minimal doesn't cause a problem for this test.
Maybe the test script is cowboy-ishly overriding that?

regards, tom lane




Re: jsonb crash

2021-09-29 Thread David Rowley
On Thu, 30 Sept 2021 at 10:20, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Thu, 30 Sept 2021 at 10:09, Tom Lane  wrote:
> >> Um ... it seems to have correctly identified the cache key expressions,
> >> so why isn't it just doing exprType on those?  The jsonb_exists operator
> >> seems entirely irrelevant here.
>
> > This is down to the caching stuff I added to RestrictInfo to minimise
> > the amount of work done during the join search. I cached the hash
> > equal function in RestrictInfo so I didn't have to check what that was
> > each time we consider a join.  The problem is, that I did a bad job of
> > taking inspiration from check_hashjoinable() which just looks at the
> > left type.
>
> I'm still confused.  AFAICS, the top-level operator of the qual clause has
> exactly nada to do with the cache keys, as this example makes plain.

You're right that it does not. The lateral join condition could be
anything.  We just need to figure out which hash function and which
equality function to use so that we can properly find any cached tuples
when we're probing the hash table.  We need the equal function too, as we
can't just return any old cached tuples that match the same hash value.

Maybe recording the operator is not the best thing to do. Maybe I
should have just recorded the regproc's Oid for the equal function.
That would save from calling get_opcode() in ExecInitMemoize().

David




Re: prevent immature WAL streaming

2021-09-29 Thread Alvaro Herrera
On 2021-Sep-29, Tom Lane wrote:

> ... although I wonder why the fact that sub init otherwise sets
> wal_level = minimal doesn't cause a problem for this test.
> Maybe the test script is cowboy-ishly overriding that?

It is.  (I claim that it should be set otherwise when has_archiving=>1).

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"El Maquinismo fue proscrito so pena de cosquilleo hasta la muerte"
(Ijon Tichy en Viajes, Stanislaw Lem)




Re: prevent immature WAL streaming

2021-09-29 Thread Andrew Dunstan


On 9/29/21 5:27 PM, Alvaro Herrera wrote:
> On 2021-Sep-29, Andrew Dunstan wrote:
>
>>> The relevant info seems to be
>>>
>>> # Running: pg_basebackup -D 
>>> /home/pgrunner/bf/root/REL_14_STABLE/pgsql.build/src/test/recovery/tmp_check/t_026_overwrite_contrecord_primary2_data/backup/backup
>>>  -h 127.0.0.1 -p 59502 --checkpoint fast --no-sync
>>> pg_basebackup: error: connection to server at "127.0.0.1", port 59502 
>>> failed: FATAL:  no pg_hba.conf entry for replication connection from host 
>>> "127.0.0.1", user "pgrunner", no encryption
>>> Bail out!  system pg_basebackup failed
>>>
>>> which looks like a pretty straightforward bogus-connection-configuration
>>> problem, except why wouldn't other BF members show it?
>> This:
>>
>> # Second test: a standby that receives WAL via archive/restore commands.
>> $node = PostgresNode->new('primary2');
>> $node->init(
>>     has_archiving => 1,
>>     extra => ['--wal-segsize=1']);
>>
>> doesn't have "allows_streaming => 1".
> Hmm, but I omitted allows_streaming on purpose -- I only wanted
> archiving, not streaming.  I understand that your point is that
> set_replication_conf is not called unless allows_streaming is set.
>
> So, do we take the stance that we have no right to expect pg_basebackup
> to work if we didn't pass allow_streaming => 1?  If so, the fix is to
> add it.  But my preferred fix would be to call set_replication_conf if
> either allows_streaming or has_archiving are given.
>
>
> Another easy fix would be to call $primary2->set_replication_conf in the
> test file, but then you'd complain that that's supposed to be an
> internal method :-)


It claims that but it's also used here:


src/bin/pg_basebackup/t/010_pg_basebackup.pl


(Also, good perl style would start purely internal method names with an
underscore.)


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: jsonb crash

2021-09-29 Thread David Rowley
On Thu, 30 Sept 2021 at 09:48, David Rowley  wrote:
> Maybe we can cache the left and the right type's hash function and use
> the correct one in paraminfo_get_equal_hashops().

Here's a patch of what I had in mind for the fix.  It's just hot off
the press, so really only intended to assist discussion at this stage.

David


fix_incorrect_hashfunction_in_memoize.patch
Description: Binary data


Re: jsonb crash

2021-09-29 Thread Tom Lane
David Rowley  writes:
> On Thu, 30 Sept 2021 at 10:20, Tom Lane  wrote:
>> I'm still confused.  AFAICS, the top-level operator of the qual clause has
>> exactly nada to do with the cache keys, as this example makes plain.

> You're right that it does not. The lateral join condition could be
> anything.

Actually, the more I look at this the more unhappy I get, because
it's becoming clear that you have made unfounded semantic
assumptions.  The hash functions generally only promise that they
will distinguish values that are distinguishable by the associated
equality operator.  We have plenty of data types in which that does
not map to bitwise equality ... you need not look further than
float8 for an example.  And in turn, that means that there are lots
of functions/operators that *can* distinguish hash-equal values.
The fact that you're willing to treat this example as cacheable
means that memoize will fail on such clauses.

So I'm now thinking you weren't that far wrong to be looking at
hashability of the top-level qual operator.  What is missing is
that you have to restrict candidate cache keys to be the *direct*
arguments of such an operator.  Looking any further down in the
expression introduces untenable assumptions.

regards, tom lane




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-09-29 Thread Michail Nikolaev
Hello, Antonin.

> I'm trying to continue the review, sorry for the delay. Following are a few
> question about the code:

Thanks for the review :) And sorry for the delay too :)

> * Does the masking need to happen in the AM code, e.g. _bt_killitems()?
>   I'd expect that the RmgrData.rm_fpi_mask can do all the work.

RmgrData.rm_fpi_mask clears only a single BTP_LP_SAFE_ON_STANDBY bit,
to indicate that hint bits are not safe to be used on the standby.
Why not clear the LP_DEAD bits in rm_fpi_mask as well? That would not be
enough, because we could get such bits in multiple ways:

* the standby was created from the base backup of the primary
* some pages were changed by pg_rewind
* the standby was updated to the version having this feature (so, old
pages still contains LP_DEAD)

So, the AM code needs to know when and why to clear LP_DEAD bits if
BTP_LP_SAFE_ON_STANDBY is not set.
Also, the important moment here is pg_memory_barrier() usage.
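
(A minimal sketch of what such a callback might look like; the rm_fpi_mask
hook and the BTP_LP_SAFE_ON_STANDBY flag are introduced by the patch and do
not exist in core, so everything here is an assumption about shape only:)

/* Clear the "hint bits are safe" flag when applying an FPI from the primary. */
static void
btree_fpi_mask(char *pagedata, BlockNumber blkno)
{
    Page        page = (Page) pagedata;
    BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

    opaque->btpo_flags &= ~BTP_LP_SAFE_ON_STANDBY;
}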

> * How about modifying rm_mask() instead of introducing rm_fpi_mask()? Perhaps
>  a boolean argument can be added to distinguish the purpose of the masking.

I tried that approach, but the code looked dirty and complicated.
Also, the separate fpi_mask gives the function clearer semantics.

> * Are you sure it's o.k. to use mask_lp_flags() here? It sets the item flags
> to LP_UNUSED unconditionally, which IMO should only be done by VACUUM.
Oh, good catch. I made mask_lp_dead for this. Also, added such a
situation to the test.

> ** is bufmgr.c the best location for this function?
Moved to indexam.c and made static (is_index_lp_dead_allowed).

 > should probably be
 > /* It is always allowed on primary if ->all_dead. */
Fixed.

> * comment: XLOG_HEAP2_CLEAN has been renamed to XLOG_HEAP2_PRUNE in PG14.
Fixed.

> * Is the purpose of the repeatable read (RR) snapshot to test that
>  heap_hot_search_buffer() does not set deadness->all_dead if some transaction
>  can still see a tuple of the chain?

The main purpose is to test xactStartedInRecovery logic after the promotion.
For example -
> if (scan->xactStartedInRecovery && !RecoveryInProgress())`

> * Unless I miss something, the tests check that the hint bits are not
>  propagated from primary (or they are propagated but marked non-safe),
>  however there's no test to check that standby does set the hint bits itself.

It is tested on a different standby, see
> is(hints_num($node_standby_2), qq(10), 'index hint bits already
set on second standby 2');

Also, I added checks for BTP_LP_SAFE_ON_STANDBY to make sure
everything in the test follows the intended scenario.

Thanks a lot,
Michail.
From cfb45d1a9cbf30be6098b2df95c4257c036c69ac Mon Sep 17 00:00:00 2001
From: Michail Nikolaev 
Date: Thu, 30 Sep 2021 00:36:21 +0300
Subject: [PATCH v4 3/3] doc

---
 src/backend/access/nbtree/README | 35 ++--
 src/backend/storage/page/README  |  8 +---
 2 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2a7332d07c..67b3e38ace 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -714,17 +714,30 @@ lax about how same-level locks are acquired during recovery (most kinds
 of readers could still move right to recover if we didn't couple
 same-level locks), but we prefer to be conservative here.
 
-During recovery all index scans start with ignore_killed_tuples = false
-and we never set kill_prior_tuple. We do this because the oldest xmin
-on the standby server can be older than the oldest xmin on the primary
-server, which means tuples can be marked LP_DEAD even when they are
-still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
-they can still appear in the standby because of full page writes. So
-we must always ignore them in standby, and that means it's not worth
-setting them either.  (When LP_DEAD-marked tuples are eventually deleted
-on the primary, the deletion is WAL-logged.  Queries that run on a
-standby therefore get much of the benefit of any LP_DEAD setting that
-takes place on the primary.)
+There is some complexity in using LP_DEAD bits during recovery. Generally,
+the bits can be set and read by scans, but a scan may also encounter bits
+that were set on the primary. We don't WAL-log tuple LP_DEAD bits, but
+they can still appear on the standby because of full-page writes. Such
+bits could cause MVCC failures, because the oldest xmin on the standby
+server can be older than the oldest xmin on the primary server, which means
+tuples can be marked LP_DEAD even when they are still visible on the standby.
+
+To prevent such failures, we mark pages whose LP_DEAD bits were set by the
+standby with a special hint. When a full-page write arrives from the primary,
+the hint is always cleared, so LP_DEAD bits received from the primary are
+ignored on the standby. Also, the standby clears all LP_DEAD bits set by the
+primary on a page before setting its own bits.
+
+There are restrictions on settings LP_DEA

Re: jsonb crash

2021-09-29 Thread Tom Lane
I wrote:
> So I'm now thinking you weren't that far wrong to be looking at
> hashability of the top-level qual operator.  What is missing is
> that you have to restrict candidate cache keys to be the *direct*
> arguments of such an operator.  Looking any further down in the
> expression introduces untenable assumptions.

Hmm ... I think that actually, a correct statement of the semantic
restriction is

To be eligible for memoization, the inside of a join can use the
passed-in parameters *only* as direct arguments of hashable equality
operators.

In order to exploit RestrictInfo-based caching, you could make the
further restriction that all such equality operators appear at the
top level of RestrictInfo clauses.  But that's not semantically
necessary.

As an example, assuming p1 and p2 are the path parameters,

(p1 = x) xor (p2 = y)

is semantically safe to memoize, despite the upper-level xor
operator.  But the example we started with, with a parameter
used as an argument of jsonb_exists, is not safe to memoize
because we have no grounds to suppose that two hash-equal values
will act the same in jsonb_exists.

regards, tom lane




Re: Enabling deduplication with system catalog indexes

2021-09-29 Thread Peter Geoghegan
On Wed, Sep 29, 2021 at 11:27 AM Peter Geoghegan  wrote:
> I would like to enable deduplication within system catalog indexes for
> Postgres 15.

I decided to run a simple experiment, to give us some idea of what
benefits my proposal gives users: I ran "make installcheck" on a newly
initdb'd database (master branch), and then with the attached patch
(which enables deduplication with system catalog indexes) applied.

I ran a query that shows the 20 largest system catalog indexes in each
case. I'm interested in when and where we see improvements to space
utilization. Any reduction in index size must be a result of index
deduplication (excluding any noise-level changes).

Master branch:

regression=# SELECT
  pg_size_pretty(pg_relation_size(c.oid)) as sz,
  c.relname
FROM pg_index i
JOIN pg_opclass op ON i.indclass[0] = op.oid
JOIN pg_am am ON op.opcmethod = am.oid
JOIN pg_class c ON i.indexrelid = c.oid
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog'
AND c.relkind = 'i' AND i.indisready AND i.indisvalid
ORDER BY pg_relation_size(c.oid) DESC LIMIT 20;
   sz|  relname
-+---
 1088 kB | pg_attribute_relid_attnam_index
 928 kB  | pg_depend_depender_index
 800 kB  | pg_attribute_relid_attnum_index
 736 kB  | pg_depend_reference_index
 352 kB  | pg_proc_proname_args_nsp_index
 216 kB  | pg_description_o_c_o_index
 200 kB  | pg_class_relname_nsp_index
 184 kB  | pg_type_oid_index
 176 kB  | pg_class_tblspc_relfilenode_index
 160 kB  | pg_type_typname_nsp_index
 104 kB  | pg_proc_oid_index
 64 kB   | pg_class_oid_index
 64 kB   | pg_statistic_relid_att_inh_index
 56 kB   | pg_collation_name_enc_nsp_index
 48 kB   | pg_constraint_conname_nsp_index
 48 kB   | pg_amop_fam_strat_index
 48 kB   | pg_amop_opr_fam_index
 48 kB   | pg_largeobject_loid_pn_index
 48 kB   | pg_operator_oprname_l_r_n_index
 48 kB   | pg_index_indexrelid_index
(20 rows)

Patch:

   sz|  relname
-+---
 1048 kB | pg_attribute_relid_attnam_index
 888 kB  | pg_depend_depender_index
 752 kB  | pg_attribute_relid_attnum_index
 616 kB  | pg_depend_reference_index
 352 kB  | pg_proc_proname_args_nsp_index
 216 kB  | pg_description_o_c_o_index
 192 kB  | pg_class_relname_nsp_index
 184 kB  | pg_type_oid_index
 152 kB  | pg_type_typname_nsp_index
 144 kB  | pg_class_tblspc_relfilenode_index
 104 kB  | pg_proc_oid_index
 72 kB   | pg_class_oid_index
 56 kB   | pg_collation_name_enc_nsp_index
 56 kB   | pg_statistic_relid_att_inh_index
 48 kB   | pg_index_indexrelid_index
 48 kB   | pg_amop_fam_strat_index
 48 kB   | pg_amop_opr_fam_index
 48 kB   | pg_largeobject_loid_pn_index
 48 kB   | pg_operator_oprname_l_r_n_index
 40 kB   | pg_index_indrelid_index
(20 rows)

The improvements to space utilization for the larger indexes
(especially the two pg_depend non-unique indexes) are smaller than I
remember from last time around, back in early 2020. This is probably
due to a combination of the Postgres 14 work and the pg_depend PIN
optimization work from commit a49d0812.

The single biggest difference is the decrease in the size of
pg_depend_reference_index -- it goes from 736 kB to 616 kB. Another
notable difference is that pg_class_tblspc_relfilenode_index shrinks,
going from 176 kB to 144 kB. These are not huge differences, but they
still seem worth having.

The best argument in favor of my proposal is definitely the index
bloat argument, which this test case tells us little or nothing about.
I'm especially concerned about scenarios where logical replication is
used, or where index deletion and VACUUM are inherently unable to
remove older index tuple versions for some other reason.

--
Peter Geoghegan


v1-0001-Enable-deduplication-in-system-catalog-indexes.patch
Description: Binary data


Re: jsonb crash

2021-09-29 Thread David Rowley
On Thu, 30 Sept 2021 at 11:20, Tom Lane  wrote:
>
> I wrote:
> > So I'm now thinking you weren't that far wrong to be looking at
> > hashability of the top-level qual operator.  What is missing is
> > that you have to restrict candidate cache keys to be the *direct*
> > arguments of such an operator.  Looking any further down in the
> > expression introduces untenable assumptions.
>
> Hmm ... I think that actually, a correct statement of the semantic
> restriction is
>
> To be eligible for memoization, the inside of a join can use the
> passed-in parameters *only* as direct arguments of hashable equality
> operators.
>
> In order to exploit RestrictInfo-based caching, you could make the
> further restriction that all such equality operators appear at the
> top level of RestrictInfo clauses.  But that's not semantically
> necessary.
>
> As an example, assuming p1 and p2 are the path parameters,
>
> (p1 = x) xor (p2 = y)
>
> is semantically safe to memoize, despite the upper-level xor
> operator.  But the example we started with, with a parameter
> used as an argument of jsonb_exists, is not safe to memoize
> because we have no grounds to suppose that two hash-equal values
> will act the same in jsonb_exists.

I'm not really sure I follow your comment about the top-level qual
operator, or why that has anything to do with it.
Remember that we *never* do any hashing of any values from the inner
side of the join.  If we're doing a parameterized nested loop and say
our parameter has the value of 1, the first time through we don't find
any cached tuples, so we run the plan from the inner side of the
nested loop join and cache all the tuples that we get from it. When
the parameter changes, we check if the current value of the parameter
has any tuples cached.  This is what the hashing and equality
comparison does. If the new parameter value is 2, then we'll hash that
and probe the hash table. Since we've only seen value 1 so far, we
won't get a cache hit.  If at some later point in time we see the
parameter value of 1 again, we hash that, find something in the hash
bucket for that value then do an equality test to ensure the values
are actually the same and not just the same hash bucket or hash value.

At no point do we do any hashing on the actual cached tuples.

This allows us to memoize any join expression, not just equality
expressions. e.g if the SQL is: SELECT * FROM t1 INNER JOIN t2 on t1.a
> t2.a;  assuming t2 is on the inner side of the nested loop join,
then the only thing we hash is the t1.a parameter and the only thing
we do an equality comparison on is the current value of t1.a vs some
previous value of t1.a that is stored in the hash table. You can see
here that if t1.a and t2.a are not the same data type then that's of
no relevance as we *never* hash or do any equality comparisons on t2.a
in the memoize code.

The whole thing just hangs together by the assumption that parameters
with the same value will always yield the same tuples. If that's
somehow a wrong assumption, then we have a problem.

I'm not sure if this helps explain how it's meant to work, or if I
just misunderstood you.

David




Re: prevent immature WAL streaming

2021-09-29 Thread Tom Lane
Andrew Dunstan  writes:
> On 9/29/21 5:27 PM, Alvaro Herrera wrote:
>> So, do we take the stance that we have no right to expect pg_basebackup
>> to work if we didn't pass allow_streaming => 1?  If so, the fix is to
>> add it.  But my preferred fix would be to call set_replication_conf if
>> either allows_streaming or has_archiving are given.
>> 
>> Another easy fix would be to call $primary2->set_replication_conf in the
>> test file, but then you'd complain that that's supposed to be an
>> internal method :-)

> It claims that but it's also used here:
> src/bin/pg_basebackup/t/010_pg_basebackup.pl

Given that precedent, it seems like calling set_replication_conf
is a good quick-fix for getting the buildfarm green again.
But +1 for then refactoring this to get rid of these hacks (both
with respect to pg_hba.conf and the postgresql.conf parameters)
in both of these tests.

regards, tom lane




Re: jsonb crash

2021-09-29 Thread Tom Lane
David Rowley  writes:
> On Thu, 30 Sept 2021 at 11:20, Tom Lane  wrote:
>> Hmm ... I think that actually, a correct statement of the semantic
>> restriction is
>> To be eligible for memoization, the inside of a join can use the
>> passed-in parameters *only* as direct arguments of hashable equality
>> operators.

> I'm not really sure if I follow your comment about the top-level qual
> operator. I'm not really sure why that has anything to do with it.
> Remember that we *never* do any hashing of any values from the inner
> side of the join.  If we're doing a parameterized nested loop and say
> our parameter has the value of 1, the first time through we don't find
> any cached tuples, so we run the plan from the inner side of the
> nested loop join and cache all the tuples that we get from it. When
> the parameter changes, we check if the current value of the parameter
> has any tuples cached.

Right, and the point is that if you *do* get a hit, you are assuming
that the inner side would return the same values as it returned for
the previous hash-equal value.  You are doing yourself no good by
thinking about simple cases like integers.  Think about float8,
and ask yourself whether, if you cached a result for +0, that result
is still good for -0.  In general we can only assume that for applications
of the hash equality operator itself (or members of its hash opfamily).
Anything involving a cast to text, for example, would fail on such a case.
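
For instance (an illustrative query, not taken from the report that started
this thread):

    SELECT '0'::float8 = '-0'::float8 AS equal,    -- t, so the two hash equal
           '0'::float8::text  AS plus_zero,        -- 0
           '-0'::float8::text AS minus_zero;       -- -0

A cached inner-side result computed for '+0' would be silently reused for
'-0', even though an expression such as the text cast can tell them apart.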

> This allows us to memoize any join expression, not just equality
> expressions.

I am clearly failing to get through to you.  Do I need to build
an example?

regards, tom lane




Re: Remove page-read callback from XLogReaderState.

2021-09-29 Thread Kyotaro Horiguchi
At Mon, 27 Sep 2021 17:31:03 +1300, Thomas Munro  wrote 
in 
> On Thu, Jul 15, 2021 at 4:48 PM Kyotaro Horiguchi
>  wrote:
> > Gah... Thank you for noticing me.  I thought that I have sent the
> > rebased version. This is the rebased version on the current master.
> 
> Hi Kyotaro,
> 
> Did you see this?
> 
> https://www.postgresql.org/message-id/20210429022553.4h5qii5jb5eclu4i%40alap3.anarazel.de

Thank you for pinging me. I hadn't noticed that.
I'll check along those lines.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




should frontend tools use syncfs() ?

2021-09-29 Thread Justin Pryzby
Forking this thread in which Thomas implemented syncfs for the startup process
(61752afb2).
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BSG9jSW3ekwib0cSdC0yD-jReJ21X4bZAmqxoWTLTc2A%40mail.gmail.com

Is there any reason that initdb/pg_basebackup/pg_checksums/pg_rewind shouldn't
use syncfs()  ?

do_syncfs() is in src/backend/ so would need to be duplicated^Wimplemented in
common.

They can't use the GUC, so we need to add a command-line option or look at an
environment variable.
From 747ec6011d438a655ee26105a315119b75036c10 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 29 Sep 2021 14:22:32 -0500
Subject: [PATCH] Allow use of syncfs() in frontend tools

Like initdb/basebackup/checksums/rewind.

See also: 61752afb26404dfc99a535c7a53f7f04dc110263
---
 src/backend/utils/misc/guc.c  |  1 +
 src/bin/initdb/initdb.c   | 10 +++-
 src/bin/pg_basebackup/pg_basebackup.c |  2 +-
 src/bin/pg_checksums/pg_checksums.c   |  2 +-
 src/bin/pg_rewind/file_ops.c  |  2 +-
 src/common/file_utils.c   | 72 ++-
 src/include/common/file_utils.h   |  2 +-
 7 files changed, 83 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d2ce4a8450..e37f6941f4 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -50,6 +50,7 @@
 #include "commands/user.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "funcapi.h"
 #include "jit/jit.h"
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1ed4808d53..79048f3949 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -142,6 +142,7 @@ static bool noclean = false;
 static bool noinstructions = false;
 static bool do_sync = true;
 static bool sync_only = false;
+static bool use_syncfs = false;
 static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
@@ -2200,6 +2201,7 @@ usage(const char *progname)
 	printf(_("  --no-instructions do not print instructions for next steps\n"));
 	printf(_("  -s, --showshow internal settings\n"));
 	printf(_("  -S, --sync-only   only sync database files to disk, then exit\n"));
+	printf(_("      --syncfs          use syncfs() rather than recursive fsync()\n"));
 	printf(_("\nOther options:\n"));
 	printf(_("  -V, --version output version information, then exit\n"));
 	printf(_("  -?, --helpshow this help, then exit\n"));
@@ -2869,6 +2871,7 @@ main(int argc, char *argv[])
 		{"no-sync", no_argument, NULL, 'N'},
 		{"no-instructions", no_argument, NULL, 13},
 		{"sync-only", no_argument, NULL, 'S'},
+		{"syncfs", no_argument, NULL, 15},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
 		{"data-checksums", no_argument, NULL, 'k'},
@@ -3020,6 +3023,9 @@ main(int argc, char *argv[])
 		 extra_options,
 		 "-c debug_discard_caches=1");
 break;
+			case 15:
+use_syncfs = true;
+break;
 			default:
 /* getopt_long already emitted a complaint */
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
@@ -3064,7 +3070,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		fsync_pgdata(pg_data, PG_VERSION_NUM);
+		fsync_pgdata(pg_data, PG_VERSION_NUM, use_syncfs);
 		check_ok();
 		return 0;
 	}
@@ -3153,7 +3159,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		fsync_pgdata(pg_data, PG_VERSION_NUM);
+		fsync_pgdata(pg_data, PG_VERSION_NUM, use_syncfs);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 669aa207a3..0b59b1fe32 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2192,7 +2192,7 @@ BaseBackup(void)
 		}
 		else
 		{
-			(void) fsync_pgdata(basedir, serverVersion);
+			(void) fsync_pgdata(basedir, serverVersion, false);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index e833c5d75e..021a4b18f9 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -687,7 +687,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			fsync_pgdata(DataDir, PG_VERSION_NUM);
+			fsync_pgdata(DataDir, PG_VERSION_NUM, false);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index c50f283ede..c50d5945ab 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	fsync_pgdata(datadir_target, PG_VERSION_NUM);
+	fsync_pgdata(datadir_target, PG_VERSION_NUM, false);
 }
 
 
diff --git a/src/common

Re: Add some tests for pg_stat_statements compatibility verification under contrib

2021-09-29 Thread Michael Paquier
On Wed, Aug 25, 2021 at 04:16:08PM +0900, Michael Paquier wrote:
> I was just looking at your patch, and I think that you should move all
> the past compatibility tests into a separate test file, in a way
> consistent with what we do in contrib/pageinspect/ for
> oldextversions.sql.  I would suggest using the same file names, while
> at it.

The current commit fest is ending, and it would be a waste to do
nothing here, so I have looked at what you proposed and reworked it.
The patch blindly tested pg_stat_statements_reset() across all the
version bumps, with the same query on pg_stat_statements run each
time, which does not help in checking the actual parts of the code
that have changed, and there are two of them:
- pg_stat_statements_reset() execution got authorized for
pg_read_all_stats once in 1.6.
- pg_stat_statements() has been extended in 1.8, so we could just have
one query stressing this function in the tests for <= 1.7.

There is also no need for tests on 1.9, which is the latest version.
Tests for this one should be added once we bump the code to the next
version.  At the end I finished with the attached, accounting for the
back-and-forth with pg_read_all_stats.
--
Michael
From 9742f2a8b717310ff75ac061f274cbc0502ac8b8 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Thu, 30 Sep 2021 11:10:09 +0900
Subject: [PATCH v4] Add some tests for past versions of pg_stat_statements

---
 contrib/pg_stat_statements/Makefile   |   2 +-
 .../expected/oldextversions.out   | 139 ++
 .../pg_stat_statements/sql/oldextversions.sql |  39 +
 3 files changed, 179 insertions(+), 1 deletion(-)
 create mode 100644 contrib/pg_stat_statements/expected/oldextversions.out
 create mode 100644 contrib/pg_stat_statements/sql/oldextversions.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 3ec627b956..7fabd96f38 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -16,7 +16,7 @@ PGFILEDESC = "pg_stat_statements - execution statistics of SQL statements"
 LDFLAGS_SL += $(filter -lm, $(LIBS))
 
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
-REGRESS = pg_stat_statements
+REGRESS = pg_stat_statements oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
new file mode 100644
index 00..515d8fd736
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -0,0 +1,139 @@
+-- test old extension version entry points
+CREATE EXTENSION pg_stat_statements WITH VERSION '1.4';
+-- Execution of pg_stat_statements_reset() is granted to pg_read_all_stats
+-- since 1.5, so this fails.
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ERROR:  permission denied for function pg_stat_statements_reset
+RESET SESSION AUTHORIZATION;
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.5';
+-- Execution of pg_stat_statements_reset() should be granted to
+-- pg_read_all_stats now, so this works.
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+RESET SESSION AUTHORIZATION;
+-- And in 1.6 it got restricted back to superusers.
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.6';
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ERROR:  permission denied for function pg_stat_statements_reset
+RESET SESSION AUTHORIZATION;
+SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
+  pg_get_functiondef   
+---
+ CREATE OR REPLACE FUNCTION public.pg_stat_statements_reset() +
+  RETURNS void+
+  LANGUAGE c  +
+  PARALLEL SAFE   +
+ AS '$libdir/pg_stat_statements', $function$pg_stat_statements_reset$function$+
+ 
+(1 row)
+
+-- New function for pg_stat_statements_reset introduced, still
+-- restricted for non-superusers.
+AlTER EXTENSION pg_stat_statements UPDATE TO '1.7';
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ERROR:  permission denied for function pg_stat_statements_reset
+RESET SESSION AUTHORIZATION;
+SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
+   pg_get_functiondef   
+--

Re: On login trigger: take three

2021-09-29 Thread Greg Nancarrow
On Wed, Sep 29, 2021 at 10:14 PM Teodor Sigaev  wrote:
>
> Nice feature, but, sorry, I see some design problems in the suggested
> feature. AFAIK, there are two use cases for this feature:
> 1 A permission / prohibition for some users to log in
> 2 Just logging the fact of a user's login
>
> The suggested patch implements prohibition of login only by failing the
> login trigger, and it has at least two issues:
> 1 In case of prohibiting login, there is no clean way to store information
> about the unsuccessful login. OK, it could be solved by the dblink module,
> but that seems too scary.

It's an area that could be improved, but the patch is more intended to
allow additional actions on successful login, as described by the
following (taken from the doc updates in the patch):

+   
+The event trigger on the login event can be
+useful for logging user logins, for verifying the connection and
+assigning roles according to current circumstances, or for some
+session data initialization.
+   
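
For illustration, with the patch applied a minimal trigger on that event
could be declared roughly as below (a sketch only: the event name "login"
comes from the doc excerpt above, and the function and trigger names are
made up):

CREATE FUNCTION log_login() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
  -- just record the login; any session setup could go here instead
  RAISE LOG 'login event trigger fired for %', session_user;
END;
$$;

CREATE EVENT TRIGGER on_login ON login
  EXECUTE FUNCTION log_login();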

> 2 For the logging purpose, a failing trigger will make login impossible; it
> could be worked around by catching the error in the trigger function, but
> that's not so obvious for the DBA.
>

If you look back on the past discussions on this thread, you'll see
that originally the patch added a GUC for the purpose of allowing
superusers to disable the event trigger, to allow login in this case.
However it was argued that there was already a documented way of
bypassing buggy event triggers (i.e. restart in single-user mode), so
that GUC was removed.
Also previously in the patch, login trigger errors for superusers were
ignored in order to allow them to login in this case, but it was
argued that it could well be unsafe to continue on after an error, so
that too was removed.
See below:

>
> +   {"enable_client_connection_trigger", PGC_SU_BACKEND, 
> DEVELOPER_OPTIONS,
> +   gettext_noop("Enables the client_connection event trigger."),
> +   gettext_noop("In case of errors in the ON client_connection 
> EVENT TRIGGER procedure, "
>  ..and..
> +   /*
> +* Try to ignore error for superuser to make it possible to login 
> even in case of errors
> +* during trigger execution
> +*/
> +   if (!is_superuser)
> +   PG_RE_THROW();
> This patch adds two ways for superusers to bypass this event trigger in case 
> of
> it being faulty, but for every other event trigger we've documented to restart
> in single-user mode and fixing it there.  Why does this need to be different?
> This clearly has a bigger chance of being a footgun but I don't see that as a
> reason to add a GUC and a bypass that other footguns lack.
>

So I really don't think that a failing event trigger will cause an
"impossibility to login".
The patch updates the documentation to explain the use of single-user
mode to fix buggy login event triggers. See below:

+   
+ The login event occurs when a user logs in to the
+ system.
+ Any bugs in a trigger procedure for this event may prevent successful
+ login to the system. Such bugs may be fixed after first restarting the
+ system in single-user mode (as event triggers are disabled in this mode).
+ See the  reference page for details about
+ using single-user mode.
+   
+

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Masahiko Sawada
On Wed, Sep 29, 2021 at 8:55 PM Amit Kapila  wrote:
>
> On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada  wrote:
> >
> > On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila  wrote:
> > >
> > > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Then, if, we proceed in this direction,
> > > > > > the place to implement those stats
> > > > > > would be on the LogicalRepWorker struct, instead ?
> > > > > >
> > > > >
> > > > > Or, we can make existing stats persistent and then add these stats on
> > > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > > >
> > > > I think that making existing stats including received_lsn and
> > > > last_msg_receipt_time persistent by using stats collector could cause
> > > > massive reporting messages. We can report these messages with a
> > > > certain interval to reduce the amount of messages but we will end up
> > > > seeing old stats on the view.
> > > >
> > >
> > > Can't we keep the current and new stats both in-memory and persist on
> > > disk? So, the persistent stats data will be used to fill the in-memory
> > > counters after restarting of workers, otherwise, we will always refer
> > > to in-memory values.
> >
> > Interesting. Probably we can have apply workers and table sync workers
> > send their statistics to the stats collector at exit (before the stats
> > collector shutting down)? And the startup process will restore them at
> > restart?
> >
>
> I think we do need to send at the exit but we should probably send at
> some other regular interval as well to avoid losing all the stats
> after the crash-restart situation.

But we clear all statistics collected by the stats collector during
crash recovery. No?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Added schema level support for publication.

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 5:48 PM houzj.f...@fujitsu.com
 wrote:
>
> On Wed, Sep 29, 2021 5:14 PM Amit Kapila  wrote:
> > On Wed, Sep 29, 2021 at 11:59 AM houzj.f...@fujitsu.com
> >  wrote:
> > >
> > > On Wed, Sep 29, 2021 12:34 PM Amit Kapila 
> > > > On Wed, Sep 29, 2021 at 9:07 AM houzj.f...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > > On Tues, Sep 28, 2021 10:46 PM vignesh C 
> > wrote:
> > > > > > Attached v34 patch has the changes for the same.
> > > > > How about we check this case like the following ?
> > > > >
> > > > > List   *schemaPubids = GetSchemaPublications(nspOid);
> > > > > List   *relPubids = 
> > > > > GetRelationPublications(RelationGetRelid(rel));
> > > > > if (list_intersection(schemaPubids, relPubids))
> > > > > ereport(ERROR, ...
> > > > >
> > > >
> > > > Won't this will allow changing one of the partitions for which only
> > > > partitioned table is part of the target schema?
> > >
> > > I think it still disallow changing partition's schema to the published 
> > > one.
> > > I tested with the following SQLs.
> > > -
> > > create schema sch1;
> > > create schema sch2;
> > > create schema sch3;
> > >
> > > create table sch1.tbl1 (a int) partition by range ( a ); create table
> > > sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to (101);
> > > create table sch3.tbl1_part2 partition of sch1.tbl1 for values from
> > > (101) to (200); create publication pub for ALL TABLES IN schema sch1,
> > > TABLE sch2.tbl1_part1; alter table sch2.tbl1_part1 set schema sch1;
> > > ---* It will report an error here *
> > > -
> > >
> >
> > Use all steps before "create publication" and then try below. These will 
> > give an
> > error with the patch proposed but if I change it to what you are proposing 
> > then
> > it won't give an error.
> > create publication pub for ALL TABLES IN schema sch2, Table sch1.tbl1; alter
> > table sch3.tbl1_part2 set schema sch2;
> >
> > But now again thinking about it, I am not sure if we really want to give 
> > error in
> > this case. What do you think?
>
> Personally, I think we can allow the above case.
>
> Because if the user specifies the partitioned table in the publication like
> above, they cannot drop the partition separately. And the partitioned table
> is the actual one in pg_publication_rel. So, I think allowing this case
> won't confuse people.
>

Yeah, I also thought on similar lines. So, let's allow this case.
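
For clarity, putting the quoted steps together, the sequence we now intend
to allow is (same schemas and tables as in the examples above):

create schema sch1;
create schema sch2;
create schema sch3;
create table sch1.tbl1 (a int) partition by range (a);
create table sch2.tbl1_part1 partition of sch1.tbl1 for values from (1) to (101);
create table sch3.tbl1_part2 partition of sch1.tbl1 for values from (101) to (200);
create publication pub for ALL TABLES IN SCHEMA sch2, TABLE sch1.tbl1;
alter table sch3.tbl1_part2 set schema sch2;   -- should no longer error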

>
> > Also, if we use list_intersection trick, then how will
> > we tell the publication due to which this problem has occurred, or do you 
> > think
> > we should leave that as an exercise for the user?
>
> I thought list_intersection would return a list of puboids that exist in
> both input lists. We can pick the first puboid and output it in the error
> message, which seems the same as the current patch.
>
> But I noticed list_intersection doesn't support T_OidList, so we might need
> to search for the puboid manually. Like:
>
> foreach(cell, relPubids)
> {
>     if (list_member_oid(schemaPubids, lfirst_oid(cell)))
>         ereport(ERROR,
>                 errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                 errmsg("cannot move table \"%s\" to schema \"%s\"",
>                        RelationGetRelationName(rel), stmt->newschema),
>                 errdetail("Altering table will result in having schema \"%s\" and same schema's table \"%s\" in the publication \"%s\" which is not supported.",
>                           stmt->newschema,
>                           RelationGetRelationName(rel),
>                           get_publication_name(lfirst_oid(cell), false)));
> }
>

Looks good to me.


-- 
With Regards,
Amit Kapila.




Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Amit Kapila
On Thu, Sep 30, 2021 at 8:18 AM Masahiko Sawada  wrote:
>
> On Wed, Sep 29, 2021 at 8:55 PM Amit Kapila  wrote:
> >
> > On Wed, Sep 29, 2021 at 4:49 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Sep 28, 2021 at 7:04 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Sep 28, 2021 at 11:35 AM Masahiko Sawada 
> > > >  wrote:
> > > > >
> > > > > On Tue, Sep 28, 2021 at 1:54 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Then, if, we proceed in this direction,
> > > > > > > the place to implement those stats
> > > > > > > would be on the LogicalRepWorker struct, instead ?
> > > > > > >
> > > > > >
> > > > > > Or, we can make existing stats persistent and then add these stats 
> > > > > > on
> > > > > > top of it. Sawada-San, do you have any thoughts on this matter?
> > > > >
> > > > > I think that making existing stats including received_lsn and
> > > > > last_msg_receipt_time persistent by using stats collector could cause
> > > > > massive reporting messages. We can report these messages with a
> > > > > certain interval to reduce the amount of messages but we will end up
> > > > > seeing old stats on the view.
> > > > >
> > > >
> > > > Can't we keep the current and new stats both in-memory and persist on
> > > > disk? So, the persistent stats data will be used to fill the in-memory
> > > > counters after restarting of workers, otherwise, we will always refer
> > > > to in-memory values.
> > >
> > > Interesting. Probably we can have apply workers and table sync workers
> > > send their statistics to the stats collector at exit (before the stats
> > > collector shutting down)? And the startup process will restore them at
> > > restart?
> > >
> >
> > I think we do need to send at the exit but we should probably send at
> > some other regular interval as well to avoid losing all the stats
> > after the crash-restart situation.
>
> But we clear all statistics collected by the stats collector during
> crash recovery.
>

Right. So, maybe doing what you are suggesting is sufficient.

-- 
With Regards,
Amit Kapila.




RE: (LOCK TABLE options) “ONLY” and “NOWAIT” are not yet implemented

2021-09-29 Thread Shinya11.Kato
>Thank you for your feedback.
>I might have added whitespace when I was checking the patch file.
>I attach a new patch to this mail.
Thank you for the update!

>   else if (Matches("LOCK", MatchAny, "IN", "ACCESS|ROW") ||
>-           Matches("LOCK", "TABLE", MatchAny, "IN", "ACCESS|ROW"))
>+           Matches("LOCK", "TABLE", MatchAny, "IN", "ACCESS|ROW") ||
>+           Matches("LOCK", "ONLY", MatchAny, "IN", "ACCESS|ROW") ||
>+           Matches("LOCK", "TABLE", "ONLY", MatchAny, "IN", "ACCESS|ROW"))
I think this code is redundant, so I changed it to the following:
---
else if (HeadMatches("LOCK") && TailMatches("IN", "ACCESS|ROW"))
---
I created a patch with that change and attached it. What do you think?
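
For reference, these are the kinds of statements the completion is meant to
cover (the table name is just an example):

LOCK TABLE ONLY mytab IN ACCESS EXCLUSIVE MODE NOWAIT;
LOCK ONLY mytab IN ROW EXCLUSIVE MODE;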

>> 2. The command "LOCK TABLE a, b;" can be executed, but tab-completion
>> doesn't work properly. Is it OK?
>It's OK for now.
>But it should be able to handle a case of multiple tables in the future.
OK. I agreed.

Regards,
Shinya Kato


fix_tab_completion_of_lock.patch
Description: fix_tab_completion_of_lock.patch


Re: should frontend tools use syncfs() ?

2021-09-29 Thread Michael Paquier
On Wed, Sep 29, 2021 at 07:43:41PM -0500, Justin Pryzby wrote:
> Forking this thread in which Thomas implemented syncfs for the startup process
> (61752afb2).
> https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BSG9jSW3ekwib0cSdC0yD-jReJ21X4bZAmqxoWTLTc2A%40mail.gmail.com
> 
> Is there any reason that initdb/pg_basebackup/pg_checksums/pg_rewind shouldn't
> use syncfs()  ?

That makes sense.

> do_syncfs() is in src/backend/ so would need to be duplicated^Wimplemented in
> common.

The fd handling in the backend makes things tricky if we try to plug
in a common interface, so I'd rather implement this as frontend-only
code.

> They can't use the GUC, so need to add an cmdline option or look at an
> environment variable.

fsync_pgdata() is going to manipulate many inodes anyway, because
that's a code path designed to do so.  If we know that syncfs() is
just going to be better, I'd rather just call it by default if
available and not add new switches to all the frontend tools in need
of flushing the data folder, switches that are not documented in your
patch.
--
Michael


signature.asc
Description: PGP signature


Re: should frontend tools use syncfs() ?

2021-09-29 Thread Thomas Munro
On Thu, Sep 30, 2021 at 4:49 PM Michael Paquier  wrote:
> fsync_pgdata() is going to manipulate many inodes anyway, because
> that's a code path designed to do so.  If we know that syncfs() is
> just going to be better, I'd rather just call it by default if
> available and not add new switches to all the frontend tools in need
> of flushing the data folder, switches that are not documented in your
> patch.

If we want this it should be an option, because it flushes out data
other than the pgdata dir, and it doesn't report errors on old
kernels.




Re: Failed transaction statistics to measure the logical replication progress

2021-09-29 Thread Amit Kapila
On Thu, Sep 30, 2021 at 8:22 AM Hou, Zhijie/侯 志杰  wrote:
>
> On Tues, Sep 28, 2021 6:05 PM Amit Kapila  wrote:
> >
> > Can't we keep the current and new stats both in-memory and persist on
> > disk? So, the persistent stats data will be used to fill the in-memory
> > counters after restarting of workers, otherwise, we will always refer
> > to in-memory values.
>
> I think this approach works, but I have another concern about it.
>
> The current pg_stat_subscription view is listed under "Dynamic Statistics
> Views" in the documentation; the data in it is about the worker process, and
> it shows what the current worker did. But if we make the new xact stats
> persistent, then they are no longer about what the current worker did; they
> look more like historic data of the subscription.
>

I see your point.

> Adding a new view seems reasonable, but it will bring in another
> subscription-related view, which might be too much. OTOH, I can see there
> are already some different views[1] that include xact stats, so maybe
> adding another one is acceptable?
>

All these views are related to data transmitted to the stats collector, but
what we really need is a view similar to pg_stat_archiver or
pg_stat_bgwriter, which gives information about background workers.
Now, the problem as I see it is that if we go that route then
pg_stat_subscription will no longer remain a dynamic view, and one might
consider that a compatibility break. The other idea I have shared
is that we display these stats under the new view introduced by
Sawada-San's patch [1] and probably rename that view to
pg_stat_subscription_worker, where all the stats (xact info and last
failure information) about each worker would be displayed. Do you have
any opinion on that idea, or do you see any problem with it?

Sure, we can introduce a new view but I want to avoid it if possible.

[1] - 
https://www.postgresql.org/message-id/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK%3D30xJfUVihNZDA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Some thoughts about the TAP tests' wait_for_catchup()

2021-09-29 Thread Amit Kapila
On Wed, Sep 29, 2021 at 9:29 PM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Wed, Sep 29, 2021 at 3:47 AM Tom Lane  wrote:
> >> It seems to me that for most purposes wait_for_catchup's approach is
> >> strictly worse, for two reasons:
> >> 1. It continually recomputes the primary's pg_current_wal_lsn().
> >> 2. It's querying the primary's view of the standby's state, which
> >> introduces a reporting delay.
>
> > I can't comment on all the use cases of wait_for_catchup() but I think
> > there are some use cases in logical replication where we need the
> > publisher to use wait_for_catchup after setting up the replication to
> > ensure that wal sender is started and in-proper state by checking its
> > state (which should be 'streaming'). That also implicitly checks if
> > the wal receiver has responded to initial ping requests by sending
> > replay location.
>
> Yeah, for logical replication we can't look at the subscriber's WAL
> positions because they could be totally different.  What I'm on
> about is the tests that use physical replication.  I think examining
> the standby's state directly is better in that case, for the reasons
> I cited.
>
> I guess the question of interest is whether it's sufficient to test
> the walreceiver feedback mechanisms in the context of logical
> replication, or whether we feel that the physical-replication code
> path is enough different that there should be a specific test for
> that combination too.
>

There is a difference in the handling of feedback messages between the
physical and logical replication code paths. It is mainly about how we
advance the slot's LSN based on the WAL flushed. See
ProcessStandbyReplyMessage: towards the end, we call different functions
based on the slot type. So, I think it is better to have a test for the
physical replication feedback mechanism.

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-09-29 Thread Masahiko Sawada
On Mon, Sep 27, 2021 at 2:55 PM Amit Kapila  wrote:
>
> On Mon, Sep 27, 2021 at 11:02 AM Masahiko Sawada  
> wrote:
> >
> > On Mon, Sep 27, 2021 at 12:50 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Sep 27, 2021 at 12:24 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Sep 27, 2021 at 6:21 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Sat, Sep 25, 2021 at 4:23 PM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > Sure, but each tablesync worker must have a separate relid. Why 
> > > > > > can't
> > > > > > we have a single hash table for both apply and table sync workers
> > > > > > which are hashed by sub_id + rel_id? For apply worker, the rel_id 
> > > > > > will
> > > > > > always be zero (InvalidOId) and tablesync workers will have a unique
> > > > > > OID for rel_id, so we should be able to uniquely identify each of
> > > > > > apply and table sync workers.
> > > > >
> > > > > What I imagined is to extend the subscription statistics, for
> > > > > instance, transaction stats[1]. By having a hash table for
> > > > > subscriptions, we can store those statistics into an entry of the hash
> > > > > table and we can think of subscription errors as also statistics of
> > > > > the subscription. So we can have another hash table for errors in an
> > > > > entry of the subscription hash table. For example, the subscription
> > > > > entry struct will be something like:
> > > > >
> > > > > typedef struct PgStat_StatSubEntry
> > > > > {
> > > > > Oid subid; /* hash key */
> > > > >
> > > > > HTAB *errors;/* apply and table sync errors */
> > > > >
> > > > > /* transaction stats of subscription */
> > > > > PgStat_Counter xact_commit;
> > > > > PgStat_Counter xact_commit_bytes;
> > > > > PgStat_Counter xact_error;
> > > > > PgStat_Counter xact_error_bytes;
> > > > > PgStat_Counter xact_abort;
> > > > > PgStat_Counter xact_abort_bytes;
> > > > > PgStat_Counter failure_count;
> > > > > } PgStat_StatSubEntry;
> > > > >
> > > >
> > > > I think these additional stats will be displayed via
> > > > pg_stat_subscription, right? If so, the current stats of that view are
> > > > all in-memory and are per LogicalRepWorker which means that for those
> > > > stats also we will have different entries for apply and table sync
> > > > worker. If this understanding is correct, won't it be better to
> > > > represent this as below?
> > >
> > > I was thinking that we have a different stats view for example
> > > pg_stat_subscription_xacts that has entries per subscription. But your
> > > idea seems better to me.
> >
> > I mean that showing statistics (including transaction statistics and
> > errors) per logical replication worker seems better to me, no matter
> > what view shows these statistics. I'll change the patch in that way.
> >
>

I've attached updated patches that incorporate all comments I got so
far. Please review them.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


v15-0003-Add-skip_xid-option-to-ALTER-SUBSCRIPTION.patch
Description: Binary data


v15-0002-Add-RESET-command-to-ALTER-SUBSCRIPTION-command.patch
Description: Binary data


v15-0001-Add-a-subscription-errors-statistics-view-pg_sta.patch
Description: Binary data


Re: Logical replication keepalive flood

2021-09-29 Thread Amit Kapila
On Thu, Sep 30, 2021 at 8:49 AM Hou, Zhijie/侯 志杰  wrote:
>
> I noticed another patch that Horiguchi-San posted earlier[1].
>
> The approach in that patch is to split the sleep into two pieces. If the
> first sleep reaches the timeout, it sends a keepalive and then sleeps for
> the remaining time.
>
> I tested that patch and can see that the number of keepalive messages is
> reduced and the patch doesn't cause the current regression tests to fail.
>
> Since I didn't find any comments about that patch,
> I wonder whether we found some problems with it?
>

I am not able to understand some parts of that patch.

+ If the sleep is shorter
+ * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
+ * prevent it from getting too-frequent.
+ */
+ if (MyWalSnd->flush < sentPtr &&
+ MyWalSnd->write < sentPtr &&
+ !waiting_for_ping_response)
+ {
+ if (sleeptime > KEEPALIVE_TIMEOUT)
+ {
+ int r;
+
+ r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
+WAIT_EVENT_WAL_SENDER_WAIT_WAL);
+
+ if (r != 0)
+ continue;
+
+ sleeptime -= KEEPALIVE_TIMEOUT;
+ }
+
+ WalSndKeepalive(false);

It claims to skip sending keepalive if the sleep time is shorter than
KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to
suggest it sends in both cases. What am I missing?

Also, more to the point this special keep_alive seems to be sent for
synchronous replication and walsender shutdown as they can expect
updated locations. You haven't given any reason (theory) why those two
won't be impacted due to this change? I am aware that for synchronous
replication, we wake waiters while ProcessStandbyReplyMessage but I am
not sure how it helps with wal sender shutdown? I think we need to
know the reasons for this message and then need to see if the change
has any impact on the same.

-- 
With Regards,
Amit Kapila.




Re: Logical replication keepalive flood

2021-09-29 Thread Greg Nancarrow
On Thu, Sep 30, 2021 at 3:56 PM Amit Kapila  wrote:
>
> I am not able to understand some parts of that patch.
>
> + If the sleep is shorter
> + * than KEEPALIVE_TIMEOUT milliseconds, we skip sending a keepalive to
> + * prevent it from getting too-frequent.
> + */
> + if (MyWalSnd->flush < sentPtr &&
> + MyWalSnd->write < sentPtr &&
> + !waiting_for_ping_response)
> + {
> + if (sleeptime > KEEPALIVE_TIMEOUT)
> + {
> + int r;
> +
> + r = WalSndWait(wakeEvents, KEEPALIVE_TIMEOUT,
> +WAIT_EVENT_WAL_SENDER_WAIT_WAL);
> +
> + if (r != 0)
> + continue;
> +
> + sleeptime -= KEEPALIVE_TIMEOUT;
> + }
> +
> + WalSndKeepalive(false);
>
> It claims to skip sending keepalive if the sleep time is shorter than
> KEEPALIVE_TIMEOUT (a new threshold) but the above code seems to
> suggest it sends in both cases. What am I missing?
>

The comment does seem to be wrong.
The way I see it, if the calculated sleeptime is greater than
KEEPALIVE_TIMEOUT, then it first sleeps for up to KEEPALIVE_TIMEOUT
milliseconds and skips sending a keepalive if something happens (i.e.
the socket becomes readable/writeable) during that time (WalSndWait
will return non-zero in that case). Otherwise, it sends a keepalive
and sleeps for (sleeptime - KEEPALIVE_TIMEOUT), then loops around
again ...

> Also, more to the point this special keep_alive seems to be sent for
> synchronous replication and walsender shutdown as they can expect
> updated locations. You haven't given any reason (theory) why those two
> won't be impacted due to this change? I am aware that for synchronous
> replication, we wake waiters while ProcessStandbyReplyMessage but I am
> not sure how it helps with wal sender shutdown? I think we need to
> know the reasons for this message and then need to see if the change
> has any impact on the same.
>

Yes, I'm not sure about the possible impacts, still looking at it.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Implementing Incremental View Maintenance

2021-09-29 Thread Yugo NAGATA
Hello Takahashi-san,

On Wed, 22 Sep 2021 18:53:43 +0900
Yugo NAGATA  wrote:

> Hello Takahashi-san,
> 
> On Thu, 5 Aug 2021 08:53:47 +
> "r.takahash...@fujitsu.com"  wrote:
> 
> > Hi Nagata-san,
> > 
> > 
> > Thank you for your reply.
> > 
> > > I'll investigate this more, but we may have to prohibit views on 
> > > partitioned
> > > table and partitions.
> > 
> > I think this restriction is strict.
> > This feature is useful when the base table is large and partitioning is 
> > also useful in such case.
> 
> One reason for this issue is the lack of triggers on partitioned tables or
> partitions that are not specified in the view definition.
>
> However, even if we create triggers recursively on the parents or children,
> we would still need more consideration. This is because we would have to
> convert the format of tuples of the modified table to the format of the
> table specified in the view in cases where the parent and some children
> have different formats.
> 
> I think supporting partitioned tables can be left for the next release.
> 
> > 
> > I have several additional comments on the patch.
> > 
> > 
> > (1)
> > The following features are added to transition table.
> > - Prolong lifespan of transition table
> > - If table has row security policies, set them to the transition table
> > - Calculate pre-state of the table
> > 
> > Are these features only for IVM?
> > If there are other useful case, they should be separated from IVM patch and
> > should be independent patch for transition table.
> 
> Maybe. However, we don't have a good idea of use cases for them other than
> IVM for now...
> 
> > 
> > (2)
> > DEPENDENCY_IMMV (m) is added to deptype of pg_depend.
> > What is the difference compared with existing deptype such as 
> > DEPENDENCY_INTERNAL (i)?
> 
> DEPENDENCY_IMMV was added to make it clear that a certain trigger is related
> to IMMV. We drop the IVM trigger and its dependencies from the IMMV when
> REFRESH ... WITH NO DATA is executed. Without the new deptype, we might
> accidentally delete a dependency created with an intention other than the
> IVM trigger.
> 
> > (3)
> > Converting from normal materialized view to IVM or from IVM to normal 
> > materialized view is not implemented yet.
> > Is it difficult?
> > 
> > I think create/drop triggers and __ivm_ columns can achieve this feature.
> 
> I think it is harder than you expected. When an IMMV is switched to a normal
> materialized view, we need to drop the hidden columns (__ivm_count__ etc.),
> and in the opposite case, we need to create them again. The former
> (IMMV->IVM) might be easier, but for the latter (IVM->IMMV) I wonder if we
> would need to re-create the IMMV.

I am sorry, but I found a mistake in the above description.
"IMMV->IVM" and "IVM->IMMV" were wrong. I should have used "IMMV->MV" and
"MV->IMMV",
where MV means a normal materialized view.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




pg_stat_replication_slots docs

2021-09-29 Thread Amit Kapila
Today, I noticed that we have mentioned pg_stat_replication_slots
under "Dynamic Statistics Views". I think it should be under
"Collected Statistics Views"?

-- 
With Regards,
Amit Kapila.


fix_pg_stat_replication_slots_doc.patch
Description: Binary data


Re: should frontend tools use syncfs() ?

2021-09-29 Thread Michael Paquier
On Thu, Sep 30, 2021 at 05:08:24PM +1300, Thomas Munro wrote:
> If we want this it should be an option, because it flushes out data
> other than the pgdata dir, and it doesn't report errors on old
> kernels.

Oh, OK, thanks.  That's the part about 5.8.  The only option
controlling if sync is used now in those binaries is --no-sync.
Should we use a different design for the option rather than a 
--syncfs?  Something like --sync={on,off,syncfs,fsync} could be a
possibility, for example.
--
Michael


signature.asc
Description: PGP signature