Re: alter table set TABLE ACCESS METHOD

2021-07-28 Thread Michael Paquier
On Wed, Jul 28, 2021 at 01:05:10PM -0700, Jeff Davis wrote:
> On Wed, 2021-07-28 at 14:02 +0900, Michael Paquier wrote:
>> Right.  Isn't that an older issue though?  A rewrite involved after a
>> change of relpersistence does not call the hook either.  It looks to
>> me that this should be after finish_heap_swap() to match with
>> ATExecSetTableSpace() in ATRewriteTables().  The only known user of
>> object_access_hook in the core code is sepgsql, so this would
>> involve a change of behavior.  And I don't recall any backpatching
>> that added a post-alter hook.
> 
> Sounds like it should be a different patch. Thank you.

Doing any checks around the hooks of objectaccess.h is very annoying,
because we have no modules except sepgsql with which to test them
easily.  Anyway, I have been checking that with the hack-ish module
attached, and tracked down that swap_relation_files() already calls
InvokeObjectPostAlterHookArg() (as you already spotted?), but that's
an internal change when it comes to SET LOGGED/UNLOGGED/ACCESS
METHOD :(

Attached is a small module I have used for those tests, for
reference.  It passes on HEAD, and with the patch attached you can see
the extra entries.
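
For anyone who wants to poke at this without sepgsql, the shape of such
a module is roughly the following (a minimal sketch, not the contents of
the attached tarball; load it via shared_preload_libraries):

```
/*
 * Minimal sketch of an object_access_hook test module.  The hook
 * signature is the one from catalog/objectaccess.h; everything else
 * here is illustrative.
 */
#include "postgres.h"

#include "catalog/objectaccess.h"
#include "catalog/pg_class.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

static object_access_hook_type prev_object_access_hook = NULL;

static void
test_object_access(ObjectAccessType access, Oid classId,
				   Oid objectId, int subId, void *arg)
{
	if (prev_object_access_hook)
		prev_object_access_hook(access, classId, objectId, subId, arg);

	/* Log post-alter events on pg_class entries, i.e. relations. */
	if (access == OAT_POST_ALTER && classId == RelationRelationId)
		elog(LOG, "post-alter hook fired for relation %u", objectId);
}

void
_PG_init(void)
{
	prev_object_access_hook = object_access_hook;
	object_access_hook = test_object_access;
}
```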

>>> Also, I agree with Justin that it should fail when there are multiple
>>> SET ACCESS METHOD subcommands consistently, regardless of whether one
>>> is a no-op, and it should probably throw a syntax error to match SET
>>> TABLESPACE.
>> 
>> Hmm.  Okay.
>> 
>> Hmm.  Okay.

I'd still disagree with that.  One counter-example is SET LOGGED,
which does not fail when repeated multiple times.  Also, tracking down
whether a SET ACCESS METHOD subcommand has already been passed
requires an extra field in each tab entry, so I am not sure that this
is worth the extra complication.

I can see benefits and drawbacks either way, and I would tend to keep
the code as simple as possible, as we never store InvalidOid for a
table AM.  Anyway, I won't fight the majority either.

>>> Minor nit: in tab-complete.c, why does it say "<sth>"? Is that just a
>>> typo or is there a reason it's different from everything else, which
>>> uses "<something>"? And what does "sth" mean anyway?
>> 
>> "Something".  That should be "<something>" to be consistent with the area.
> 
> These two issues are pretty minor.

Fixed this one now, so as not to forget about it.  Thanks.
--
Michael


object_hooks.tar.gz
Description: application/gzip
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fcd778c62a..b18de38e73 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5475,6 +5475,8 @@ ATRewriteTables(AlterTableStmt *parsetree, List **wqueue, LOCKMODE lockmode,
 			 RecentXmin,
 			 ReadNextMultiXactId(),
 			 persistence);
+
+			InvokeObjectPostAlterHook(RelationRelationId, tab->relid, 0);
 		}
 		else
 		{


signature.asc
Description: PGP signature


Re: needless complexity in StartupXLOG

2021-07-28 Thread Amul Sul
On Mon, Jul 26, 2021 at 9:43 PM Robert Haas  wrote:
>
> StartupXLOG() has code beginning around line 7900 of xlog.c that
> decides, at the end of recovery, between four possible courses of
> action. It either writes an end-of-recovery record, or requests a
> checkpoint, or creates a checkpoint, or does nothing, depending on the
> value of 3 flag variables, and also on whether we're still able to
> read the last checkpoint record:
>
> checkPointLoc = ControlFile->checkPoint;
>
> /*
>  * Confirm the last checkpoint is available for us to recover
>  * from if we fail.
>  */
> record = ReadCheckpointRecord(xlogreader,
> checkPointLoc, 1, false);
> if (record != NULL)
> {
> promoted = true;
>
> It seems to me that ReadCheckpointRecord() should never fail here. It
> should always be true, even when we're not in recovery, that the last
> checkpoint record is readable. If there's ever a situation where that
> is not true, even for an instant, then a crash at that point will be
> unrecoverable. Now, one way that it could happen is if we've got a bug
> in the code someplace that removes WAL segments too soon. However, if
> we have such a bug, we should fix it. What this code does is says "oh,
> I guess we removed the old checkpoint too soon, no big deal, let's
> just be more aggressive about getting the next one done," which I do
> not think is the right response. Aside from a bug, the only other way
> I can see it happening is if someone is manually removing WAL segments
> as the server is running through recovery, perhaps as some last-ditch
> play to avoid running out of disk space. I don't think the server
> needs to have - or should have - code to cater to such crazy
> scenarios. Therefore I think that this check, at least in its current
> form, is not sensible.
>
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark. Perhaps the most logical place
> would be to move it to CreateCheckPoint() just after we remove old
> xlog files, but we don't have the xlogreader there, and anyway I don't
> see how it's really going to help. What bug do we expect to catch by
> removing files we think we don't need and then checking that we didn't
> remove the files we think we do need? That seems more like grasping at
> straws than a serious attempt to make things work better.
>
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.
>

Can we have an elog() fatal error or warning to make sure that the
last checkpoint is still readable?  The case where the user (knowingly
or unknowingly) or some buggy code has removed the WAL file containing
the last checkpoint is quite possible, and if it happens we would have
a hard time tracking it down once further unexpected behavior shows
up.  Thoughts?

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.
>
> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> 

Re: pg_receivewal starting position

2021-07-28 Thread Kyotaro Horiguchi
At Wed, 28 Jul 2021 12:57:39 +0200, Ronan Dunklau  
wrote in 
> > On Wednesday 28 July 2021 at 08:22:30 CEST, Kyotaro Horiguchi wrote:
> > At Tue, 27 Jul 2021 07:50:39 +0200, Ronan Dunklau 
> > wrote in
> > > I don't know if it should be the default, toggled by a command line flag,
> > > or if we even should let the user provide a LSN.
> > 
> > *I* think it is completely reasonable (or at least convenient or less
> > astonishing) that pg_receivewal starts from the restart_lsn of the
> > replication slot to use.  The tool already decides the clean-start LSN
> > a bit unusual way. And it seems to me that proposed behavior can be
> > the default when -S is specified.
> > 
> 
> As of now we can't get the replication_slot restart_lsn with a replication 
> connection AFAIK.
> 
> This implies that the patch could require the user to specify a
> maintenance-db parameter, and we would use that if provided to fetch the
> replication slot info, or fall back to the previous behaviour. I don't
> really like this approach, as the behaviour changing depending on whether
> we supply a maintenance-db parameter or not is error-prone for the user.
>
> Another option would be to add a new replication command (for example 
> ACQUIRE_REPLICATION_SLOT <slot_name>) to set the replication slot as the 
> current one, and return some info about it (restart_lsn at least for a 
> physical slot).

I didn't think it through in detail.  But I had forgotten that
ordinary SQL commands are prohibited on a physical replication
connection.  So we do need a new replication command, but that's not
such a big deal.

> I don't see any reason not to make it work for logical replication 
> connections 
> / slots, but it wouldn't be that useful since we can query the database in 
> that case.

Ordinary SQL queries are usable alongside a logical replication slot,
so I'm not sure how a logical replication connection would use the
command.  However, like you, I wouldn't bother restricting the command
to physical replication; perhaps the new command should return the
slot type, though.

> Acquiring the replication slot instead of just reading it would make sure 
> that 
> no other process could start the replication between the time we read the 
> restart_lsn and when we issue START_REPLICATION. START_REPLICATION could then 
> check if we already have a replication slot, and ensure it is the same one as 
> the one we're trying to use. 

I'm not sure it's worth adding complexity for such strictness.
START_REPLICATION safely fails if someone steals the slot in the
meantime.  In the first place there's no means to protect a slot from
others while it is idle.  One possible problem is the case where
START_REPLICATION successfully acquires the slot after the new command
failed.  But that case doesn't seem worse than someone advancing the
slot during our absence.  So I think READ_REPLICATION_SLOT is
sufficient.

> From pg_receivewal point of view, this would amount to:
> 
>  - check if we currently have wal in the target directory.
>    - if we do, proceed as currently done, by computing the start lsn and 
> timeline from the last archived wal
>    - if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use the 
> restart_lsn as the start lsn if there is one, and don't provide a timeline
>    - if we still don't have a start_lsn, fall back to using the current server 
> wal position as is done. 

That's pretty much it.
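
In rough C, that decision would look about like this (FindStreamingStart()
is the existing function in pg_receivewal.c; ReadReplicationSlot() and
server_current_position() are hypothetical helpers standing in for the new
command and IDENTIFY_SYSTEM):

```
static XLogRecPtr
decide_start_position(PGconn *conn, const char *slot_name, uint32 *timeline)
{
	XLogRecPtr	start;

	/* 1. If the target directory already has WAL, resume from there. */
	start = FindStreamingStart(timeline);
	if (start != InvalidXLogRecPtr)
		return start;

	/* 2. Otherwise, if a slot is given, start from its restart_lsn. */
	if (slot_name != NULL &&
		ReadReplicationSlot(conn, slot_name, &start, timeline) &&
		start != InvalidXLogRecPtr)
		return start;

	/* 3. Fall back to the server's current WAL position. */
	return server_current_position(conn, timeline);
}
```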

> What do you think ? Which information should we provide about the slot ?

We need the timeline ID to start with when using restart_lsn.  The
current timeline can be used in most cases, but there's a case where
the LSN is historical.

pg_receivewal doesn't send a replication status report when a segment
is finished.  So if pg_receivewal stops just after a segment is
finished, the slot stays at the beginning of the last segment, and
next time it will start from there, creating a duplicate segment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[postgres_fdw] add local pid to fallback_application_name

2021-07-28 Thread kuroda.hay...@fujitsu.com
Hi Hackers,

I propose adding trackable information to postgres_fdw, in order to track 
remote queries correctly.

## Background and motivation

Currently postgres_fdw connects to remote servers by using connect_pg_server(). 
However, the function just calls PQconnectdbParams() with 
fallback_application_name = "postgres_fdw".
Therefore, if two or more servers connect to one data server and two queries 
arrive at the data server, the database administrator cannot determine which 
query came from which server.
This prevents some workload analysis because the flow of queries cannot be 
tracked.

## Implementation

I just added the local backend's pid to fallback_application_name. This is the 
key for searching and matching the two logs.
In order to use the feature and track remote transactions, the user must add 
the backend pid and application_name to log_line_prefix, like

```
log_line_prefix = '%m [%p] [%a] '
```

Here is an output example. Assume that the remote server has a table "foo" and 
the local server imports the schema.
When the local server executes a foreign scan, the following line is output in 
the local server's logfile.

```
2021-07-29 03:18:50.630 UTC [21572] [psql] LOG:  duration: 23.366 ms  
statement: select * from foo;
```

And in the remote server's logfile, the following lines appeared.

```
2021-07-29 03:18:50.628 UTC [21573] [postgres_fdw for remote PID: 21572] LOG:  
duration: 0.615 ms  parse <unnamed>: DECLARE c1 CURSOR FOR
SELECT id FROM public.foo
```

The two lines have the same pid, so we can match the log entries and analyze 
workloads correctly.
I will write docs later, but for now I want you to review the motivation and 
implementation.
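
For reference, the core of the change in connect_pg_server() is roughly
the following (a sketch only; the buffer size and exact code here are
illustrative, and the attached add_pid.patch is authoritative):

```
	/*
	 * Embed the local backend's pid in the application name reported to
	 * the remote server.  keywords/values/n are the existing
	 * connection-option arrays built in connect_pg_server().
	 */
	char		appname[64];

	snprintf(appname, sizeof(appname),
			 "postgres_fdw for remote PID: %d", MyProcPid);
	keywords[n] = "fallback_application_name";
	values[n] = appname;
	n++;
```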

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



add_pid.patch
Description: add_pid.patch


Re: needless complexity in StartupXLOG

2021-07-28 Thread Julien Rouhaud
On Thu, Jul 29, 2021 at 3:28 AM Andres Freund  wrote:
>
> [1] I really wish somebody had the energy to just remove single user and
> bootstrap modes. The degree to which they increase complexity in the rest of
> the system is entirely unreasonable. There's not actually any reason
> bootstrapping can't happen with checkpointer et al running, it's just
> autovacuum that'd need to be blocked.

Any objection to adding an entry for that in the wiki TODO?




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-28 Thread David G. Johnston
On Wed, Jul 28, 2021 at 6:52 PM Bruce Momjian  wrote:

> I came up with the attached patch.
>

Thank you.  It is an improvement but I think more could be done here (not
exactly sure what - though removing the "copy binaries for contrib modules
from the old server" seems like a decent second step.)

I'm not sure it really needs a parenthetical, and I personally dislike
using "Consider" to start the sentence.

"Bringing extensions up to the newest version available on the new server
can be done later using ALTER EXTENSION UPGRADE (after ensuring the correct
binaries are installed)."

David J.


Re: pgbench bug candidate: negative "initial connection time"

2021-07-28 Thread Yugo NAGATA
Hello,

On Fri, 18 Jun 2021 15:58:48 +0200 (CEST)
Fabien COELHO  wrote:

> Attached Yugo-san patch with some updates discussed in the previous mails, 
> so as to move things along.

I attached the patch rebased over the change made by 856de3b39cf.

Regards,
Yugo Nagata



-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 55d14604c0..e460dc7e5b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3181,6 +3181,7 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 	if ((st->con = doConnect()) == NULL)
 	{
+		/* as the bench is already running, we do not abort the process */
 		pg_log_error("client %d aborted while establishing connection", st->id);
 		st->state = CSTATE_ABORTED;
 		break;
@@ -4418,7 +4419,10 @@ runInitSteps(const char *initialize_steps)
 	initPQExpBuffer(&stats);
 
 	if ((con = doConnect()) == NULL)
+	{
+		pg_log_fatal("connection for initialization failed");
 		exit(1);
+	}
 
 	setup_cancel_handler(NULL);
 	SetCancelConn(con);
@@ -6361,7 +6365,10 @@ main(int argc, char **argv)
 	/* opening connection... */
 	con = doConnect();
 	if (con == NULL)
+	{
+		pg_log_fatal("setup connection failed");
 		exit(1);
+	}
 
 	/* report pgbench and server versions */
 	printVersion(con);
@@ -6515,7 +6522,7 @@ main(int argc, char **argv)
 #endif			/* ENABLE_THREAD_SAFETY */
 
 		for (int j = 0; j < thread->nstate; j++)
-			if (thread->state[j].state == CSTATE_ABORTED)
+			if (thread->state[j].state != CSTATE_FINISHED)
 				exit_code = 2;
 
 		/* aggregate thread level stats */
@@ -6583,7 +6590,7 @@ threadRun(void *arg)
 		if (thread->logfile == NULL)
 		{
 			pg_log_fatal("could not open logfile \"%s\": %m", logpath);
-			goto done;
+			exit(1);
 		}
 	}
 
@@ -6607,16 +6614,10 @@ threadRun(void *arg)
 		{
 			if ((state[i].con = doConnect()) == NULL)
 			{
-				/*
-				 * On connection failure, we meet the barrier here in place of
-				 * GO before proceeding to the "done" path which will cleanup,
-				 * so as to avoid locking the process.
-				 *
-				 * It is unclear whether it is worth doing anything rather
-				 * than coldly exiting with an error message.
-				 */
-				THREAD_BARRIER_WAIT(&barrier);
-				goto done;
+				/* coldly abort on initial connection failure */
+				pg_log_fatal("cannot create connection for client %d",
+							 state[i].id);
+				exit(1);
 			}
 		}
 
@@ -6749,7 +6750,7 @@ threadRun(void *arg)
 					continue;
 				}
 				/* must be something wrong */
-				pg_log_fatal("%s() failed: %m", SOCKET_WAIT_METHOD);
+				pg_log_error("%s() failed: %m", SOCKET_WAIT_METHOD);
 				goto done;
 			}
 		}


Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
Ranier Vilela  writes:
> IMO, I think that "char *msg" is unnecessary, isn't it?

> + if (!PQExpBufferBroken(errorMessage))
> + res->errMsg = pqResultStrdup(res, errorMessage->data);
>   else
> - res->errMsg = NULL;
> + res->errMsg = libpq_gettext("out of memory\n");

Please read the comment.

regards, tom lane




Re: Slightly improve initdb --sync-only option's help message

2021-07-28 Thread Gurjeet Singh
On Mon, Jul 26, 2021 at 11:05 AM Bossart, Nathan  wrote:
> Here are my suggestions in patch form.

+printf(_("  -S, --sync-only   safely write all database
files to disk and exit\n"));

Not your patch's fault, but the word "write" does not seem to convey
the true intent of the option, because generally a "write" operation
is still limited to dirtying the OS buffers, and does not guarantee
sync-to-disk.

It'd be better if the help message said either "flush all database
files to disk and exit" or "sync all database files to disk and exit".

Best regards,
--
Gurjeet Singh http://gurjeet.singh.im/




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-07-28 Thread Bruce Momjian
On Thu, Jul 15, 2021 at 08:15:57AM -0700, David G. Johnston wrote:
> On Thursday, July 15, 2021, David G. Johnston 
>  My uncertainty revolves around core extensions since it seems odd to tell
> the user to overwrite them with versions from an older version of
> PostgreSQL.
> 
> Ok. Just re-read the docs a third time…no uncertainty regarding contrib
> now…following the first part of the instructions means that before one could
> re-run create extension they would need to restore the original contrib 
> library
> files to avoid the new extension code using the old library.  So that whole
> part about recreation is inconsistent with the existing unchanged text.

I came up with the attached patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index a83c63cd98..c042300c8c 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -306,8 +306,9 @@ make prefix=/usr/local/pgsql.new install
  into the new cluster, e.g., pgcrypto.so,
  whether they are from contrib
  or some other source. Do not install the schema definitions, e.g.,
- CREATE EXTENSION pgcrypto, because these will be upgraded
- from the old cluster.
+ CREATE EXTENSION pgcrypto, because these will be copied
+ from the old cluster.  (Consider upgrading the extensions later
+ using ALTER EXTENSION...UPGRADE.)
  Also, any custom full text search files (dictionary, synonym,
  thesaurus, stop words) must also be copied to the new cluster.
 


Re: Failed transaction statistics to measure the logical replication progress

2021-07-28 Thread Masahiko Sawada
On Thu, Jul 8, 2021 at 3:55 PM osumi.takami...@fujitsu.com
 wrote:
>
> Hello, hackers
>
>
> When the current HEAD fails during logical decoding, the failure
> increments txns count in pg_stat_replication_slots - [1] and adds
> the transaction size to the sum of bytes in the same repeatedly
> on the publisher, until the problem is solved.
> One of the good examples is duplication error on the subscriber side
> and this applies to both streaming and spill cases as well.
>
> This update prevents users from grasping the exact number and size of
> successful and unsuccessful transactions. Accordingly, we need to
> have new columns of failed transactions that will work to differentiate
> both of them for all types, which means spill, streaming and normal
> transactions. This will help users to measure the exact status of
> logical replication.

Could you please elaborate on use cases of the proposed statistics?
For example, the current statistics on pg_replication_slots can be
used for tuning logical_decoding_work_mem as well as inferring the
total amount of bytes passed to the output plugin. How will the user
use those statistics?

Also, if we want the stats of successful transactions, why don't we
show those in the view instead of the stats of failed transactions?

>
> Attached file is the POC patch for this.
> Current design is to save failed stats data in the ReplicationSlot struct.
> This is because after the error, I'm not able to access the ReorderBuffer 
> object.
> Thus, I chose the object where I can interact with at the 
> ReplicationSlotRelease timing.

When discussing the pg_stat_replication_slots view, there was an idea
to store the slot statistics in the ReplicationSlot struct. But the idea
was rejected mainly because the struct lives in shared memory [1]. If
we store those counts in the ReplicationSlot struct it increases the usage
of shared memory. And those statistics are used only by logical slots
and don’t necessarily need to be shared among the server processes.
Moreover, if we want to add more statistics on the view in the future,
it further increases the usage of shared memory. If we want to track
the stats of successful transactions, I think it's easier to track
them on the subscriber side rather than the publisher side. We can
increase counters when applying [stream]commit/abort logical changes
on the subscriber.
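
For example, on the subscriber side it could be as simple as bumping
counters in the apply handlers of worker.c (a sketch; subscription_stats
and the two helpers below are hypothetical, not existing code):

```
/* Hypothetical per-worker counters for applied remote transactions. */
static struct
{
	uint64		commit_count;
	uint64		abort_count;
} subscription_stats;

static void
count_remote_commit(void)
{
	/* e.g. called from apply_handle_commit()/apply_handle_stream_commit() */
	subscription_stats.commit_count++;
}

static void
count_remote_abort(void)
{
	/* e.g. called from apply_handle_stream_abort() */
	subscription_stats.abort_count++;
}
```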

Regards,

[1] 
https://www.postgresql.org/message-id/CAA4eK1Kuj%2B3G59hh3wu86f4mmpQLpah_mGv2-wfAPyn%2BzT%3DP4A%40mail.gmail.com

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Out-of-memory error reports in libpq

2021-07-28 Thread Ranier Vilela
On Wed, Jul 28, 2021 at 21:25, Tom Lane  wrote:

> I wrote:
> > Andres Freund  writes:
> >> Hm. It seems we should be able to guarantee that the recovery path can
> print
> >> something, at least in the PGconn case. Is it perhaps worth pre-sizing
> >> PGConn->errorMessage so it'd fit an error like this?
> >> But perhaps that's more effort than it's worth.
>
> > Yeah.  I considered changing things so that oom_buffer contains
> > "out of memory\n" rather than an empty string, but I'm afraid
> > that that's making unsupportable assumptions about what PQExpBuffers
> > are used for.
>
> Actually, wait a minute.  There are only a couple of places that ever
> read out the value of conn->errorMessage, so let's make those places
> responsible for dealing with OOM scenarios.  That leads to a nicely
> small patch, as attached, and it looks to me like it makes us quite
> bulletproof against such scenarios.
>
> It might still be worth doing the "pqReportOOM" changes to save a
> few bytes of code space, but I'm less excited about that now.
>
IMO, I think that "char *msg" is unnecessary, isn't it?

+ if (!PQExpBufferBroken(errorMessage))
+ res->errMsg = pqResultStrdup(res, errorMessage->data);
  else
- res->errMsg = NULL;
+ res->errMsg = libpq_gettext("out of memory\n");


>
> regards, tom lane
>
>


Re: Emit namespace in post-copy output

2021-07-28 Thread Michael Paquier
On Wed, Jul 28, 2021 at 04:15:19PM +0200, Daniel Gustafsson wrote:
> Since get_namespace_name() returns a palloced string, this will lead to a 2x
> leak of the namespace length as opposed to the 1x of today.  While hardly a 
> big
> deal, it seems prudent to cap this by storing the returned string locally now
> that we need it twice.

I don't think this matters much.  A quick read of the code shows that
this memory should be allocated within the transaction context running
CLUSTER/VACUUM FULL, and that gets free'd in the internal calls of
CommitTransactionCommand().
--
Michael


signature.asc
Description: PGP signature


Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-28 Thread Michael Paquier
On Wed, Jul 28, 2021 at 08:28:12PM +, Bossart, Nathan wrote:
> On 7/28/21, 11:32 AM, "Tom Lane"  wrote:
>> I follow the idea of using WaitLatch to ensure that the delays are
>> interruptible by postmaster signals, but even that isn't worth a
>> lot given the expected use of these things.  I don't see a need to
>> expend any extra effort on wait-reporting.
> 
> +1.  The proposed patch doesn't make the delay visibility any worse
> than what's already there.

Agreed to just drop the patch (my opinion about this patch is
unchanged).  Not to mention that wait events are not available at SQL
level at this stage yet.
--
Michael


signature.asc
Description: PGP signature


Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> Hm. It seems we should be able to guarantee that the recovery path can print
>> something, at least in the PGconn case. Is it perhaps worth pre-sizing
>> PGConn->errorMessage so it'd fit an error like this?
>> But perhaps that's more effort than it's worth.

> Yeah.  I considered changing things so that oom_buffer contains
> "out of memory\n" rather than an empty string, but I'm afraid
> that that's making unsupportable assumptions about what PQExpBuffers
> are used for.

Actually, wait a minute.  There are only a couple of places that ever
read out the value of conn->errorMessage, so let's make those places
responsible for dealing with OOM scenarios.  That leads to a nicely
small patch, as attached, and it looks to me like it makes us quite
bulletproof against such scenarios.

It might still be worth doing the "pqReportOOM" changes to save a
few bytes of code space, but I'm less excited about that now.

regards, tom lane

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index e950b41374..49eec3e835 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -6739,6 +6739,14 @@ PQerrorMessage(const PGconn *conn)
 	if (!conn)
 		return libpq_gettext("connection pointer is NULL\n");
 
+	/*
+	 * The errorMessage buffer might be marked "broken" due to having
+	 * previously failed to allocate enough memory for the message.  In that
+	 * case, tell the application we ran out of memory.
+	 */
+	if (PQExpBufferBroken(&conn->errorMessage))
+		return libpq_gettext("out of memory\n");
+
 	return conn->errorMessage.data;
 }
 
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index aca81890bb..87f348f3dc 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -191,7 +191,7 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
 /* non-error cases */
 break;
 			default:
-				pqSetResultError(result, conn->errorMessage.data);
+				pqSetResultError(result, &conn->errorMessage);
 break;
 		}
 
@@ -662,14 +662,28 @@ pqResultStrdup(PGresult *res, const char *str)
  *		assign a new error message to a PGresult
  */
 void
-pqSetResultError(PGresult *res, const char *msg)
+pqSetResultError(PGresult *res, PQExpBuffer errorMessage)
 {
+	char	   *msg;
+
 	if (!res)
 		return;
-	if (msg && *msg)
-		res->errMsg = pqResultStrdup(res, msg);
+
+	/*
+	 * We handle two OOM scenarios here.  The errorMessage buffer might be
+	 * marked "broken" due to having previously failed to allocate enough
+	 * memory for the message, or it might be fine but pqResultStrdup fails
+	 * and returns NULL.  In either case, just make res->errMsg point directly
+	 * at a constant "out of memory" string.
+	 */
+	if (!PQExpBufferBroken(errorMessage))
+		msg = pqResultStrdup(res, errorMessage->data);
+	else
+		msg = NULL;
+	if (msg)
+		res->errMsg = msg;
 	else
-		res->errMsg = NULL;
+		res->errMsg = libpq_gettext("out of memory\n");
 }
 
 /*
@@ -2122,7 +2136,7 @@ PQgetResult(PGconn *conn)
 				appendPQExpBuffer(&conn->errorMessage,
 								  libpq_gettext("PGEventProc \"%s\" failed during PGEVT_RESULTCREATE event\n"),
 								  res->events[i].name);
-				pqSetResultError(res, conn->errorMessage.data);
+				pqSetResultError(res, &conn->errorMessage);
 				res->resultStatus = PGRES_FATAL_ERROR;
 				break;
 			}
diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h
index e9f214b61b..490458adef 100644
--- a/src/interfaces/libpq/libpq-int.h
+++ b/src/interfaces/libpq/libpq-int.h
@@ -637,7 +637,7 @@ extern pgthreadlock_t pg_g_threadlock;
 
 /* === in fe-exec.c === */
 
-extern void pqSetResultError(PGresult *res, const char *msg);
+extern void pqSetResultError(PGresult *res, PQExpBuffer errorMessage);
 extern void *pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary);
 extern char *pqResultStrdup(PGresult *res, const char *str);
 extern void pqClearAsyncResult(PGconn *conn);


Re: CREATE SEQUENCE with RESTART option

2021-07-28 Thread Michael Paquier
On Wed, Jul 28, 2021 at 01:16:19PM -0400, Tom Lane wrote:
> I do not see any RESTART option in SQL:2021 11.72 <sequence generator
> definition>.  Since we don't document it either, there's really no
> expectation that anyone would use it.

Okay, good point.  I was not aware of that.

> I don't particularly think that we should document it, so I'm with Michael
> that we don't need to do anything.  This is hardly the only undocumented
> corner case in PG.

More regression tests for CREATE SEQUENCE may be interesting, but
that's not really an issue either, given the ones that exist for ALTER
SEQUENCE.  Let's drop the patch and move on.
--
Michael


signature.asc
Description: PGP signature


Re: Replace l337sp34k in comments.

2021-07-28 Thread Peter Smith
On Wed, Jul 28, 2021 at 11:32 AM Michael Paquier  wrote:
>
> On Wed, Jul 28, 2021 at 09:39:02AM +1000, Peter Smith wrote:
> > IMO the PG code comments are not an appropriate place for leetspeak 
> > creativity.
> >
> > PSA a patch to replace a few examples that I recently noticed.
> >
> > "up2date" --> "up-to-date"
>
> Agreed that this is a bit cleaner to read, so done.  Just note that
> pgindent has been complaining about the format of some of the updated
> comments.

Thanks for pushing!

BTW, the commit comment [1] attributes most of these to a recent
patch, but I think that is mistaken.  AFAIK they are from when the
file was first introduced 8 years ago [2].

--
[1] 
https://github.com/postgres/postgres/commit/7b7fbe1e8bb4b2a244d1faa618789db411316e55
[2] 
https://github.com/postgres/postgres/commit/b89e151054a05f0f6d356ca52e3b725dd0505e53#diff-034b6d4eaf36425e75d7a7087d09bd6c734dd9ea8398533559d537d13b6b9197

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
Andres Freund  writes:
> Hm. It seems we should be able to guarantee that the recovery path can print
> something, at least in the PGconn case. Is it perhaps worth pre-sizing
> PGConn->errorMessage so it'd fit an error like this?

Forgot to address this.  Right now, the normal situation is that
PGConn->errorMessage is "pre sized" to 256 bytes, because that's
what pqexpbuffer.c does for all PQExpBuffers.  So unless you've
overrun that, the question is moot.  If you have, and you got
an OOM in trying to expand the PQExpBuffer, then pqexpbuffer.c
will release what it has and substitute the "oom_buffer" empty
string.  If you're really unlucky you might then not be able
to allocate another 256-byte buffer, in which case we end up
with an empty-string result.  I don't think it's probable,
but in a multithread program it could happen.

> But perhaps that's more effort than it's worth.

Yeah.  I considered changing things so that oom_buffer contains
"out of memory\n" rather than an empty string, but I'm afraid
that that's making unsupportable assumptions about what PQExpBuffers
are used for.

For now, I'm content if it's not worse than v13.  We've not
heard a lot of complaints in this area.

regards, tom lane




Re: [PATCH] test/ssl: rework the sslfiles Makefile target

2021-07-28 Thread Daniel Gustafsson
> On 28 Jul 2021, at 23:02, Tom Lane  wrote:
> Jacob Champion  writes:

>> As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm
>> happy.
> 
> After reading the gmake docs about that, I'd have to say it's likely to be 
> next
> door to unmaintainable.


Personally, I don’t think it’s that bad, but mileage varies.  It’s obviously a
show-stopper if maintainers don’t feel comfortable with it.

> And, AFAICT, it's not in 3.80.

That however, is a very good point that I missed.  I think it’s a good tool,
but probably not enough to bump the requirement.

--
Daniel Gustafsson   https://vmware.com/





Re: [PATCH] test/ssl: rework the sslfiles Makefile target

2021-07-28 Thread Tom Lane
Jacob Champion  writes:
> On Wed, 2021-07-28 at 00:24 +0200, Daniel Gustafsson wrote:
>> GNU Make is already a requirement, I don't see this shifting the needle in 
>> any
>> direction.

Um ... the existing requirement is for gmake 3.80 or newer;
if you want to use newer features we'd have to have a discussion
about whether it's worthwhile to move that goalpost.

> As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm
> happy.

After reading the gmake docs about that, I'd have to say it's
likely to be next door to unmaintainable.  Do we really have
to be that cute?  And, AFAICT, it's not in 3.80.

regards, tom lane




Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
Andres Freund  writes:
> I should probably know this, but I don't. Nor did I quickly find an answer. I
> assume gettext() reliably and reasonably deals with OOM?

I've always assumed that their fallback in cases of OOM, can't read
the message file, yadda yadda is to return the original string.
I admit I haven't gone and checked their code, but it'd be
unbelievably stupid to do otherwise.

> Looking in the gettext code I'm again scared by the fact that it takes locks
> during gettext (because of stuff like erroring out of signal handlers, not
> OOMs).

Hm.

regards, tom lane




Re: [PATCH] test/ssl: rework the sslfiles Makefile target

2021-07-28 Thread Andrew Dunstan


On 7/28/21 4:10 PM, Jacob Champion wrote:
>
>>> - I am making _heavy_ use of GNU Make-isms, which does not improve
>>> long-term maintainability.
>> GNU Make is already a requirement, I don't see this shifting the needle in 
>> any
>> direction.
> As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm
> happy.
>

We don't currently have any, and so many of us (including me) will have
to learn to understand it. But that's not to say it's unacceptable. If
there's no new infrastructure requirement then I'm OK with it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-28 Thread Bossart, Nathan
On 7/28/21, 11:32 AM, "Tom Lane"  wrote:
> I'm detecting a certain amount of lily-gilding here.  Neither of these
> delays are meant for anything except debugging purposes, and nobody as
> far as I've heard has ever expressed great concern about identifying
> which process they need to attach to for that purpose.  So I think it
> is a *complete* waste of time to add any cycles to connection startup
> to make these delays more visible.
>
> I follow the idea of using WaitLatch to ensure that the delays are
> interruptible by postmaster signals, but even that isn't worth a
> lot given the expected use of these things.  I don't see a need to
> expend any extra effort on wait-reporting.

+1.  The proposed patch doesn't make the delay visibility any worse
than what's already there.

Nathan



Re: [PATCH] test/ssl: rework the sslfiles Makefile target

2021-07-28 Thread Jacob Champion
On Wed, 2021-07-28 at 00:24 +0200, Daniel Gustafsson wrote:
> > On 4 Mar 2021, at 01:03, Jacob Champion  wrote:
> > Andrew pointed out elsewhere [1] that it's pretty difficult to add new
> > certificates to the test/ssl suite without blowing away the current
> > state and starting over. I needed new cases for the NSS backend work,
> > and ran into the same pain, so here is my attempt to improve the
> > situation.
> 
> Thanks for working on this, I second the pain cited.  I've just started to 
> look
> at this, so only a few comments thus far.
> 
> > The unused server-ss certificate has been removed entirely.
> 
> Nice catch, this seems to have been unused since the original import of the 
> SSL
> test suite.  To cut down scope of the patch (even if only a small bit) I
> propose to apply this separately first, as per the attached.

LGTM.

> > - Serial number collisions are less likely, thanks to Andrew's idea to
> > use the current clock time as the initial serial number in a series.
> 
> +my $serialno = `openssl x509 -serial -noout -in ssl/client.crt`;
> +$serialno =~ s/^serial=//;
> +$serialno = hex($serialno); # OpenSSL prints serial numbers in hexadecimal
> 
> Will that work on Windows?  We don't currently require the openssl binary to 
> be
> in PATH unless one wants to rebuild sslfiles (which it is quite likely to be
> but there should at least be errorhandling covering when it's not).

Hm, that's a good point. It should be easy enough for me to add a
fallback if the invocation fails; I'll give it a shot tomorrow.

> > - I am making _heavy_ use of GNU Make-isms, which does not improve
> > long-term maintainability.
> 
> GNU Make is already a requirement, I don't see this shifting the needle in any
> direction.

As long as the .SECONDARYEXPANSION magic is clear enough to others, I'm
happy.

Thanks!
--Jacob


Re: alter table set TABLE ACCESS METHOD

2021-07-28 Thread Jeff Davis
On Wed, 2021-07-28 at 14:02 +0900, Michael Paquier wrote:
> Arg, sorry about that!  I was unclear what the situation of the patch
> was.

No problem, race condition ;-)

> Right.  Isn't that an older issue though?  A rewrite involved after a
> change of relpersistence does not call the hook either.  It looks to
> me that this should be after finish_heap_swap() to match with
> ATExecSetTableSpace() in ATRewriteTables().  The only known user of
> object_access_hook in the core code is sepgsql, so this would
> involve a change of behavior.  And I don't recall any backpatching
> that added a post-alter hook.

Sounds like it should be a different patch. Thank you.

> > Also, I agree with Justin that it should fail when there are multiple
> > SET ACCESS METHOD subcommands consistently, regardless of whether one
> > is a no-op, and it should probably throw a syntax error to match SET
> > TABLESPACE.
> 
> Hmm.  Okay.
> 
> > Minor nit: in tab-complete.c, why does it say "<sth>"? Is that just a
> > typo or is there a reason it's different from everything else, which
> > uses "<something>"? And what does "sth" mean anyway?
> 
> "Something".  That should be "<something>" to be consistent with the area.

These two issues are pretty minor.

Regards,
Jeff Davis






Re: Out-of-memory error reports in libpq

2021-07-28 Thread Andres Freund
Hi,

On 2021-07-27 18:40:48 -0400, Tom Lane wrote:
> The first half of that just saves a few hundred bytes of repetitive
> coding.  However, I think that the addition of recovery logic is
> important for robustness, because as things stand libpq may be
> worse off than before for OOM handling.

Agreed.


> Before ffa2e4670, almost all of these call sites did
> printfPQExpBuffer(..., "out of memory").  That would automatically
> clear the message buffer to empty, and thereby be sure to report the
> out-of-memory failure if at all possible.  Now we might fail to report
> the thing that the user really needs to know to make sense of what
> happened.

Hm. It seems we should be able to guarantee that the recovery path can print
something, at least in the PGconn case. Is it perhaps worth pre-sizing
PGConn->errorMessage so it'd fit an error like this?

But perhaps that's more effort than it's worth.


> +void
> +pqReportOOMBuffer(PQExpBuffer errorMessage)
> +{
> + const char *msg = libpq_gettext("out of memory\n");

I should probably know this, but I don't. Nor did I quickly find an answer. I
assume gettext() reliably and reasonably deals with OOM?

Looking in the gettext code I'm again scared by the fact that it takes locks
during gettext (because of stuff like erroring out of signal handlers, not
OOMs).

It does look like it tries to always return the original string in case of
OOM. Although the code is quite maze-like, so it's not easy to tell..

Greetings,

Andres Freund




Re: needless complexity in StartupXLOG

2021-07-28 Thread Andres Freund
Hi,

On 2021-07-26 12:12:53 -0400, Robert Haas wrote:
> My first thought was that we should do the check unconditionally,
> rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and
> ERROR if it fails. But then I wondered what the point of that would be
> exactly. If we have such a bug -- and to the best of my knowledge
> there's no evidence that we do -- there's no particular reason it
> should only happen at the end of recovery. It could happen any time
> the system -- or the user, or malicious space aliens -- remove files
> from pg_wal, and we have no real idea about the timing of malicious
> space alien activity, so doing the test here rather than anywhere else
> just seems like a shot in the dark.

Yea. The history of that code being added doesn't suggest that there was
a concrete issue being addressed, from what I can tell.


> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

+1


> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time?

+many. The current split doesn't make much sense. For one, it often is a huge
issue if crash recovery takes a long time - why should we incur the cost that
we are OK avoiding during promotions? For another, end-of-recovery is a
crucial path for correctness, reducing the number of non-trivial paths is
good.


> Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (IsInRecovery) stanza with this:
> 
> if (InRecovery)
> CreateEndOfRecoveryRecord();
> 
> That would be WAY easier to reason about than the rat's nest we have
> here today. Now, I am not sure what it would take to get there, but I
> think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future?
I guess we could just not create a checkpoint until single user mode is shut
down / creates a checkpoint for other reasons?


Greetings,

Andres Freund


[1] I really wish somebody had the energy to just remove single user and
bootstrap modes. The degree to which they increase complexity in the rest of
the system is entirely unreasonable. There's not actually any reason
bootstrapping can't happen with checkpointer et al running, it's just
autovacuum that'd need to be blocked.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-28 Thread Andres Freund
Hi,

On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
> Apart from performance and memory usage points of view, we also need
> to consider the reusability of the code. When I started this thread, I
> thought the best data structure would be the one optimized for
> vacuum's dead tuple storage. However, if we can use a data structure
> that can also be used in general, we can use it also for other
> purposes. Moreover, if it's too optimized for the current TID system
> (32 bits block number, 16 bits offset number, maximum block/offset
> number, etc.) it may become a blocker for future changes.

Indeed.


> In that sense, radix tree also seems good since it can also be used in
> gist vacuum as a replacement for intset, or a replacement for hash
> table for shared buffer as discussed before. Are there any other use
> cases?

Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered data structures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There are plenty of
hashtables, e.g. for caches, locks, etc., in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skipping inner nodes that have a single incoming &
outgoing node) radix trees can even deal quite performantly with
variable-width keys.


> On the other hand, I’m concerned that radix tree would be an
> over-engineering in terms of vacuum's dead tuples storage since the
> dead tuple storage is static data and requires only lookup operation,
> so if we want to use radix tree as dead tuple storage, I'd like to see
> further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2021-07-28 Thread Andres Freund
Hi,


On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote:
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time to.
> 
> Let me join to friendly competition.
> 
> I've bet on HATM approach: popcount-ing bitmaps for non-empty elements.

My concern with several of the proposals in this thread is that they
over-optimize for this specific case. It's not actually that crucial to
have a crazily optimized vacuum dead tid storage datatype. Having
something more general that also performs reasonably for the dead tuple
storage, but also performs well in a number of other cases, makes a lot
more sense to me.


> (Bad radix result probably due to smaller cache in notebook's CPU ?)

Probably largely due to the node dispatch: a) for some reason gcc likes
jump tables too much, and I get better numbers when disabling those; b) the
node type dispatch should be stuffed into the low bits of the pointer.
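
For the record, the pointer-tagging trick is the usual one (a generic
sketch, not code from any patch in this thread; it relies on nodes being
aligned to at least 4 bytes so the low two bits are free):

```
#include <stdint.h>

typedef enum NodeKind { NODE_4, NODE_16, NODE_48, NODE_256 } NodeKind;

#define NODE_KIND_MASK	((uintptr_t) 0x3)

static inline void *
node_tag(void *node, NodeKind kind)
{
	/* stuff the node kind into the low bits of the pointer */
	return (void *) ((uintptr_t) node | (uintptr_t) kind);
}

static inline NodeKind
node_kind(void *tagged)
{
	return (NodeKind) ((uintptr_t) tagged & NODE_KIND_MASK);
}

static inline void *
node_untag(void *tagged)
{
	return (void *) ((uintptr_t) tagged & ~NODE_KIND_MASK);
}
```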


> select prepare(100, 2, 100, 1);
> 
>          attach   size   shuffled
> array      6ms   12MB     53.42s
> intset    23ms   16MB     54.99s
> rtbm     115ms   38MB      8.19s
> tbm      186ms  100MB      8.37s
> vtbm     105ms   59MB      9.08s
> radix     64ms   42MB     10.41s
> svtm      73ms   10MB      7.49s
> 
> Test4
> 
> select prepare(100, 100, 1, 1);
> 
>          attach   size   shuffled
> array    304ms  600MB     75.12s
> intset   775ms   98MB     47.49s
> rtbm     356ms   38MB      4.11s
> tbm      539ms  100MB      4.20s
> vtbm     493ms   42MB      4.44s
> radix    263ms   42MB      6.05s
> svtm     360ms    8MB      3.49s
> 
> Therefore the Specialized Vacuum Tid Map always consumes the least
> memory and is usually faster.

Impressive.

Greetings,

Andres Freund




Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-28 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Jul 28, 2021 at 09:10:35PM +0530, Bharath Rupireddy wrote:
>> Hm. That's a good idea to show up in the ps display.

> Keep in mind that ps is apparently so expensive under windows that the GUC
> defaults to off.
> The admin can leave the ps display off, but I wonder if it's of any concern
> that something so expensive can be caused by an unauthenticated connection.

I'm detecting a certain amount of lily-gilding here.  Neither of these
delays are meant for anything except debugging purposes, and nobody as
far as I've heard has ever expressed great concern about identifying
which process they need to attach to for that purpose.  So I think it
is a *complete* waste of time to add any cycles to connection startup
to make these delays more visible.

I follow the idea of using WaitLatch to ensure that the delays are
interruptible by postmaster signals, but even that isn't worth a
lot given the expected use of these things.  I don't see a need to
expend any extra effort on wait-reporting.
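
To be clear, the WaitLatch version needs nothing more than the usual
pattern, e.g. for pre_auth_delay (a sketch; WAIT_EVENT_PRE_AUTH_DELAY is
the patch's proposed wait event, not something that exists today):

```
	if (PreAuthDelay > 0)
	{
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 PreAuthDelay * 1000L,	/* GUC is in seconds */
						 WAIT_EVENT_PRE_AUTH_DELAY);
		ResetLatch(MyLatch);
	}
```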

regards, tom lane




Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-28 Thread Justin Pryzby
On Wed, Jul 28, 2021 at 09:10:35PM +0530, Bharath Rupireddy wrote:
> On Wed, Jul 28, 2021 at 8:12 AM Kyotaro Horiguchi  
> wrote:
> >
> > - It seems that the additional wait-event is effectively useless for
> >  most of the processes. Considering that this feature is for debugging
> >  purpose, it'd be better to use ps display instead (or additionally)
> >  if we want to see the wait event anywhere.
> 
> Hm. That's a good idea to show up in the ps display.

Keep in mind that ps is apparently so expensive under windows that the GUC
defaults to off.

The admin can leave the ps display off, but I wonder if it's of any concern
that something so expensive can be caused by an unauthenticated connection.

> > The events of autovacuum workers can be seen in pg_stat_activity properly.
> >
> > For client-backends, that state cannot be seen in
> > pg_stat_activity. That seems inevitable since backends aren't
> > allocated a PGPROC entry yet at that time. (So the wait event is set
> > to local memory as a safety measure in this case.)  On the other hand,
> > I find it inconvenient that the ps display is shown as just "postgres"
> > while in that state.  I think we can show "postgres: preauth waiting"
> > or something.  (It is shown as "postgres: username dbname [conn]
> > initializing" while PostAuthDelay)
> 
> Hm. Is n't it better to show something like below in the ps display?
> for pre_auth_delay: "postgres: pre auth delay"
> for post_auth_delay: "postgres: <> post auth delay"
> 
> But, I'm not sure whether this ps display thing will add any value to
> the end user who always can't see the ps display. So, how about having
> both i.e. ps display (useful for pre auth delay cases) and wait event
> (useful for post auth delay)?




Re: speed up verifying UTF-8

2021-07-28 Thread John Naylor
I wrote:

> On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <
sitnikov.vladi...@gmail.com> wrote:
> >
> > >+ utf8_advance(s, state, len);
> > >+
> > >+ /*
> > >+ * If we saw an error during the loop, let the caller handle it. We
treat
> > >+ * all other states as success.
> > >+ */
> > >+ if (state == ERR)
> > >+ return 0;
> >
> > Did you mean state = utf8_advance(s, state, len); there? (reassign
state variable)
>
> Yep, that's a bug, thanks for catching!

Fixed in v21, with a regression test added. Also, utf8_advance() now
directly changes state by a passed pointer rather than returning a value.
Some cosmetic changes:

s/valid_bytes/non_error_bytes/ since the former is kind of misleading now.

Some other var name and symbol changes. In my first DFA experiment, ASC
conflicted with the parser or scanner somehow, but it doesn't here, so it's
clearer to use this.

Rewrote a lot of comments about the state machine and regression tests.
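
For anyone following along, here is a toy transition-table validator for
just the ASCII-plus-two-byte subset of UTF-8, to show the state-machine
shape (the real patch uses a shift-based DFA covering all sequence
lengths; this toy is not from the patch):

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { S_ACCEPT = 0, S_TWO = 1, S_ERR = 2, N_STATES = 3 };

/* byte classes: 0 = ASCII, 1 = C2..DF lead byte, 2 = 80..BF continuation,
 * 3 = anything else (rejected in this toy subset) */
static int
byte_class(uint8_t b)
{
	if (b < 0x80)
		return 0;
	if (b >= 0xC2 && b <= 0xDF)
		return 1;
	if (b <= 0xBF)				/* 0x80..0xBF */
		return 2;
	return 3;
}

static const uint8_t transition[N_STATES][4] = {
	/* S_ACCEPT */ {S_ACCEPT, S_TWO, S_ERR, S_ERR},
	/* S_TWO    */ {S_ERR, S_ERR, S_ACCEPT, S_ERR},
	/* S_ERR    */ {S_ERR, S_ERR, S_ERR, S_ERR},
};

static bool
validate(const uint8_t *s, size_t len)
{
	int			state = S_ACCEPT;

	for (size_t i = 0; i < len; i++)
		state = transition[state][byte_class(s[i])];
	/* must not end in the middle of a multibyte sequence */
	return state == S_ACCEPT;
}
```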
--
John Naylor
EDB: http://www.enterprisedb.com


v21-0001-Add-fast-paths-for-validating-UTF-8-text.patch
Description: Binary data


Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-28 Thread Andres Freund
Hi,

On 2021-07-28 13:37:48 -0400, Melanie Plageman wrote:
> Attached is a patch on top of the AIO branch which does bitmapheapscan
> prefetching using the PgStreamingRead helper already used by sequential
> scan and vacuum on the AIO branch.
>
> The prefetch iterator is removed and the main iterator in the
> BitmapHeapScanState node is now used by the PgStreamingRead helper.

Cool! I'm heartened to see "12 files changed, 272 insertions(+), 495 
deletions(-)"


It's worth calling out that this fixes some abstraction leakiness around
tableam too...


> Each IO will have its own TBMIterateResult allocated and returned by the
> PgStreamingRead helper and freed later by
> heapam_scan_bitmap_next_block() before requesting the next block.
> Previously it was allocated once and saved in the TBMIterator in the
> BitmapHeapScanState node and reused. Because of this, the table AM API
> routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
> as an output parameter.
>
> I haven't made the GIN code reasonable yet either (it uses the TID
> bitmap functions that I've changed).

I don't quite understand the need to change the tidbitmap interface, or
maybe rather I'm not convinced that pessimistically preallocating space
is a good idea?


> I don't see a need for it right now.
> Because the PgStreamingReadHelper needs to be set up with the
> BitmapHeapScanState node but also needs some table AM specific
> functions, I thought it made more sense to initialize it using a new
> table AM API routine. Instead of fully implementing that I just wrote a
> wrapper function, table_bitmap_scan_setup() which just calls
> bitmapheap_pgsr_alloc() to socialize the idea before implementing it.

That makes sense.


>  static bool
>  heapam_scan_bitmap_next_block(TableScanDesc scan,
> -							  TBMIterateResult *tbmres)
> +							  TBMIterateResult **tbmres)

ISTM that we possibly shouldn't expose the TBMIterateResult outside of
the AM after this change? It feels somewhat like an implementation
detail now. It seems somewhat odd to expose a ** to set a pointer that
nodeBitmapHeapscan.c then doesn't really deal with itself.


> @@ -695,8 +693,7 @@ tbm_begin_iterate(TIDBitmap *tbm)
>  	 * Create the TBMIterator struct, with enough trailing space to serve the
>  	 * needs of the TBMIterateResult sub-struct.
>  	 */
> -	iterator = (TBMIterator *) palloc(sizeof(TBMIterator) +
> -									  MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
> +	iterator = (TBMIterator *) palloc(sizeof(TBMIterator));
>  	iterator->tbm = tbm;

Hm?


> diff --git a/src/include/storage/aio.h b/src/include/storage/aio.h
> index 9a07f06b9f..8e1aa48827 100644
> --- a/src/include/storage/aio.h
> +++ b/src/include/storage/aio.h
> @@ -39,7 +39,7 @@ typedef enum IoMethod
>  } IoMethod;
>
>  /* We'll default to bgworker. */
> -#define DEFAULT_IO_METHOD IOMETHOD_WORKER
> +#define DEFAULT_IO_METHOD IOMETHOD_IO_URING

I agree with the sentiment, but ... :)

Greetings,

Andres Freund




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jul 28, 2021 at 1:05 PM Tom Lane  wrote:
>> That would clearly be a bug fix.  I'm just troubled that there are
>> still behaviors that people will see as bugs.

> That's a reasonable thing to be troubled about, but the date and time
> related datatypes have so many odd and crufty behaviors that I have a
> hard time believing that there's another possible outcome.

There's surely a ton of cruft there, but I think most of it stems from
western civilization's received rules for timekeeping, which we can do
little about.  But the fact that interval_in accepts '1.4 years' when
it cannot do anything very reasonable with that input is entirely
self-inflicted damage.

BTW, I don't have a problem with the "interval * float8" operator
doing equally strange things, because if you don't like what it
does you can always write your own multiplication function that
you like better.  There can be only one interval_in, though,
so I don't think it should be misrepresenting the fundamental
capabilities of the datatype.

regards, tom lane




Re: Asynchronous and "direct" IO support for PostgreSQL.

2021-07-28 Thread Melanie Plageman
On Tue, Feb 23, 2021 at 5:04 AM Andres Freund  wrote:
>
> ## AIO API overview
>
> The main steps to use AIO (without higher level helpers) are:
>
> 1) acquire an "unused" AIO: pgaio_io_get()
>
> 2) start some IO, this is done by functions like
>pgaio_io_start_(read|write|fsync|flush_range)_(smgr|sb|raw|wal)
>
>The (read|write|fsync|flush_range) indicates the operation, whereas
>(smgr|sb|raw|wal) determines how IO completions, errors, ... are handled.
>
>(see below for more details about this design choice - it might or not be
>right)
>
> 3) optionally: assign a backend-local completion callback to the IO
>(pgaio_io_on_completion_local())
>
> 4) 2) alone does *not* cause the IO to be submitted to the kernel, but to be
>put on a per-backend list of pending IOs. The pending IOs can be explicitly
>    flushed with pgaio_submit_pending(), but will also be submitted if the
>pending list gets to be too large, or if the current backend waits for the
>IO.
>
>    There are two main reasons not to submit the IO immediately:
>- If adjacent, we can merge several IOs into one "kernel level" IO during
>  submission. Larger IOs are considerably more efficient.
>    - Several AIO APIs allow submitting a batch of IOs in one system call.
>
> 5) wait for the IO: pgaio_io_wait() waits for an IO "owned" by the current
>backend. When other backends may need to wait for an IO to finish,
>pgaio_io_ref() can put a reference to that AIO in shared memory (e.g. a
>BufferDesc), which can be waited for using pgaio_io_wait_ref().
>
> 6) Process the results of the request. If a callback was registered in 3),
>this isn't always necessary. The results of AIO can be accessed using
>pgaio_io_result() which returns an integer where negative numbers are
>-errno, and positive numbers are the [partial] success conditions
>(e.g. potentially indicating a short read).
>
> 7) release ownership of the io (pgaio_io_release()) or reuse the IO for
>another operation (pgaio_io_recycle())
>
>
> Most places that want to use AIO shouldn't themselves need to care about
> managing the number of writes in flight, or the readahead distance. To help
> with that there are two helper utilities, a "streaming read" and a "streaming
> write".
>
> The "streaming read" helper uses a callback to determine which blocks to
> prefetch - that allows doing readahead in a sequential fashion but importantly
> also allows asynchronously "reading ahead" non-sequential blocks.
>
> E.g. for vacuum, lazy_scan_heap() has a callback that uses the visibility map
> to figure out which block needs to be read next. Similarly lazy_vacuum_heap()
> uses the tids in LVDeadTuples to figure out which blocks are going to be
> needed. Here's the latter as an example:
> https://github.com/anarazel/postgres/commit/a244baa36bfb252d451a017a273a6da1c09f15a3#diff-3198152613d9a28963266427b380e3d4fbbfabe96a221039c6b1f37bc575b965R1906
>

Attached is a patch on top of the AIO branch which does bitmapheapscan
prefetching using the PgStreamingRead helper already used by sequential
scan and vacuum on the AIO branch.

The prefetch iterator is removed and the main iterator in the
BitmapHeapScanState node is now used by the PgStreamingRead helper.

Some notes about the code:

Each IO will have its own TBMIterateResult allocated and returned by the
PgStreamingRead helper and freed later by
heapam_scan_bitmap_next_block() before requesting the next block.
Previously it was allocated once and saved in the TBMIterator in the
BitmapHeapScanState node and reused. Because of this, the table AM API
routine, table_scan_bitmap_next_block() now defines the TBMIterateResult
as an output parameter.

The PgStreamingRead helper pgsr_private parameter for BitmapHeapScan is
now the actual BitmapHeapScanState node. It needed access to the
iterator, the heap scan descriptor, and a few fields in the
BitmapHeapScanState node that could be moved elsewhere or duplicated
(visibility map buffer and can_skip_fetch, for example). So, it is
possible to either create a new struct or move fields around to avoid
this--but, I'm not sure if that would actually be better.

Because the PgStreamingReadHelper needs to be set up with the
BitmapHeapScanState node but also needs some table AM specific
functions, I thought it made more sense to initialize it using a new
table AM API routine. Instead of fully implementing that I just wrote a
wrapper function, table_bitmap_scan_setup() which just calls
bitmapheap_pgsr_alloc() to socialize the idea before implementing it.
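
For illustration, the wrapper amounts to something like this (the exact
signature is an assumption on my part; only the shape matters):

/*
 * Stand-in for a future table AM callback: set up the streaming read
 * helper for a bitmap heap scan.  For now it just forwards to the heap
 * AM's allocator.
 */
void
table_bitmap_scan_setup(BitmapHeapScanState *scanstate)
{
	bitmapheap_pgsr_alloc(scanstate);
}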

I haven't made the GIN code reasonable yet either (it uses the TID
bitmap functions that I've changed).

There are various TODOs in the code posing questions both to the
reviewer and myself for future versions of the patch.

Oh, also, I haven't updated the failing partition_prune regression test
because I haven't had a chance to look at the EXPLAIN code which adds
the text which is not being produced to see if it is 

Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Robert Haas
On Wed, Jul 28, 2021 at 1:05 PM Tom Lane  wrote:
> Fair point, but you decided when you chose to use float that you don't
> care about the differences between numbers that only differ at the
> seventeenth or so decimal place.  (Maybe, if you don't understand what
> float is, you didn't make that choice intentionally ... but that's
> a documentation issue not a code shortcoming.)  However, it's fairly
> hard to believe that somebody who writes '1.4 years'::interval doesn't
> care about the 0.4 year.  The fact that we silently convert that to,
> effectively, 1.3333... years seems like a bigger roundoff error
> than one would expect.

Yeah, that's definitely a fair point!

> > I am dubious that it's worth the pain of making the input function
> > reject cases involving fractional units. It's true that some people
> > here aren't happy with the current behavior, but they may no happier
> > if we reject those cases with an error, and other people may then be
> > unhappy too.
>
> Maybe.  A possible compromise is to accept only exactly-representable
> fractions.  Then, for instance, we'd take 1.5 years (resulting in 18
> months) but not 1.4 years.  Now, this might fall foul of your point about
> not wanting to mislead people into expecting the system to do things it
> can't; but I'd argue that the existing behavior misleads them much more.

Well, let's see what other people think.

> > I think your previous idea was the best one so far: fix
> > the input function so that 'X years Y months' and 'Y months X years'
> > always produce the same answer, and call it good.
>
> That would clearly be a bug fix.  I'm just troubled that there are
> still behaviors that people will see as bugs.

That's a reasonable thing to be troubled about, but the date and time
related datatypes have so many odd and crufty behaviors that I have a
hard time believing that there's another possible outcome. If somebody
showed up today and proposed a new data type and told us that the way
to format values of that data type was to say to_char(my_value,
alphabet_soup) I think they would not be taken very seriously. A lot
of this code, and the associated interfaces, date back to a time when
PostgreSQL was far more primitive than today, and when databases in
general were as well. At least we didn't end up with a datatype called
varchar2 ... or not yet, anyway.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Fix around conn_duration in pgbench

2021-07-28 Thread Fujii Masao




On 2021/07/28 16:15, Yugo NAGATA wrote:

I found another disconnect_all().

/* XXX should this be connection time? */
disconnect_all(state, nclients);

The measurement is also not necessary here.
So the above comment should be removed or updated?


I think this disconnect_all will be a no-op because all connections should
already be closed in threadRun(), but I left it just to be sure that
connections are all cleaned up. I updated the comment to explain the above.

I attached the updated patch. Could you please look at this?


Thanks for updating the patch! LGTM.

Barring any objection, I will commit the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: CREATE SEQUENCE with RESTART option

2021-07-28 Thread Tom Lane
Fujii Masao  writes:
> On 2021/07/28 23:53, Bharath Rupireddy wrote:
>> -1. IMHO, this creates more confusion for the user. We would be saying
>> that we allow both START and RESTART, and that RESTART is accepted as a
>> consequence of our internal option handling in gram.y. Instead, I
>> recommend throwing errorConflictingDefElem or errmsg("START and
>> RESTART are mutually exclusive options"). We do throw these errors in
>> a lot of other places for various options. Others may have better
>> thoughts though.

> Per docs, CREATE SEQUENCE conforms to the SQL standard, with some exceptions.
> So I'd agree with Michael if CREATE SEQUENCE with RESTART also conforms to
> the SQL standard, but I'd agree with Bharath otherwise.

I do not see any RESTART option in SQL:2021 11.72 <sequence generator definition>.  Since we don't document it either, there's really no
expectation that anyone would use it.

I don't particularly think that we should document it, so I'm with Michael
that we don't need to do anything.  This is hardly the only undocumented
corner case in PG.

regards, tom lane




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Tom Lane
Robert Haas  writes:
> On Wed, Jul 28, 2021 at 11:52 AM Tom Lane  wrote:
>> You know, I was thinking exactly that thing earlier.  Changing the
>> on-disk representation is certainly a nonstarter, but the problem
>> here stems from expecting interval_in to do something sane with inputs
>> that do not correspond to any representable value.  I do not think we
>> have any other datatypes where we expect the input function to make
>> choices like that.

> A case that is perhaps more theoretically similar to the instance at
> hand is rounding during the construction of floating point values. My
> system thinks '1.0000000000000001'::float = '1'::float, so
> in that case, as in this one, we've decided that it's OK to build an
> inexact representation of the input value.

Fair point, but you decided when you chose to use float that you don't
care about the differences between numbers that only differ at the
seventeenth or so decimal place.  (Maybe, if you don't understand what
float is, you didn't make that choice intentionally ... but that's
a documentation issue not a code shortcoming.)  However, it's fairly
hard to believe that somebody who writes '1.4 years'::interval doesn't
care about the 0.4 year.  The fact that we silently convert that to,
effectively, 1.3333... years seems like a bigger roundoff error
than one would expect.

> I am dubious that it's worth the pain of making the input function
> reject cases involving fractional units. It's true that some people
> here aren't happy with the current behavior, but they may no happier
> if we reject those cases with an error, and other people may then be
> unhappy too.

Maybe.  A possible compromise is to accept only exactly-representable
fractions.  Then, for instance, we'd take 1.5 years (resulting in 18
months) but not 1.4 years.  Now, this might fall foul of your point about
not wanting to mislead people into expecting the system to do things it
can't; but I'd argue that the existing behavior misleads them much more.

> I think your previous idea was the best one so far: fix
> the input function so that 'X years Y months' and 'Y months X years'
> always produce the same answer, and call it good.

That would clearly be a bug fix.  I'm just troubled that there are
still behaviors that people will see as bugs.

regards, tom lane




Re: CREATE SEQUENCE with RESTART option

2021-07-28 Thread Fujii Masao




On 2021/07/28 23:53, Bharath Rupireddy wrote:

On Wed, Jul 28, 2021 at 11:50 AM Michael Paquier  wrote:


On Mon, Jul 26, 2021 at 04:57:53PM +0900, Michael Paquier wrote:

FWIW, like Ashutosh upthread, my vote would be to do nothing here in
terms of behavior changes as this is just breaking a behavior for the
sake of breaking it, so there are chances that this is going to piss
some users that relied accidentally on the existing behavior.


In short, I would be tempted with something like the attached, that
documents RESTART in CREATE SEQUENCE, while describing its behavior
according to START.  In terms of regression tests, there is already a
lot in this area with ALTER SEQUENCE, but I think that having two
tests makes sense for CREATE SEQUENCE: one for RESTART without a
value and one with, where both explicitly set START.

Thoughts?


-1. IMHO, this creates more confusion for the user. We would be saying
that we allow both START and RESTART, and that RESTART is accepted as a
consequence of our internal option handling in gram.y. Instead, I
recommend throwing errorConflictingDefElem or errmsg("START and
RESTART are mutually exclusive options"). We do throw these errors in
a lot of other places for various options. Others may have better
thoughts though.


Per docs, CREATE SEQUENCE conforms to the SQL standard, with some exceptions.
So I'd agree with Michael if CREATE SEQUENCE with RESTART also conforms to
the SQL standard, but I'd agree with Bharath otherwise.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
Andrew Dunstan  writes:
> Is it worth making the first one a macro?

It'd be the same from a source-code perspective, but probably a
shade bulkier in terms of object code.

regards, tom lane




Re: Why don't update minimum recovery point in xact_redo_abort

2021-07-28 Thread Fujii Masao




On 2021/07/28 12:44, Fujii Masao wrote:



On 2021/07/28 6:25, Heikki Linnakangas wrote:

On 27/07/2021 19:49, Fujii Masao wrote:

Anyway I attached the patch that changes only xact_redo_abort()
so that it calls XLogFlush() to update min recovery point.


Looks good to me, thanks! FWIW, I used the attached script to reproduce this.


Thanks for the review!

Barring any objection, I will commit the patch and
back-patch it to all supported versions.


Pushed. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Robert Haas
On Wed, Jul 28, 2021 at 11:52 AM Tom Lane  wrote:
> You know, I was thinking exactly that thing earlier.  Changing the
> on-disk representation is certainly a nonstarter, but the problem
> here stems from expecting interval_in to do something sane with inputs
> that do not correspond to any representable value.  I do not think we
> have any other datatypes where we expect the input function to make
> choices like that.

It's not exactly the same issue, but the input function whose behavior
most regularly trips people up is bytea, because they try something
like 'x'::bytea and it seems to DWTW so then they try it on all their
data and discover that, for example, '\'::bytea fails outright, or
that ''::bytea = '\x'::bytea, contrary to expectations. People often
seem to think that casting to bytea should work like convert_to(), but
it doesn't. As in the case at hand, byteain() has to guess whether the
input is intended to be the 'hex' or 'escape' format, and because the
'escape' format looks a lot like plain old text, confusion ensues.
Now, guessing between two input formats that are both legal for the
data type is not exactly the same as guessing what to do with a value
that's not directly representable, but it has the same ultimate effect
i.e. the human beings perceive the system as buggy.
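
A quick hedged illustration of that guessing game (with the default
bytea_output = 'hex'):

SELECT '\x41'::bytea;   -- leading \x, so parsed as hex format: \x41
SELECT 'abc'::bytea;    -- no \x prefix, so parsed as escape format: \x616263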

A case that is perhaps more theoretically similar to the instance at
hand is rounding during the construction of floating point values. My
system thinks '1.0000000000000001'::float = '1'::float, so
in that case, as in this one, we've decided that it's OK to build an
inexact representation of the input value. I don't really see what can
be done about this considering that the textual representation uses
base 10 and the internal representation uses base 2, but I think this
doesn't cause us as many problems in practice because people
understand how it works, which doesn't seem to be the case with the
interval data type, at last if this thread is any indication.

I am dubious that it's worth the pain of making the input function
reject cases involving fractional units. It's true that some people
here aren't happy with the current behavior, but they may no happier
if we reject those cases with an error, and other people may then be
unhappy too. I think your previous idea was the best one so far: fix
the input function so that 'X years Y months' and 'Y months X years'
always produce the same answer, and call it good.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Out-of-memory error reports in libpq

2021-07-28 Thread Andrew Dunstan


On 7/28/21 11:02 AM, Tom Lane wrote:
>
> Here I've got to disagree.  We do need the form with a PQExpBuffer
> argument, because there are some places where that isn't a pointer
> to a PGconn's errorMessage.  But the large majority of the calls
> are "pqReportOOM(conn)", and I think having to write that as
> "pqReportOOM(&conn->errorMessage)" is fairly ugly and perhaps
> error-prone.
>
> I'm not wedded to the name "pqReportOOMBuffer" though --- maybe
> there's some better name for that one?
>
>   



Is it worth making the first one a macro?
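
Something like this minimal sketch, reusing the names from upthread:

#define pqReportOOM(conn) pqReportOOMBuffer(&(conn)->errorMessage)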


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: truncating timestamps on arbitrary intervals

2021-07-28 Thread John Naylor
On Wed, Jul 28, 2021 at 12:15 AM Michael Paquier wrote:
>
> On Tue, Jul 27, 2021 at 12:05:51PM -0400, John Naylor wrote:
> > Concretely, I propose to push the attached on master and v14. Since we're
> > in beta 2 and this thread might not get much attention, I've CC'd the RMT.
>
> (It looks like gmail has messed up a bit the format of your last
> message.)

Hmm, it looks fine in the archives.

> Hmm.  The docs say also the following thing, but your patch does not
> reflect that anymore:
> "Negative intervals are allowed and are treated the same as positive
> intervals."

I'd forgotten that was documented based on incomplete information, thanks
for looking! Pushed with that fixed.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Tom Lane
Robert Haas  writes:
> If we were doing this over again, I would argue that, with this
> on-disk representation, 0.7 months ought to be rejected as invalid
> input, because it's generally not a good idea to have a data type that
> silently converts a value into a different value that is not
> equivalent for all purposes. It is confusing and causes people to
> expect behavior different from what they will actually get. Now, it
> seems far too late to consider such a change at this point ... and it
> is also no good considering a change to the on-disk representation of
> the existing data type at this point ... but it is also no good
> pretending like we have a floating-point representation of months and
> days when we actually do not.

You know, I was thinking exactly that thing earlier.  Changing the
on-disk representation is certainly a nonstarter, but the problem
here stems from expecting interval_in to do something sane with inputs
that do not correspond to any representable value.  I do not think we
have any other datatypes where we expect the input function to make
choices like that.

Is it really too late to say "that was a damfool idea" and drop fractional
years/months/etc from interval_in's lexicon?  By definition, this will not
create any dump/reload problems, because interval_out will never emit any
such thing.  It will break applications that are expecting such syntax to
do something sane.  But that expectation is fundamentally not meetable,
so maybe we should just make a clean break.  (Without back-patching it,
of course.)

I'm not entirely sure about whether to reject fractional days, though.
Converting those on the assumption of 1 day == 24 hours is not quite
theoretically right.  But it's unsurprising, which is not something
we can say about fractions of the larger units.  So maybe it's OK to
continue accepting that.
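
For example (as I understand the current behavior):

SELECT '1.5 days'::interval;
-- 1 day 12:00:00   (the 0.5 day spills to hours via 1 day == 24 hours)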

regards, tom lane




Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep

2021-07-28 Thread Bharath Rupireddy
On Wed, Jul 28, 2021 at 8:12 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> At Tue, 27 Jul 2021 11:04:09 +0530, Bharath Rupireddy 
>  wrote in
> > On Mon, Jul 26, 2021 at 11:03 PM Bossart, Nathan  
> > wrote:
> > PSA v3 patch.
>
> I have some comments.
>
> - No harm, but it's pointless to feed MyLatch to WaitLatch when
>  WL_LATCH_SET is not set (or rather misleading).

+1. I can send NULL to WaitLatch.
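
A minimal sketch of what that could look like (WAIT_EVENT_PRE_AUTH_DELAY
is a hypothetical name for a wait event the patch would have to add):

/* instead of pg_usleep(PreAuthDelay * 1000000L) */
(void) WaitLatch(NULL,
                 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 PreAuthDelay * 1000L,		/* timeout in msec */
                 WAIT_EVENT_PRE_AUTH_DELAY);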

> - It seems that the additional wait-event is effectively useless for
>  most of the processes. Considering that this feature is for debugging
>  purpose, it'd be better to use ps display instead (or additionally)
>  if we want to see the wait event anywhere.

Hm. That's a good idea to show up in the ps display.

> The events of autovacuum workers can be seen in pg_stat_activity properly.
>
> For client-backends, that state cannot be seen in
> pg_stat_activity. That seems inevitable since backends aren't
> allocated a PGPROC entry yet at that time. (So the wait event is set
> to local memory as a safety measure in this case.)  On the other hand,
> I find it inconvenient that the ps display is shown as just "postgres"
> while in that state.  I think we can show "postgres: preauth waiting"
> or something.  (It is shown as "postgres: username dbname [conn]
> initializing" while PostAuthDelay)

Hm. Isn't it better to show something like below in the ps display?
for pre_auth_delay: "postgres: pre auth delay"
for post_auth_delay: "postgres: <> post auth delay"

But, I'm not sure whether this ps display thing will add any value to
the end user, who can't always see the ps display. So, how about having
both, i.e. the ps display (useful for pre auth delay cases) and the wait event
(useful for post auth delay)?

Regards,
Bharath Rupireddy.




Re: when the startup process doesn't (logging startup delays)

2021-07-28 Thread Robert Haas
On Wed, Jul 28, 2021 at 11:25 AM Bharath Rupireddy
 wrote:
> > Perhaps during initdb we don't want messages, but I'm not sure that we
> > need to do anything about that here. None of the messages that the
> > server normally produces show up when you run initdb, so I guess they
> > are getting redirected to /dev/null or something.
>
> I enabled the below log message in XLogFlush and ran initdb, it is
> printing these logs onto the stdout, looks like the logs have not been
> redirected to /dev/null in initdb/bootstrap mode.
>
> #ifdef WAL_DEBUG
> if (XLOG_DEBUG)
> elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
> LSN_FORMAT_ARGS(record),
> LSN_FORMAT_ARGS(LogwrtResult.Write),
> LSN_FORMAT_ARGS(LogwrtResult.Flush));
> #endif
>
> So, even in bootstrap mode, can we use the above #ifdef WAL_DEBUG and
> XLOG_DEBUG to print those logs? It looks like the problem with these
> macros is that they are not settable by normal users in the production
> environment, but only by the developers. I'm not sure how much it is
> helpful to print the startup process progress logs in bootstrap mode.
> Maybe we can use the IsBootstrapProcessingMode macro to disable these
> logs in the bootstrap mode.

I don't think we should drag XLOG_DEBUG into this. That's a debugging
facility that isn't really relevant to the topic at hand. The point is
the server normally prints a bunch of messages that we don't see in
bootstrap mode. For example:

[rhaas pgsql]$ postgres
2021-07-28 11:32:33.824 EDT [36801] LOG:  starting PostgreSQL 15devel
on x86_64-apple-darwin19.6.0, compiled by clang version 5.0.2
(tags/RELEASE_502/final), 64-bit
2021-07-28 11:32:33.825 EDT [36801] LOG:  listening on IPv6 address
"::1", port 5432
2021-07-28 11:32:33.825 EDT [36801] LOG:  listening on IPv4 address
"127.0.0.1", port 5432
2021-07-28 11:32:33.826 EDT [36801] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5432"
2021-07-28 11:32:33.846 EDT [36802] LOG:  database system was shut
down at 2021-07-28 11:32:32 EDT
2021-07-28 11:32:33.857 EDT [36801] LOG:  database system is ready to
accept connections

None of that shows up when you run initdb. Taking a fast look at the
code, I don't think the reasons are the same for all of those
messages. Some of the code isn't reached, whereas e.g. "database
system was shut down at 2021-07-28 11:32:32 EDT" is special-cased. I'm
not sure right off the top of my head what this code should do, but
ideally it looks something like one of the cases we've already got.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Robert Haas
On Tue, Jul 27, 2021 at 4:02 PM Tom Lane  wrote:
> What I think we have consensus on is that interval_in is doing the
> wrong thing in a particular corner case.  I have heard nobody but
> you suggesting that we should start undertaking behavioral changes
> in other interval functions, and I don't believe that that's a good
> road to start going down.  These behaviors have stood for many years.
> Moreover, since the whole thing is by definition operating with
> inadequate information, it is inevitable that for every case you
> make better there will be another one you make worse.

I agree that we need to be really conservative here. I think Tom is
right that if we start changing behaviors that "seem wrong," we will
probably make some things better and other things worse. The overall
amount of stuff that "seems wrong" will probably not go down, but a
lot of people's applications will break when they try to upgrade to
v15. That's not going to be a win overall.

I think a lot of the discussion on this thread consists of people
hoping for things that are not very realistic. The interval type
represents the number of months as an integer, and the number of days
as an integer. That means that an interval like '0.7 months' does not
really exist. If you ask for that interval what you get is actually 21
days, which is a reasonable approximation of 0.7 months but not the
same thing, except in April, June, September, and November. So when
you then say that you want 0.7 months + 0.3 months to equal 1.0
months, what you're really requesting is that 21 days + 9 days = 1
month. That system has been tried in the past, but it was abandoned
roughly around the time of Julius Caesar for the very good reason that
the orbital period of the earth about the sun is noticeably greater
than 360 days.
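
Concretely, a hedged illustration of the stored values (from a build that
truncates rather than rounds):

SELECT '0.7 months'::interval;
-- 21 days

SELECT '0.7 months'::interval + '0.3 months'::interval;
-- 30 days   (not "1 mon": once converted, the arithmetic happens in days)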

It would be entirely possible to design a data type that could
represent such values more exactly. A data type that had a
representation similar to interval but with double values for the
numbers of years and months would be able to compute 0.7 months + 0.3
months and get 1.0 months with no problem.

If we were doing this over again, I would argue that, with this
on-disk representation, 0.7 months ought to be rejected as invalid
input, because it's generally not a good idea to have a data type that
silently converts a value into a different value that is not
equivalent for all purposes. It is confusing and causes people to
expect behavior different from what they will actually get. Now, it
seems far too late to consider such a change at this point ... and it
is also no good considering a change to the on-disk representation of
the existing data type at this point ... but it is also no good
pretending like we have a floating-point representation of months and
days when we actually do not.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2021-07-28 Thread Alvaro Herrera
On 2021-Jul-27, Drouvot, Bertrand wrote:

> diff --git a/src/backend/utils/cache/lsyscache.c 
> b/src/backend/utils/cache/lsyscache.c

> +bool
> +get_rel_logical_catalog(Oid relid)
> +{
> +	bool		res;
> + Relation rel;
> +
> + /* assume previously locked */
> + rel = table_open(relid, NoLock);
> + res = RelationIsAccessibleInLogicalDecoding(rel);
> + table_close(rel, NoLock);
> +
> + return res;
> +}

So RelationIsAccessibleInLogicalDecoding() does a cheap check for
wal_level which can be done without opening the table; I think this
function should be rearranged to avoid doing that when not needed.
Also, putting this function in lsyscache.c seems somewhat wrong since
it's not merely accessing the system caches ...

I think it would be better to move this elsewhere (relcache.c, proto in
relcache.h, perhaps call it RelationIdIsAccessibleInLogicalDecoding) and
short-circuit for the check that can be done before opening the table.
At least the GiST code appears to be able to call this several times per
vacuum run, so it makes sense to short-circuit it for the fast case.
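
For illustration, a minimal sketch of that rearrangement (the function
name follows the suggestion above; the wal_level test is the cheap part
that can run before table_open):

bool
RelationIdIsAccessibleInLogicalDecoding(Oid relid)
{
	bool		res;
	Relation	rel;

	/* fast path when logical decoding is not enabled at all */
	if (!XLogLogicalInfoActive())
		return false;

	/* assume the relation was previously locked by our caller */
	rel = table_open(relid, NoLock);
	res = RelationIsAccessibleInLogicalDecoding(rel);
	table_close(rel, NoLock);

	return res;
}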

... though looking at the GiST code again I wonder if it would be more
sensible to just stash the table's Relation pointer somewhere in the
context structs instead of opening and closing it time and again.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Investigación es lo que hago cuando no sé lo que estoy haciendo"
(Wernher von Braun)




Re: when the startup process doesn't (logging startup delays)

2021-07-28 Thread Bharath Rupireddy
On Wed, Jul 28, 2021 at 7:02 PM Robert Haas  wrote:
>
> On Wed, Jul 28, 2021 at 5:24 AM Nitin Jadhav
>  wrote:
> > I also agree that this is the better place to do it. Hence modified
> > the patch accordingly. The condition "!AmStartupProcess()" is added to
> > differentiate whether the call is done from a startup process or some
> > other process. Actually StartupXLOG() gets called in 2 places. one in
> > StartupProcessMain() and the other in InitPostgres(). As the logging
> > of the startup progress is required only during the startup process
> > and not in the other cases,
>
> The InitPostgres() case occurs when the server is started in bootstrap
> mode (during initdb) or in single-user mode (postgres --single). I do
> not see any reason why we shouldn't produce progress messages in at
> least the latter case. I suspect that someone who is in the rather
> desperate scenario of having to use single-user mode would really like
> to know how long the server is going to take to start up.

+1 to emitting the same log messages in single-user mode and, basically,
for whoever calls StartupXLOG. Do we need to adjust the GUC parameter
log_startup_progress_interval (to a reasonable value) so that the logs are
generated by default?

> Perhaps during initdb we don't want messages, but I'm not sure that we
> need to do anything about that here. None of the messages that the
> server normally produces show up when you run initdb, so I guess they
> are getting redirected to /dev/null or something.

I enabled the below log message in XLogFlush and ran initdb, it is
printing these logs onto the stdout, looks like the logs have not been
redirected to /dev/null in initdb/bootstrap mode.

#ifdef WAL_DEBUG
if (XLOG_DEBUG)
elog(LOG, "xlog flush request %X/%X; write %X/%X; flush %X/%X",
LSN_FORMAT_ARGS(record),
LSN_FORMAT_ARGS(LogwrtResult.Write),
LSN_FORMAT_ARGS(LogwrtResult.Flush));
#endif

So, even in bootstrap mode, can we use the above #ifdef WAL_DEBUG and
XLOG_DEBUG to print those logs? It looks like the problem with these
macros is that they are not settable by normal users in the production
environment, but only by the developers. I'm not sure how much it is
helpful to print the startup process progress logs in bootstrap mode.
Maybe we can use the IsBootstrapProcessingMode macro to disable these
logs in the bootstrap mode.

> So I don't think that using AmStartupProcess() for this purpose is right.

Agree.

Regards,
Bharath Rupireddy.




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Bruce Momjian
On Wed, Jul 28, 2021 at 08:42:31AM +0100, Dean Rasheed wrote:
> On Wed, 28 Jul 2021 at 00:08, John W Higgins  wrote:
> >
> > It's nice to envision all forms of fancy calculations. But the fact is that
> >
> > '1.5 month'::interval * 2 != '3 month"::interval
> >
> 
> That's not exactly true. Even without the patch:
> 
> SELECT '1.5 month'::interval * 2 AS product,
>'3 month'::interval AS expected,
>justify_interval('1.5 month'::interval * 2) AS justified_product,
>'1.5 month'::interval * 2 = '3 month'::interval AS equal;
> 
>     product     | expected | justified_product | equal
> ----------------+----------+-------------------+-------
>  2 mons 30 days | 3 mons   | 3 mons            | t
> (1 row)
> 
> So it's equal even without calling justify_interval() on the result.
> 
> FWIW, I remain of the opinion that the interval literal code should
> just spill down to lower units in all cases, just like the
> multiplication and division code, so that the results are consistent
> (barring floating point rounding errors) and explainable.

Here is a more minimal patch that doesn't change the spill-down units at
all, but merely documents them, and changes the spill-down to months to
round instead of truncate.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 453115f942..bd938a675a 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -2840,15 +2840,17 @@ P  years-months-
 
 
 
- In the verbose input format, and in some fields of the more compact
- input formats, field values can have fractional parts; for example
- '1.5 week' or '01:02:03.45'.  Such input is
- converted to the appropriate number of months, days, and seconds
- for storage.  When this would result in a fractional number of
- months or days, the fraction is added to the lower-order fields
- using the conversion factors 1 month = 30 days and 1 day = 24 hours.
- For example, '1.5 month' becomes 1 month and 15 days.
- Only seconds will ever be shown as fractional on output.
+ Field values can have fractional parts;  for example, '1.5
+ weeks' or '01:02:03.45'.  However,
+ because interval internally stores only three integer units (months,
+ days, seconds), fractional units must be spilled to smaller units.
+ For example, because months are approximated to equal 30 days,
+ fractional values of units greater than months are rounded to the
+ nearest integer number of months.  Fractional units of months or less
+ are computed to be an integer number of days and seconds, assuming
+ 24 hours per day.  For example, '1.5 months'
+ becomes 1 month 15 days.  Only seconds will ever
+ be shown as fractional on output.
 
 
 
diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 54ae632de2..cb3fa85892 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -3306,29 +3306,25 @@ DecodeInterval(char **field, int *ftype, int nf, int range,
 
 	case DTK_YEAR:
 		tm->tm_year += val;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 		tmask = DTK_M(YEAR);
 		break;
 
 	case DTK_DECADE:
 		tm->tm_year += val * 10;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 10;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 10);
 		tmask = DTK_M(DECADE);
 		break;
 
 	case DTK_CENTURY:
 		tm->tm_year += val * 100;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 100;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 100);
 		tmask = DTK_M(CENTURY);
 		break;
 
 	case DTK_MILLENNIUM:
 		tm->tm_year += val * 1000;
-		if (fval != 0)
-			tm->tm_mon += fval * MONTHS_PER_YEAR * 1000;
+		tm->tm_mon += rint(fval * MONTHS_PER_YEAR * 1000);
 		tmask = DTK_M(MILLENNIUM);
 		break;
 
@@ -3565,7 +3561,7 @@ DecodeISO8601Interval(char *str,
 			{
 case 'Y':
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	break;
 case 'M':
 	tm->tm_mon += val;
@@ -3601,7 +3597,7 @@ DecodeISO8601Interval(char *str,
 		return DTERR_BAD_FORMAT;
 
 	tm->tm_year += val;
-	tm->tm_mon += (fval * MONTHS_PER_YEAR);
+	tm->tm_mon += rint(fval * MONTHS_PER_YEAR);
 	if (unit == '\0')
 		return 0;
 	if (unit == 'T')
diff --git a/src/interfaces/ecpg/pgtypeslib/interval.c b/src/interfaces/ecpg/pgtypeslib/interval.c
index 02b3c47223..a7e530cb5d 100644
--- a/src/interfaces/ecpg/pgtypeslib/interval.c
+++ b/src/interfaces/ecpg/pgtypeslib/interval.c
@@ -155,7 +155,7 @@ DecodeISO8601Interval(char *str,
 			{
 case 'Y':
 	tm

Re: [Proposal] Global temporary tables

2021-07-28 Thread Tony Zhu
Hi Wenjing

would you please rebase the code?

Thank you very much
Tony

The new status of this patch is: Waiting on Author


Re: Out-of-memory error reports in libpq

2021-07-28 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Jul 27, 2021 at 10:31:25PM -0400, Tom Lane wrote:
>> Yeah, there are half a dozen places that currently print something
>> more specific than "out of memory".  I judged that the value of this
>> was not worth the complexity it'd add to support it in this scheme.
>> Different opinions welcome of course.

> I don't mind either that this removes a bit of context.  For
> unlikely-going-to-happen errors that's not worth the extra translation
> cost.

Yeah, the extra translatable strings are the main concrete cost of
keeping this behavior.  But I'm dubious that labeling a small number
of the possible OOM points is worth anything, especially if they're
not providing the failed allocation request size.  You can't tell if
that request was unreasonable or if it was just an unlucky victim
of bloat elsewhere.  Unifying the reports into a common function
could be a starting point for more consistent/detailed OOM reports,
if anyone cared to work on that.  (I hasten to add that I don't.)

> +   pqReportOOMBuffer(&conn->errorMessage);

> Not much a fan of having two routines to do this job though.  I would
> vote for keeping the one named pqReportOOM() with PQExpBuffer as
> argument.

Here I've got to disagree.  We do need the form with a PQExpBuffer
argument, because there are some places where that isn't a pointer
to a PGconn's errorMessage.  But the large majority of the calls
are "pqReportOOM(conn)", and I think having to write that as
"pqReportOOM(&conn->errorMessage)" is fairly ugly and perhaps
error-prone.

I'm not wedded to the name "pqReportOOMBuffer" though --- maybe
there's some better name for that one?

regards, tom lane




Re: CREATE SEQUENCE with RESTART option

2021-07-28 Thread Bharath Rupireddy
On Wed, Jul 28, 2021 at 11:50 AM Michael Paquier  wrote:
>
> On Mon, Jul 26, 2021 at 04:57:53PM +0900, Michael Paquier wrote:
> > FWIW, like Ashutosh upthread, my vote would be to do nothing here in
> > terms of behavior changes as this is just breaking a behavior for the
> > sake of breaking it, so there are chances that this is going to piss
> > some users that relied accidentally on the existing behavior.
>
> In short, I would be tempted with something like the attached, that
> documents RESTART in CREATE SEQUENCE, while describing its behavior
> according to START.  In terms of regression tests, there is already a
> lot in this area with ALTER SEQUENCE, but I think that having two
> tests makes sense for CREATE SEQUENCE: one for RESTART without a
> value and one with, where both explicitly set START.
>
> Thoughts?

-1. IMHO, this creates more confusion for the user. We would be saying
that we allow both START and RESTART, and that RESTART is accepted as a
consequence of our internal option handling in gram.y. Instead, I
recommend throwing errorConflictingDefElem or errmsg("START and
RESTART are mutually exclusive options"). We do throw these errors in
a lot of other places for various options. Others may have better
thoughts though.

Regards,
Bharath Rupireddy.




Re: Case expression pushdown

2021-07-28 Thread Gilles Darold

On 26/07/2021 at 18:03, Alexander Pyhalov wrote:

Tom Lane wrote on 2021-07-26 18:18:

Alexander Pyhalov  writes:

[ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ]


This doesn't compile cleanly:

deparse.c: In function 'foreign_expr_walker.isra.4':
deparse.c:920:8: warning: 'collation' may be used uninitialized in
this function [-Wmaybe-uninitialized]
 if (collation != outer_cxt->collation)
    ^
deparse.c:914:3: warning: 'state' may be used uninitialized in this
function [-Wmaybe-uninitialized]
   switch (state)
   ^~

These uninitialized variables very likely explain the fact that it fails
regression tests, both for me and for the cfbot.  Even if this weren't an
outright bug, we don't tolerate code that produces warnings on common
compilers.

    regards, tom lane


Hi.

Of course, this is a patch issue. I don't understand how I overlooked this.
Rebased on master and fixed it. Tests are passing here (but they also
passed for the previous patch version).


What exact tests are failing?



I confirm that there are no compilation warnings and all regression tests
pass successfully for the v7 patch. I have not checked the previous patch,
but this one doesn't fail on cfbot either.



Best regards,

--
Gilles Darold





Re: Some code cleanup for pgbench and pg_verifybackup

2021-07-28 Thread Robert Haas
On Tue, Jul 27, 2021 at 11:18 PM Michael Paquier  wrote:
> I have been looking at that.  Here are some findings:
> - In pg_archivecleanup, CleanupPriorWALFiles() does not bother
> exit()'ing with an error code.  Shouldn't we worry about reporting
> that correctly?  It seems strange to not inform users about paths that
> would be broken, as that could bloat the archives without one knowing
> about it.
> - In pg_basebackup.c, ReceiveTarAndUnpackCopyChunk() would not
> exit() when failing to change permissions for non-WIN32.
> - pg_recvlogical is missing a failure handling for fstat(), as of
> 5c0de38.
> - pg_verifybackup is in the wrong, as mentioned upthread.
>
> Thoughts?  All that does not seem to enter into the category of things
> worth back-patching, except for pg_verifybackup.

I think all of these are reasonable fixes. In the case of
pg_basebackup, a chmod() failure doesn't necessarily oblige us to give
up and exit; we could presumably still write the data. But it may be
best to give up and exit. The other cases appear to be clear bugs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Have I found an interval arithmetic bug?

2021-07-28 Thread John W Higgins
On Wed, Jul 28, 2021 at 12:42 AM Dean Rasheed wrote:

> On Wed, 28 Jul 2021 at 00:08, John W Higgins  wrote:
> >
> > It's nice to envision all forms of fancy calculations. But the fact is
> that
> >
> > '1.5 month'::interval * 2 != '3 month"::interval
> >
>
> That's not exactly true. Even without the patch:
>
> SELECT '1.5 month'::interval * 2 AS product,
>'3 month'::interval AS expected,
>justify_interval('1.5 month'::interval * 2) AS justified_product,
>'1.5 month'::interval * 2 = '3 month'::interval AS equal;
>
>     product     | expected | justified_product | equal
> ----------------+----------+-------------------+-------
>  2 mons 30 days | 3 mons   | 3 mons            | t
> (1 row)
>
>
That's viewing something via the mechanism that is incorrectly (technically
speaking) doing the work in the first place. It believes they are the same
- but they are clearly not when actually used.

select '1/1/2001'::date + (interval '3 month');
  ?column?
---------------------
 2001-04-01 00:00:00
(1 row)

vs

select '1/1/2001'::date + (interval '1.5 month' * 2);
  ?column?
---------------------
 2001-03-31 00:00:00
(1 row)

That's the flaw in this entire body of work - we keep taking fractional
amounts, doing round-offs, and then trying to add or multiply the pieces
back, ending up with weird floating-point-math-style errors. That's
not to complain about it - but we shouldn't be looking at edge cases with
things like 1 month * 1.234 when 1.5 months * 2 doesn't work properly.

John

P.S. Finally we have items like this

select '12/1/2001'::date + (interval '1.5 months' * 2);
  ?column?
---------------------
 2002-03-03 00:00:00
(1 row)

postgres=# select '1/1/2001'::date + (interval '1.5 months' * 2);
  ?column?
---------------------
 2001-03-31 00:00:00
(1 row)

Which only has a 28-day gap because of the length of February - clearly
this is not working quite right.


Re: Emit namespace in post-copy output

2021-07-28 Thread Daniel Gustafsson
I took a look at this and I agree with the reviewer that it's a good change.  The
output from multiple jobs in vacuumdb is clearly easier to parse with this,
since the initial LOG and later DETAIL can be interleaved with other relations
of the same name in other namespaces.

+   get_namespace_name(RelationGetNamespace(OldHeap)),

Since get_namespace_name() returns a palloced string, this will lead to a 2x
leak of the namespace length as opposed to the 1x of today.  While hardly a big
deal, it seems prudent to cap this by storing the returned string locally now
that we need it twice.
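
Roughly like this (a sketch of the shape only, not the exact messages):

char	   *nspname = get_namespace_name(RelationGetNamespace(OldHeap));

/* ... use nspname in both the initial errmsg and the later errdetail ... */

pfree(nspname);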

I've updated the patch with this, see the attached v2.  Barring objections I
will go ahead with this.

--
Daniel Gustafsson   https://vmware.com/



v2-0001-Emit-namespace-in-the-post-copy-errmsg.patch
Description: Binary data


Re: when the startup process doesn't (logging startup delays)

2021-07-28 Thread Robert Haas
On Wed, Jul 28, 2021 at 5:24 AM Nitin Jadhav
 wrote:
> I also agree that this is the better place to do it. Hence modified
> the patch accordingly. The condition "!AmStartupProcess()" is added to
> differentiate whether the call is done from a startup process or some
> other process. Actually StartupXLOG() gets called in 2 places. one in
> StartupProcessMain() and the other in InitPostgres(). As the logging
> of the startup progress is required only during the startup process
> and not in the other cases,

The InitPostgres() case occurs when the server is started in bootstrap
mode (during initdb) or in single-user mode (postgres --single). I do
not see any reason why we shouldn't produce progress messages in at
least the latter case. I suspect that someone who is in the rather
desperate scenario of having to use single-user mode would really like
to know how long the server is going to take to start up.

Perhaps during initdb we don't want messages, but I'm not sure that we
need to do anything about that here. None of the messages that the
server normally produces show up when you run initdb, so I guess they
are getting redirected to /dev/null or something.

So I don't think that using AmStartupProcess() for this purpose is right.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




fixing pg_basebackup tests for modern Windows/msys2

2021-07-28 Thread Andrew Dunstan

While looking into issues with fairywren and pg_basebackup tests, I
created a similar environment but with more modern Windows / msys2.
Before it even got to the test that failed on fairywren it failed the
first TAP test for a variety of reasons, all connected to
TestLib::perl2host.

First, this function is in some cases returning paths for directories
with trailing slashes and/or embedded double slashes.  Both of these can
cause problems, especially when written to a tablespace map file. Also,
the cygpath invocation is returning a path with backslashes whereas "pwd
-W" returns a path with forward slashes.

So the first attached patch rectifies these problems. It fixes issues
with doubled and trailing slashes and makes cygpath return a path with
forward slashes just like the non-cygpath branch.

However, there is another problem, which is that if called on a path
that includes a symlink, on the test platform I set up it actually
resolves that link rather than just following it. The end result is that
the use of a shorter path via a symlink is effectively defeated. I
haven't found any way to stop this behaviour.

The second patch therefore adjusts the test to avoid calling perl2host
on such a path. It just calls perl2host on the symlink's parent, and
thereafter uses that result.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

>From eee31e6fd8f183ae0d836dfc131537912979d954 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Mon, 26 Jul 2021 09:34:59 -0400
Subject: [PATCH 1/2] Make TestLib::perl2host more consistent and robust

Sometimes cygpath has been observed to return a path with a trailing
slash. That can cause problems.  Also, make "cygpath" usage
consistent with "pwd -W" with respect to the use of forward slashes.

Backpatch to release 14 where the current code was introduced.
---
 src/test/perl/TestLib.pm | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 15572abbea..cbab1587cc 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -304,6 +304,8 @@ except for the case of Perl=msys and host=mingw32.  The subject need not
 exist, but its parent or grandparent directory must exist unless cygpath is
 available.
 
+The returned path uses forward slashes but has no trailing slash.
+
 =cut
 
 sub perl2host
@@ -313,10 +315,11 @@ sub perl2host
 	if ($is_msys2)
 	{
 		# get absolute, windows type path
-		my $path = qx{cygpath -a -w "$subject"};
+		my $path = qx{cygpath -a -m "$subject"};
 		if (!$?)
 		{
 			chomp $path;
+			$path =~ s!/$!!;
 			return $path if $path;
 		}
 		# fall through if this didn't work.
@@ -342,6 +345,7 @@ sub perl2host
 	# this odd way of calling 'pwd -W' is the only way that seems to work.
 	my $dir = qx{sh -c "pwd -W"};
 	chomp $dir;
+	$dir =~ s!/$!!;
 	chdir $here;
 	return $dir . $leaf;
 }
-- 
2.25.4

>From c50c319df2fba24779861f410c5dda8303e80a28 Mon Sep 17 00:00:00 2001
From: Andrew Dunstan 
Date: Wed, 28 Jul 2021 07:29:38 -0400
Subject: [PATCH 2/2] Avoid calling TestLib::perl2host on a symlinked directory

Certain versions of msys2/Windows have been observed to resolve symlinks
in perl2host rather than just follow them. This defeats using a
symlinked shorter path to a longer path, and makes certain tests fail.
We therefore call perl2host on the parent directory of the symlink and
thereafter just use that result.

Apply to release 14 where the problem has been observed.
---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 74f8c2c739..bde31b3c03 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -238,11 +238,13 @@ $node->start;
 # to our physical temp location.  That way we can use shorter names
 # for the tablespace directories, which hopefully won't run afoul of
 # the 99 character length limit.
-my $shorter_tempdir = TestLib::tempdir_short . "/tempdir";
+my $sys_tempdir = TestLib::tempdir_short;
+my $real_sys_tempdir = TestLib::perl2host($sys_tempdir) . "/tempdir";
+my $shorter_tempdir =  $sys_tempdir . "/tempdir";
 dir_symlink "$tempdir", $shorter_tempdir;
 
 mkdir "$tempdir/tblspc1";
-my $realTsDir= TestLib::perl2host("$shorter_tempdir/tblspc1");
+my $realTsDir= "$real_sys_tempdir/tblspc1";
 my $real_tempdir = TestLib::perl2host($tempdir);
 $node->safe_psql('postgres',
 	"CREATE TABLESPACE tblspc1 LOCATION '$realTsDir';");
@@ -275,7 +277,7 @@ SKIP:
 
 	# Recover tablespace into a new directory (not where it was!)
 	my $repTsDir = "$tempdir/tblspc1replica";
-	my $realRepTsDir = TestLib::perl2host("$shorter_tempdir/tblspc1replica");
+	my $realRepTsDir = "$real_sys_tempdir/tblspc1replica";
 	mkdir $repTsDir;
 	TestLib::system_or_bail($tar, 'xf', $tblspc_tars[0], '-C', $repTsDir);
 
@@ -390,7 +392,7 @@ ok( -d "$te

Re: .ready and .done files considered harmful

2021-07-28 Thread Robert Haas
On Wed, Jul 28, 2021 at 6:48 AM Dipesh Pandit  wrote:
> As of now shared memory is not attached to the archiver. Archiver cannot
> access ThisTimeLineID or a flag available in shared memory.

If that is true, why are there functions PgArchShmemSize() and
PgArchShmemInit(), and how does this statement in PgArchiverMain()
manage not to core dump?

/*
 * Advertise our pgprocno so that backends can use our latch to wake us up
 * while we're sleeping.
 */
PgArch->pgprocno = MyProc->pgprocno;

I think what you are saying is true before v14, but not in v14 and master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Showing applied extended statistics in explain

2021-07-28 Thread Robert Haas
On Tue, Jul 27, 2021 at 4:50 PM Tom Lane  wrote:
> TBH I do not agree that this is a great idea.  I think it's inevitably
> exposing a lot of unstable internal planner details.

Well, that is a risk, but I think I like the alternative even less.
Imagine if we had a CREATE INDEX command but no way -- other than
running queries and noticing how long they seem to take -- to tell
whether it was being used. That would suck, a lot, and this seems to
me to be exactly the same. A user who creates a statistics object has
a legitimate interest in finding out whether that object is doing
anything to a given query that they happen to care about.

> I like even less
> the aspect that this means a lot of information has to be added to the
> finished plan in case it's needed for EXPLAIN.  Aside from the sheer
> cost of copying that data around, what happens if for example somebody
> drops a statistic object between the time of planning and the EXPLAIN?
> Are we going to start keeping locks on those objects for the lifetime
> of the plans?

I don't understand the premise of this question. We don't keep locks
on tables or indexes involved in a plan for the lifetime of a plan, or
on functions or any other kind of object either. We instead arrange to
invalidate the plan if those objects are modified or dropped. Why
would we not use the same approach here?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Failed transaction statistics to measure the logical replication progress

2021-07-28 Thread osumi.takami...@fujitsu.com
On Tuesday, July 27, 2021 3:59 PM Ajin Cherian  wrote:
> On Thu, Jul 8, 2021 at 4:55 PM osumi.takami...@fujitsu.com
>  wrote:
> 
> > Attached file is the POC patch for this.
> > Current design is to save failed stats data in the ReplicationSlot struct.
> > This is because after the error, I'm not able to access the ReorderBuffer
> object.
> > Thus, I chose the object where I can interact with at the
> ReplicationSlotRelease timing.
> 
> I think this is a good idea to capture the failed replication stats.
> But I'm wondering how you are deciding if the replication failed or not? Not 
> all
> cases of ReplicationSLotRelease are due to a failure. It could also be due to 
> a
> planned dropping of subscription or disable of subscription. I have not tested
> this but won't the failed stats be updated in this case as well? Is that 
> correct?
Yes, what you said is true. Currently, when I run DROP SUBSCRIPTION or
ALTER SUBSCRIPTION DISABLE, failed stats values are added
to pg_stat_replication_slots unintentionally, if they have some leftover values.
This is because all those commands, like a subscriber apply failure
caused by a duplication error, make the publisher receive an 'X' message
at ProcessRepliesIfAny() and go down the path that calls
ReplicationSlotRelease().

Also, other events such as a server stop call the same function in the end,
which leads to a situation where, after the server restarts,
the failed stats values catch up with the (successful) existing stats values.
Accordingly, I need to change the patch to handle those situations.
Thank you.


Best Regards,
Takamichi Osumi



Re: Reduce the number of special cases to build contrib modules on windows

2021-07-28 Thread David Rowley
On Wed, 28 Jul 2021 at 03:52, Dagfinn Ilmari Mannsåker
 wrote:
>
> Alvaro Herrera  writes:
> > I think using the return value of grep as a boolean is confusing.  It
> > seems more legible to compare to 0.  So instead of this:
> >
> > + if (! grep { $_ eq $ref} @{ $self->{references} })
> > + {
> > + push @{ $self->{references} }, $ref;
> > + }
> >
> > use something like:
> >
> > + if (grep { $_ eq $ref} @{ $self->{references} } == 0)
>
> I disagree.  Using grep in boolean context is perfectly idiomatic perl.
> What would be more idiomatic is List::Util::any, but that's not availble
> without upgrading List::Util from CPAN on Perls older than 5.20, so we
> can't use that.

Ok, if the grep stuff is ok as is with the boolean comparison then I'd
say 0002 and 0003 of the attached are ok to go.

I pushed the v9 0001 and 0005 patch after adjusting the AddFile($self,
...) to become $self->AddFile(...)

I've adjusted the attached 0001 patch (previously 0002) to define
LOWER_NODE in ltree.h as mentioned by Tom.

0004 still needs work.

Thanks for all the reviews.

David


v10-0001-Adjust-MSVC-build-scripts-to-parse-Makefiles-for.patch
Description: Binary data


v10-0002-Make-the-includes-field-an-array-in-MSVC-build-s.patch
Description: Binary data


v10-0003-Don-t-duplicate-references-and-libraries-in-MSVC.patch
Description: Binary data


v10-0004-Remove-some-special-cases-from-MSVC-build-script.patch
Description: Binary data


Re: RFC: Logging plan of the running query

2021-07-28 Thread torikoshia

On 2021-07-28 03:45, Pavel Stehule wrote:

On Tue, Jul 27, 2021 at 20:34, Fujii Masao wrote:


On 2021/07/09 14:05, torikoshia wrote:

On 2021-07-02 23:21, Bharath Rupireddy wrote:

On Tue, Jun 22, 2021 at 8:00 AM torikoshia

 wrote:

Updated the patch.


Thanks for the patch. Here are some comments on the v4 patch:


Thanks for your comments and suggestions!
I agree with you and updated the patch.

On Thu, Jul 1, 2021 at 3:34 PM Fujii Masao

 wrote:



DO $$
BEGIN
PERFORM pg_sleep(100);
END$$;

When I called pg_log_current_query_plan() to send the signal to
the backend executing the above query, I got the following log
message.

I think that this is not an expected message. I guess this issue
happened because the information about the query text and plan is
retrieved from ActivePortal. If this understanding is right, ISTM that
we should implement a new mechanism so that we can retrieve that
information even while a nested query is being executed.


I'm now working on this comment.


One idea is to define a new global pointer, e.g., "QueryDesc
*ActiveQueryDesc;". This global pointer is set to queryDesc in
ExecutorRun() (and maybe also ExecutorStart()), and is reset to NULL in
ExecutorEnd() and when an error is thrown. Then
ProcessLogCurrentPlanInterrupt() can get the plan of the currently
running query from that global pointer instead of ActivePortal, and log
it. Thoughts?


It cannot work as is - there can be a lot of nested queries, and at the
end you cannot simply reset the pointer to NULL; you have to restore the
pointer to the outer query.


Thanks for your comment!

I'm wondering if we can avoid this problem by saving the one-level-outer
QueryDesc in addition to the current one.

I'm going to try it.
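
To make the idea concrete, here is a minimal sketch (none of this is
committed code; ActiveQueryDesc is the global proposed above, and the
save/restore is the part I plan to try):

    /* Hypothetical global, valid while an executor run is in progress. */
    QueryDesc  *ActiveQueryDesc = NULL;

    void
    ExecutorRun(QueryDesc *queryDesc, ScanDirection direction,
                uint64 count, bool execute_once)
    {
        /* Remember the outer query's descriptor so that nesting unwinds. */
        QueryDesc  *save_ActiveQueryDesc = ActiveQueryDesc;

        ActiveQueryDesc = queryDesc;

        PG_TRY();
        {
            /* ... existing ExecutorRun() body ... */
        }
        PG_FINALLY();
        {
            /* Restore the outer descriptor instead of resetting to NULL. */
            ActiveQueryDesc = save_ActiveQueryDesc;
        }
        PG_END_TRY();
    }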

--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION




Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-28 Thread Amul Sul
On Wed, Jul 28, 2021 at 4:37 PM Amul Sul  wrote:
>
> On Wed, Jul 28, 2021 at 2:26 AM Robert Haas  wrote:
> >
> > On Fri, Jul 23, 2021 at 4:03 PM Robert Haas  wrote:
> > > My 0003 is where I see some lingering problems. It creates
> > > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > > need the xlogreader. But it doesn't really solve the problem of how
> > > checkpointer.c would be able to call this function with proper
> > > arguments. It is at least better in not needing two arguments to
> > > decide what to do, but how is checkpointer.c supposed to know what to
> > > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > > what to pass for EndOfLogTLI and EndOfLog?
> >
> > On further study, I found another problem: the way my patch set leaves
> > things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> > will not be correctly initialized in any process other than the
> > startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> > would just be skipped. Your 0001 seems to have the same problem. You
> > added Assert(AmStartupProcess()) to the inside of the if
> > (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> > Outside the startup process, ArchiveRecoveryRequested will always be
> > false, but the point is that the associated stuff should be done if
> > ArchiveRecoveryRequested would have been true in the startup process.
> > Both of our patch sets leave things in a state where that would never
> > happen, which is not good. Unless I'm missing something, it seems like
> > maybe you didn't test your patches to verify that, when the
> > XLogAcceptWrites() call comes from the checkpointer, all the same
> > things happen that would have happened had it been called from the
> > startup process. That would be a really good thing to have tested
> > before posting your patches.
> >
>
> My bad, I am extremely sorry about that. I usually do test my patches,
> but somehow I failed to test this change because I was busy manually
> testing the whole ASRO feature and hurried in posting the newest version.
>
> I will try to be more careful next time.
>

I was worried about how I could have missed that, and after thinking
more about it, I realized that the handling of ArchiveRecoveryRequested
is never going to be skipped in the startup process and never left for
the checkpointer to do later. That is the reason the assert was added
there.

When ArchiveRecoveryRequested is set, the server can no longer be in
WAL-prohibited mode; we implicitly change the state to WAL-permitted.
Here is the snippet from the 0003 patch:

@@ -6614,13 +6629,30 @@ StartupXLOG(void)
 				(errmsg("starting archive recovery")));
 	}
 
-	/*
-	 * Take ownership of the wakeup latch if we're going to sleep during
-	 * recovery.
-	 */
 	if (ArchiveRecoveryRequested)
+	{
+		/*
+		 * Take ownership of the wakeup latch if we're going to sleep during
+		 * recovery.
+		 */
 		OwnLatch(&XLogCtl->recoveryWakeupLatch);
 
+		/*
+		 * Since archive recovery is requested, we cannot be in a wal
+		 * prohibited state.
+		 */
+		if (ControlFile->wal_prohibited)
+		{
+			/* No need to hold ControlFileLock yet, we aren't up far enough */
+			ControlFile->wal_prohibited = false;
+			ControlFile->time = (pg_time_t) time(NULL);
+			UpdateControlFile();
+
+			ereport(LOG,
+					(errmsg("clearing WAL prohibition because the system is in archive recovery")));
+		}
+	}
+


> > As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> > several TLIs stored in XLogCtl. None of them seem to be precisely the
> > same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
> > enough. I found out pretty quickly through testing that replayEndTLI
> > isn't always valid -- it ends up 0 if we don't enter recovery. That's
> > not really a problem, though, because we only need it to be valid if
> > ArchiveRecoveryRequested. The code that initializes and updates it
> > seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> > = true will force InRecovery = true. So it looks to me like
> > replayEndTLI will always be initialized in the cases where we need a
> > value. It's not yet entirely clear to me if it has to have the same
> > value as EndOfLogTLI. I find this code comment quite mysterious:
> >
> > /*
> >  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
> >  * the end-of-log. It could be different from the timeline that EndOfLog
> >  * nominally belongs to, if there was a timeline switch in that segment,
> >  * and we were reading the old WAL from a segment belonging to a higher
> >  * timeline.
> >  */
> > EndOfLogTLI = xlogreader->seg.ws_tli;
> >
> > The thing is, if we were reading old WAL from a segment belonging to a
> > higher timeline, wouldn't we have switched to that new timeline?
>
> AFAIUC, by browsing the code, yes, we are switching to the new
> timeline.  Along with lastReplayedTLI, lastReplayedEndRecPtr is also
> the same as the

Re: perlcritic: prohibit map and grep in void context

2021-07-28 Thread Daniel Gustafsson
> On 28 Jul 2021, at 13:10, Andrew Dunstan  wrote:
> 
> On 7/27/21 12:06 PM, Dagfinn Ilmari Mannsåker wrote:
>> Hi hackers,
>> 
>> In the patches for improving the MSVC build process, I noticed a use of
>> `map` in void context.  This is considered bad form, and has a
>> perlcritic policy forbidding it:
>> https://metacpan.org/pod/Perl::Critic::Policy::BuiltinFunctions::ProhibitVoidMap.
>> 
>> Attached is a patch that increases severity of that and the
>> corresponding `grep` policy to 5 to enable it in our perlcritic policy,
>> and fixes the one use that had already snuck in.
> 
> Personally I'm OK with it, but previous attempts to enforce perlcritic
> policies have met with a less than warm reception, and we had to back
> off. Maybe this one will fare better.

I'm fine with increasing this policy, but I don't have strong feelings.  If we
feel the perlcritic policy change is too much, I would still prefer to go ahead
with the map rewrite part of the patch though.

--
Daniel Gustafsson   https://vmware.com/





Re: Out-of-memory error reports in libpq

2021-07-28 Thread Andrew Dunstan


On 7/27/21 6:40 PM, Tom Lane wrote:
> While cleaning out dead branches in my git repo, I came across an
> early draft of what eventually became commit ffa2e4670 ("In libpq,
> always append new error messages to conn->errorMessage").  I realized
> that it contained a good idea that had gotten lost on the way to that
> commit.  Namely, let's reduce all of the 60-or-so "out of memory"
> reports in libpq to calls to a common subroutine, and then let's teach
> the common subroutine a recovery strategy for the not-unlikely
> possibility that it fails to append the "out of memory" string to
> conn->errorMessage.  That recovery strategy of course is to reset the
> errorMessage buffer to empty, hopefully regaining some space.  We lose
> whatever we'd had in the buffer before, but we have a better chance of
> the "out of memory" message making its way to the user.
>
> The first half of that just saves a few hundred bytes of repetitive
> coding.  However, I think that the addition of recovery logic is
> important for robustness, because as things stand libpq may be
> worse off than before for OOM handling.  Before ffa2e4670, almost
> all of these call sites did printfPQExpBuffer(..., "out of memory").
> That would automatically clear the message buffer to empty, and
> thereby be sure to report the out-of-memory failure if at all
> possible.  Now we might fail to report the thing that the user
> really needs to know to make sense of what happened.
>
> Therefore, I feel like this was an oversight in ffa2e4670,
> and we ought to back-patch the attached into v14.
>
> cc'ing the RMT in case they wish to object.
>
>   


I'm honored you've confused me with Alvaro :-)

This seems sensible, and we certainly shouldn't be worse off than
before, so let's do it.

I'm fine with having two functions for call simplicity, but I don't feel
strongly about it.
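
For anyone skimming the thread, the recovery strategy could look roughly
like this (a sketch with an illustrative function name, not the actual
patch):

    /*
     * Append "out of memory" to the connection's error buffer.  If even
     * that fails because the buffer is broken, reset it to empty --
     * losing the earlier text, but regaining enough space to make the
     * OOM report itself visible to the user.
     */
    static void
    report_out_of_memory(PQExpBuffer errorMessage)
    {
        appendPQExpBufferStr(errorMessage, "out of memory\n");
        if (PQExpBufferBroken(errorMessage))
        {
            resetPQExpBuffer(errorMessage);
            appendPQExpBufferStr(errorMessage, "out of memory\n");
        }
    }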


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: perlcritic: prohibit map and grep in void context

2021-07-28 Thread Andrew Dunstan


On 7/27/21 12:06 PM, Dagfinn Ilmari Mannsåker wrote:
> Hi hackers,
>
> In the patches for improving the MSVC build process, I noticed a use of
> `map` in void context.  This is considered bad form, and has a
> perlcritic policy forbidding it:
> https://metacpan.org/pod/Perl::Critic::Policy::BuiltinFunctions::ProhibitVoidMap.
>
> Attached is a patch that increases severity of that and the
> corresponding `grep` policy to 5 to enable it in our perlcritic policy,
> and fixes the one use that had already snuck in.
>


Personally I'm OK with it, but previous attempts to enforce perlcritic
policies have met with a less than warm reception, and we had to back
off. Maybe this one will fare better.

I keep the buildfarm code perlcritic compliant down to severity 3 with a
handful of exceptions.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [Patch] ALTER SYSTEM READ ONLY

2021-07-28 Thread Amul Sul
On Wed, Jul 28, 2021 at 2:26 AM Robert Haas  wrote:
>
> On Fri, Jul 23, 2021 at 4:03 PM Robert Haas  wrote:
> > My 0003 is where I see some lingering problems. It creates
> > XLogAcceptWrites(), moves the appropriate stuff there, and doesn't
> > need the xlogreader. But it doesn't really solve the problem of how
> > checkpointer.c would be able to call this function with proper
> > arguments. It is at least better in not needing two arguments to
> > decide what to do, but how is checkpointer.c supposed to know what to
> > pass for xlogaction? Worse yet, how is checkpointer.c supposed to know
> > what to pass for EndOfLogTLI and EndOfLog?
>
> On further study, I found another problem: the way my patch set leaves
> things, XLogAcceptWrites() depends on ArchiveRecoveryRequested, which
> will not be correctly initialized in any process other than the
> startup process. So CleanupAfterArchiveRecovery(EndOfLogTLI, EndOfLog)
> would just be skipped. Your 0001 seems to have the same problem. You
> added Assert(AmStartupProcess()) to the inside of the if
> (ArchiveRecoveryRequested) block, but that doesn't fix anything.
> Outside the startup process, ArchiveRecoveryRequested will always be
> false, but the point is that the associated stuff should be done if
> ArchiveRecoveryRequested would have been true in the startup process.
> Both of our patch sets leave things in a state where that would never
> happen, which is not good. Unless I'm missing something, it seems like
> maybe you didn't test your patches to verify that, when the
> XLogAcceptWrites() call comes from the checkpointer, all the same
> things happen that would have happened had it been called from the
> startup process. That would be a really good thing to have tested
> before posting your patches.
>

My bad, I am extremely sorry about that. I usually do test my patches,
but somehow I failed to test this change because I was busy manually
testing the whole ASRO feature and hurried in posting the newest version.

I will try to be more careful next time.

> As far as EndOfLogTLI is concerned, there are, somewhat annoyingly,
> several TLIs stored in XLogCtl. None of them seem to be precisely the
> same thing as EndOfLogTLI, but I am hoping that replayEndTLI is close
> enough. I found out pretty quickly through testing that replayEndTLI
> isn't always valid -- it ends up 0 if we don't enter recovery. That's
> not really a problem, though, because we only need it to be valid if
> ArchiveRecoveryRequested. The code that initializes and updates it
> seems to run whenever InRecovery = true, and ArchiveRecoveryRequested
> = true will force InRecovery = true. So it looks to me like
> replayEndTLI will always be initialized in the cases where we need a
> value. It's not yet entirely clear to me if it has to have the same
> value as EndOfLogTLI. I find this code comment quite mysterious:
>
> /*
>  * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
>  * the end-of-log. It could be different from the timeline that EndOfLog
>  * nominally belongs to, if there was a timeline switch in that segment,
>  * and we were reading the old WAL from a segment belonging to a higher
>  * timeline.
>  */
> EndOfLogTLI = xlogreader->seg.ws_tli;
>
> The thing is, if we were reading old WAL from a segment belonging to a
> higher timeline, wouldn't we have switched to that new timeline?

AFAIUC, from browsing the code, yes, we are switching to the new
timeline.  And along with lastReplayedTLI, lastReplayedEndRecPtr matches
the EndOfLog that we need when ArchiveRecoveryRequested is true.

I went through the original commit 7cbee7c0a1db and the thread[1] but
didn't find any related discussion for that.

> Suppose we want WAL segment 246 from TLI 1, but we don't have that
> segment on TLI 1, only TLI 2. Well, as far as I know, for us to use
> the TLI 2 version, we'd need to have TLI 2 in the history of the
> recovery_target_timeline. And if that is the case, then we would have
> to replay through the record where the timeline changes. And if we do
> that, then the discrepancy postulated by the comment cannot still
> exist by the time we reach this code, because this code is only
> reached after we finish WAL redo. So I'm baffled as to how this can
> happen, but considering how many cases there are in this code, I sure
> can't promise that it doesn't. The fact that we have few tests for any
> of this doesn't help either.

I am not an expert in this area, but will try to spend some more time
on understanding and testing.

[1] postgr.es/m/555dd101.7080...@iki.fi

Regards,
Amul




Re: pg_receivewal starting position

2021-07-28 Thread Ronan Dunklau
On Wednesday, 28 July 2021 08:22:30 CEST, Kyotaro Horiguchi wrote:
> At Tue, 27 Jul 2021 07:50:39 +0200, Ronan Dunklau 
> wrote in
> > Hello,
> > 
> > I've noticed that pg_receivewal's logic for deciding which LSN to
> > start streaming at consists of:
> >   - looking up the latest WAL file in our destination folder, and
> >     resuming from there
> >   - if there isn't one, using the current flush location instead.
> > 
> > This behaviour surprised me when using it with a replication slot: I was
> > expecting it to start streaming at the last flushed location from the
> > replication slot instead. If you consider a backup tool which will take
> > pg_receivewal's output and transfer it somewhere else, using the
> > replication slot position would be the easiest way to ensure we don't
> > miss WAL files.
> > 
> > Does that make sense?
> > 
> > I don't know if it should be the default, toggled by a command line flag,
> > or if we should even let the user provide an LSN.
> 
> *I* think it is completely reasonable (or at least convenient, or less
> astonishing) that pg_receivewal starts from the restart_lsn of the
> replication slot to use.  The tool already decides the clean-start LSN
> in a somewhat unusual way. And it seems to me that the proposed
> behavior can be the default when -S is specified.
> 

As of now we can't get a replication slot's restart_lsn over a
replication connection, AFAIK.

This implies that the patch could require the user to specify a
maintenance-db parameter, and we would use that, if provided, to fetch
the replication slot info, or fall back to the previous behaviour. I
don't really like this approach, as having the behaviour change
depending on whether we supply a maintenance-db parameter is error-prone
for the user.

Another option would be to add a new replication command (for example
ACQUIRE_REPLICATION_SLOT) to set the replication slot as the current
one, and return some info about it (at least restart_lsn, for a physical
slot).

I don't see any reason not to make it work for logical replication connections 
/ slots, but it wouldn't be that useful since we can query the database in 
that case.

Acquiring the replication slot instead of just reading it would ensure
that no other process could start replication between the time we read
the restart_lsn and when we issue START_REPLICATION. START_REPLICATION
could then check whether we already have a replication slot, and ensure
it is the same one as the one we're trying to use.

From pg_receivewal's point of view, this would amount to the following
(see the sketch after this list):

 - check if we currently have WAL in the target directory.
   - if we do, proceed as currently done, by computing the start LSN and
     timeline from the last archived WAL
   - if we don't, and we have a slot, run ACQUIRE_REPLICATION_SLOT. Use
     the restart_lsn as the start LSN if there is one, and don't provide
     a timeline
   - if we still don't have a start LSN, fall back to using the current
     server WAL position, as is done today.
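
A rough sketch of that decision in C (GetSlotInformation() is a
hypothetical wrapper around the proposed ACQUIRE_REPLICATION_SLOT
command, and GetServerPos() a hypothetical wrapper around
IDENTIFY_SYSTEM; FindStreamingStart() is pg_receivewal's existing scan
of the target directory):

    static XLogRecPtr
    DecideStartPos(PGconn *conn, const char *slot_name, uint32 *tli)
    {
        /* 1. Resume from the newest WAL already in the target directory. */
        XLogRecPtr  startpos = FindStreamingStart(tli);

        if (startpos != InvalidXLogRecPtr)
            return startpos;

        /* 2. Otherwise, start from the slot's restart_lsn, if we have one. */
        if (slot_name != NULL &&
            GetSlotInformation(conn, slot_name, &startpos) &&
            startpos != InvalidXLogRecPtr)
            return startpos;        /* no timeline to provide in this case */

        /* 3. Fall back to the server's current WAL position. */
        return GetServerPos(conn, tli);
    }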

What do you think? Which information should we provide about the slot?


-- 
Ronan Dunklau






Re: Added schema level support for publication.

2021-07-28 Thread vignesh C
On Tue, Jul 27, 2021 at 5:11 AM Greg Nancarrow  wrote:
>
> On Mon, Jul 26, 2021 at 3:21 PM vignesh C  wrote:
> >
> > Thanks for the comment, this is modified in the v15 patch attached.
> >
>
> I have several minor review comments.
>
> (1) src/backend/catalog/objectaddress.c
> Should start comment sentences with an uppercase letter, for consistency.
>
> + /* fetch publication name and schema oid from input list */
>
> I also notice that some 1-sentence comments end with "." (full-stop)
> and others don't. It seems to alternate all over the place, and so is
> quite noticeable.
> Unfortunately, it already seems to be like this in much of the code
> that this patch patches.
> Ideally (at least my personal preference is) 1-sentence comments
> should not end with a ".".

Modified.

> (2) src/backend/catalog/pg_publication.c
> errdetail message
>
> I think the following should say "Temporary schemas ..." (since the
> existing error message for tables says "System tables cannot be added
> to publications.").
>
> +   errdetail("Temporary schema cannot be added to publications.")));
>

Modified.

> (3) src/backend/commands/publicationcmds.c
> PublicationAddTables
>
> I think that the Assert below is not correct (i.e. not restrictive
> enough).
> Although the condition passes, it is allowing, for example,
> stmt->for_all_tables==true if stmt->schemas==NIL, and that doesn't
> seem to be correct.
> I suggest the following change:
>
> BEFORE:
> + Assert(!stmt || !stmt->for_all_tables || !stmt->schemas);
> AFTER:
> + Assert(!stmt || (!stmt->for_all_tables && !stmt->schemas));
>

Modified.

> (4) src/backend/commands/publicationcmds.c
> PublicationAddSchemas
>
> Similarly, I think that the Assert below is not restrictive enough,
> and think it should be changed:
>
> BEFORE:
> + Assert(!stmt || !stmt->for_all_tables || !stmt->tables);
> AFTER:
> + Assert(!stmt || (!stmt->for_all_tables && !stmt->tables));
>

Modified.

> (5) src/bin/pg_dump/common.c
>
> Spelling mistake.
>
> BEFORE:
> + pg_log_info("reading publciation schemas");
> AFTER:
> + pg_log_info("reading publication schemas");

Modified.

Thanks for the comments; they are fixed in the v16 patch attached at [1].

[1] https://www.postgresql.org/message-id/CALDaNm2LgV5XcLF80rJ60NwnjKpZj%3D%3DLxJpO4W2TG2G5XmUtDA%40mail.gmail.com

Regards,
Vignesh


Re: .ready and .done files considered harmful

2021-07-28 Thread Dipesh Pandit
Hi,

> I don't think it's great that we're using up SIGINT for this purpose.
> There aren't that many signals available at the O/S level that we can
> use for our purposes, and we generally try to multiplex them at the
> application layer, e.g. by setting a latch or a flag in shared memory,
> rather than using a separate signal. Can we do something of that sort
> here? Or maybe we don't even need a signal. ThisTimeLineID is already
> visible in shared memory, so why not just have the archiver just check
> and see whether it's changed, say via a new accessor function
> GetCurrentTimeLineID()?

As of now, shared memory is not attached to the archiver, so the
archiver cannot access ThisTimeLineID or a flag in shared memory:

    if (strcmp(argv[1], "--forkbackend") == 0 ||
        strcmp(argv[1], "--forkavlauncher") == 0 ||
        strcmp(argv[1], "--forkavworker") == 0 ||
        strcmp(argv[1], "--forkboot") == 0 ||
        strncmp(argv[1], "--forkbgworker=", 15) == 0)
        PGSharedMemoryReAttach();
    else
        PGSharedMemoryNoReAttach();

This is why we thought of sending a notification to the archiver when
there is a timeline switch. Should we consider attaching shared memory
to the archiver process, or explore the notification mechanism further,
to avoid using SIGINT?
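
For reference, the accessor Robert suggests might look roughly like this
(a sketch only; it assumes the archiver can see XLogCtl, which is
exactly the open question, and the locking shown is illustrative):

    /* Sketch of a shared-memory accessor for the current timeline. */
    TimeLineID
    GetCurrentTimeLineID(void)
    {
        TimeLineID  tli;

        SpinLockAcquire(&XLogCtl->info_lck);
        tli = XLogCtl->ThisTimeLineID;
        SpinLockRelease(&XLogCtl->info_lck);

        return tli;
    }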

Thanks,
Dipesh


Re: a thinko in b676ac443b6

2021-07-28 Thread Tomas Vondra
On 7/28/21 3:15 AM, Amit Langote wrote:
> On Wed, Jul 28, 2021 at 1:07 AM Tomas Vondra
>  wrote:
>> On 7/27/21 4:28 AM, Amit Langote wrote:
>>> I think it can be incorrect to use the same TupleDesc for both the
>>> slots in ri_Slots (for ready-to-be-inserted tuples) and ri_PlanSlots
>>> (for subplan output tuples).  Especially if you consider what we did
>>> in 86dc90056df that was committed into v14.  In that commit, we
>>> changed the way a subplan under ModifyTable produces its output for an
>>> UPDATE statement.  Previously, it would produce a tuple matching the
>>> target table's TupleDesc exactly (plus any junk columns), but now it
>>> produces only a partial tuple containing the values for the changed
>>> columns.
>>>
>>> So it's better to revert to using planSlot->tts_tupleDescriptor for
>>> the slots in ri_PlanSlots.  Attached a patch to do so.
>>
>> Yeah, this seems like a clear mistake - thanks for noticing it! Clearly
>> no regression test triggered the issue, so I wonder what's the best way
>> to test it - any idea what the test would need to do?
> 
> Ah, I should've mentioned that this is only a problem if the original
> query is an UPDATE.  With v14, only INSERTs can use batching and the
> subplan does output a tuple matching the target table's TupleDesc in
> their case, so the code seems to work fine.
> 
> As I said, I noticed a problem when rebasing my patch to allow
> cross-partition UPDATEs to use batching for the inserts that are
> performed internally to implement such UPDATEs.  The exact problem I
> noticed is that the following Assert tts_virtual_copyslot() (via
> ExecCopySlot called with an ri_PlanSlots[] entry) failed:
> 
> Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);
> 
> srcdesc in this case is a slot in ri_PlanSlots[] initialized with the
> target table's TupleDesc (the "thinko") and dstslot is the slot that
> holds subplan's output tuple ('planSlot' passed to ExecInsert).  As I
> described in my previous email, dstslot's TupleDesc can be narrower
> than the target table's TupleDesc in the case of an UPDATE, so the
> Assert can fail in theory.
> 
>> I did some quick experiments with batched INSERTs with RETURNING clauses
>> and/or subplans, but I haven't succeeded in triggering the issue :-(
> 
> Yeah, no way to trigger this except UPDATEs.  It still seems like a
> good idea to fix this in v14.
> 

OK, thanks for the explanation. So it's benign in v14, but I agree it's
better to fix it there too. I'll get this sorted/pushed.
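
For anyone following along, the gist of the fix is this kind of
one-liner in the batching slot setup (a sketch, not the exact hunk):
build each ri_PlanSlots[] entry from the subplan's output descriptor
rather than the target table's:

    /* sketch: size the plan-slot copy after the subplan's output tuple */
    resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] =
        MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor,
                                 planSlot->tts_ops);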

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: when the startup process doesn't (logging startup delays)

2021-07-28 Thread Nitin Jadhav
> > > I saw that you fixed it by calling InitStartupProgress() after the
> > > walkdir() calls which do pre_sync_fname.  So then walkdir is calling
> > > LogStartupProgress(STARTUP_PROCESS_OP_FSYNC) even when it's not doing
> > > fsync, and then LogStartupProgress() is returning because
> > > !AmStartupProcess().
> >
> > I don't think walkdir() has any business calling LogStartupProgress()
> > at all. It's supposed to be a generic function, not one that is only
> > available to be called from the startup process, or has different
> > behavior there. From my point of view, the right thing is to put the
> > logging calls into the particular callbacks that SyncDataDirectory()
> > uses.
>
> You're right - this is better.

I also agree that this is the better place to do it, and have modified
the patch accordingly. The condition "!AmStartupProcess()" is added to
differentiate whether the call comes from the startup process or some
other process. StartupXLOG() gets called in two places: one in
StartupProcessMain() and the other in InitPostgres(). As logging of the
startup progress is required only during the startup process and not in
the other cases, I added the condition to confirm that the call is from
the startup process. I did not find any other way to differentiate;
kindly let me know if there is a better approach.

> > > Maybe I'm missing something here, but I don't understand the purpose
> > > of this. You can always combine two functions into one, but it's only
> > > worth doing if you end up with less code, which doesn't seem to be the
> > > case here.
> >
> >  4 files changed, 39 insertions(+), 71 deletions(-)
>
> Hmm. I don't completely hate that, but I don't love it either. I don't
> think the result is any easier to understand than the original, and in
> some ways it's worse. In particular, you've flattened the separate
> comments for each of the individual functions into a single-line
> comment that is more generic than the comments it replaced. You could
> fix that and you'd still be slightly ahead on LOC, but I'm not
> convinced that it's actually a real win.

As per my understanding, no changes are required for this, so I have
not made any.

Please find the updated patch attached. Kindly share your comments if any.

On Mon, Jul 26, 2021 at 10:41 PM Robert Haas  wrote:
>
> On Mon, Jul 26, 2021 at 11:30 AM Justin Pryzby  wrote:
> > > Maybe I'm missing something here, but I don't understand the purpose
> > > of this. You can always combine two functions into one, but it's only
> > > worth doing if you end up with less code, which doesn't seem to be the
> > > case here.
> >
> >  4 files changed, 39 insertions(+), 71 deletions(-)
>
> Hmm. I don't completely hate that, but I don't love it either. I don't
> think the result is any easier to understand than the original, and in
> some ways it's worse. In particular, you've flattened the separate
> comments for each of the individual functions into a single-line
> comment that is more generic than the comments it replaced. You could
> fix that and you'd still be slightly ahead on LOC, but I'm not
> convinced that it's actually a real win.
>
> > > ... but I'm not exactly sure how to get there from here. Having only
> > > LogStartupProgress() but having it do a giant if-statement to figure
> > > out whether we're mid-phase or end-of-phase does not seem like the
> > > right approach.
> >
> > I used a bool arg and negation to handle within a single switch.  Maybe it's
> > cleaner to use a separate enum value for each DONE, and set a local done 
> > flag.
>
> If we're going to go the route of combining the functions, I agree
> that a Boolean is the way to go; I think we have some existing
> precedent for 'bool finished' rather than 'bool done'.
>
> But I kind of wonder if we should have an enum in the first place. It
> feels like we've got code in a bunch of places that just exists to
> decide which enum value to use, and then code someplace else that
> turns around and decides which message to produce. That would be
> sensible if we were using the same enum values in lots of places, but
> that's not the case here. So suppose we just moved the messages to the
> places where we're now calling LogStartupProgress() or
> LogStartupProgressComplete()? So something like this:
>
> if (should_report_startup_progress())
>     ereport(LOG,
>             (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s",
>                     secs, (usecs / 10000), path)));
>
> Well, that doesn't quite work, because "secs" and "usecs" aren't going
> to exist in the caller. We can fix that easily enough: let's just make
> should_report_startup_progress take arguments long *secs, int *usecs,
> and bool done. Then the above becomes:
>
> if (should_report_startup_progress(&secs, &usecs, false))
>     ereport(LOG,
>             (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d

Re: needless complexity in StartupXLOG

2021-07-28 Thread Kyotaro Horiguchi
At Tue, 27 Jul 2021 11:03:14 -0400, Robert Haas  wrote 
in 
> On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi
>  wrote:
> > FWIW, by the way, I complained that the variable name "promoted" is a
> > bit confusing.  Its old name was fast_promoted, which means that fast
> > promotion is being *requested* and ongoing.  On the other hand the
> > current name "promoted" still means "(fast=non-fallback) promotion is
> > ongoing", so there was a conversation as follows.
> >
> > https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com
> 
> I agree - that variable name is also not great. I am open to making
> improvements in that area and in others that have been suggested on
> this thread, but my immediate goal is to figure out whether anyone
> objects to me committing the posted patch. If nobody comes up with a
> reason why it's a bad idea in the next few days, I'll plan to move
> ahead with it.

That's fine with me.

I still haven't found a way to lose the last checkpoint due to software
failure.  Repeated promotion without having new checkpoints is safe, as
expected: we don't remove WAL files unless a checkpoint completes, and a
checkpoint preserves segments back to the one containing its redo
point.
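
As a concrete illustration (segment names invented for the example): if
the latest checkpoint's redo pointer lies in segment
000000010000000000000005, recycling removes only segments older than
...05, so both the redo record and the checkpoint record pointing at it
stay readable across repeated promotions.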

In short, I'm for it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: proposal: enhancing plpgsql debug API - returns text value of variable content

2021-07-28 Thread Pavel Stehule
On Fri, 23 Jul 2021 at 10:47, Pavel Stehule
wrote:

>
>
> On Fri, 23 Jul 2021 at 10:30, Aleksander Alekseev <
> aleksan...@timescale.com> wrote:
>
>> Hi Pavel,
>>
>> > I know it. Attached patch try to fix this issue
>> >
>> > I merged you patch (thank you)
>>
>> Thanks! I did some more minor changes, mostly in the comments. See the
>> attached patch. Other than that it looks OK. I think it's Ready for
>> Committer now.
>>
>
> looks well,
>
> thank you very much
>
> Pavel
>

rebase

Pavel


>
>> --
>> Best regards,
>> Aleksander Alekseev
>>
>
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 14bbe12da5..e0e68a2592 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -4059,6 +4059,8 @@ plpgsql_estate_setup(PLpgSQL_execstate *estate,
 	{
 		(*plpgsql_plugin_ptr)->error_callback = plpgsql_exec_error_callback;
 		(*plpgsql_plugin_ptr)->assign_expr = exec_assign_expr;
+		(*plpgsql_plugin_ptr)->eval_datum = exec_eval_datum;
+		(*plpgsql_plugin_ptr)->cast_value = do_cast_value;
 
 		if ((*plpgsql_plugin_ptr)->func_setup)
 			((*plpgsql_plugin_ptr)->func_setup) (estate, func);
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 0f6a5b30b1..abe7d63f78 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -415,6 +415,7 @@ pl_block		: decl_sect K_BEGIN proc_sect exception_sect K_END opt_label
 		new->cmd_type	= PLPGSQL_STMT_BLOCK;
 		new->lineno		= plpgsql_location_to_lineno(@2);
 		new->stmtid		= ++plpgsql_curr_compile->nstatements;
+		new->ns			= plpgsql_ns_top();
 		new->label		= $1.label;
 		new->n_initvars = $1.n_initvars;
 		new->initvarnos = $1.initvarnos;
@@ -907,7 +908,8 @@ stmt_perform	: K_PERFORM
 		new = palloc0(sizeof(PLpgSQL_stmt_perform));
 		new->cmd_type = PLPGSQL_STMT_PERFORM;
 		new->lineno   = plpgsql_location_to_lineno(@1);
-		new->stmtid = ++plpgsql_curr_compile->nstatements;
+		new->stmtid   = ++plpgsql_curr_compile->nstatements;
+		new->ns		  = plpgsql_ns_top();
 		plpgsql_push_back_token(K_PERFORM);
 
 		/*
@@ -943,6 +945,7 @@ stmt_call		: K_CALL
 		new->cmd_type = PLPGSQL_STMT_CALL;
 		new->lineno = plpgsql_location_to_lineno(@1);
 		new->stmtid = ++plpgsql_curr_compile->nstatements;
+		new->ns = plpgsql_ns_top();
 		plpgsql_push_back_token(K_CALL);
 		new->expr = read_sql_stmt();
 		new->is_call = true;
@@ -962,6 +965,7 @@ stmt_call		: K_CALL
 		new->cmd_type = PLPGSQL_STMT_CALL;
 		new->lineno = plpgsql_location_to_lineno(@1);
 		new->stmtid = ++plpgsql_curr_compile->nstatements;
+		new->ns = plpgsql_ns_top();
 		plpgsql_push_back_token(K_DO);
 		new->expr = read_sql_stmt();
 		new->is_call = false;
@@ -1000,7 +1004,8 @@ stmt_assign		: T_DATUM
 		new = palloc0(sizeof(PLpgSQL_stmt_assign));
 		new->cmd_type = PLPGSQL_STMT_ASSIGN;
 		new->lineno   = plpgsql_location_to_lineno(@1);
-		new->stmtid = ++plpgsql_curr_compile->nstatements;
+		new->stmtid   = ++plpgsql_curr_compile->nstatements;
+		new->ns		  = plpgsql_ns_top();
 		new->varno = $1.datum->dno;
 		/* Push back the head name to include it in the stmt */
 		plpgsql_push_back_token(T_DATUM);
@@ -1022,6 +1027,7 @@ stmt_getdiag	: K_GET getdiag_area_opt K_DIAGNOSTICS getdiag_list ';'
 		new->cmd_type = PLPGSQL_STMT_GETDIAG;
 		new->lineno   = plpgsql_location_to_lineno(@1);
 		new->stmtid	  = ++plpgsql_curr_compile->nstatements;
+		new->ns		  = plpgsql_ns_top();
 		new->is_stacked = $2;
 		new->diag_items = $4;
 
@@ -1194,6 +1200,7 @@ stmt_if			: K_IF expr_until_then proc_sect stmt_elsifs stmt_else K_END K_IF ';'
 		new->cmd_type	= PLPGSQL_STMT_IF;
 		new->lineno		= plpgsql_location_to_lineno(@1);
 		new->stmtid		= ++plpgsql_curr_compile->nstatements;
+		new->ns			= plpgsql_ns_top();
 		new->cond		= $2;
 		new->then_body	= $3;
 		new->elsif_list = $4;
@@ -1299,6 +1306,7 @@ stmt_loop		: opt_loop_label K_LOOP loop_body
 		new->cmd_type = PLPGSQL_STMT_LOOP;
 		new->lineno   = plpgsql_location_to_lineno(@2);
 		new->stmtid   = ++plpgsql_curr_compile->nstatements;
+		new->ns		  = plpgsql_ns_top();
 		new->label	  = $1;
 		new->body	  = $3.stmts;
 
@@ -1317,6 +1325,7 @@ stmt_while		: opt_loop_label K_WHILE expr_until_loop loop_body
 		new->cmd_type = PLPGSQL_STMT_WHILE;
 		new->lineno   = plpgsql_location_to_lineno(@2);
 		new->stmtid	  = ++plpgsql_curr_compile->nstatements;
+		new->ns		  = plpgsql_ns_top();
 		new->label	  = $1;
 		new->cond	  = $3;
 		new->body	  = $4.stmts;
@@ -1381,6 +1390,7 @@ for_control		: for_variable K_IN
 			new = palloc0(sizeof(PLpgSQL_stmt_dynfors));
 			new->cmd_type = PLPGSQL_STMT_DYNFORS;
 			new->stmtid	  = ++plpgsql_curr_compile->nstatements;
+			new->ns		  = plpgsql_ns_top();
 			if ($1.row)
 			{
 new->var = (P

Re: Have I found an interval arithmetic bug?

2021-07-28 Thread Dean Rasheed
On Wed, 28 Jul 2021 at 00:08, John W Higgins  wrote:
>
> It's nice to envision all forms of fancy calculations. But the fact is that
>
> '1.5 month'::interval * 2 != '3 month'::interval
>

That's not exactly true. Even without the patch:

SELECT '1.5 month'::interval * 2 AS product,
   '3 month'::interval AS expected,
   justify_interval('1.5 month'::interval * 2) AS justified_product,
   '1.5 month'::interval * 2 = '3 month'::interval AS equal;

    product     | expected | justified_product | equal
----------------+----------+-------------------+-------
 2 mons 30 days | 3 mons   | 3 mons            | t
(1 row)

So it's equal even without calling justify_interval() on the result.

FWIW, I remain of the opinion that the interval literal code should
just spill down to lower units in all cases, just like the
multiplication and division code, so that the results are consistent
(barring floating point rounding errors) and explainable.
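
To make that concrete: with full spill-down, '1.5 month' is stored as
1 mon 15 days (the fractional 0.5 month becoming 0.5 * 30 days), so
doubling it gives 2 mons 30 days, which is exactly what the
multiplication produced in the example above. Literals and arithmetic
then agree by construction.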

Regards,
Dean




Problem about postponing gathering partial paths for topmost scan/join rel

2021-07-28 Thread Richard Guo
Hi all,

In commit 3f90ec85 we are trying to postpone gathering partial paths for
topmost scan/join rel until we know the final targetlist, in order to
allow more accurate costing of parallel paths. We do this by the
following code snippet in standard_join_search:

+			/*
+			 * Except for the topmost scan/join rel, consider gathering
+			 * partial paths.  We'll do the same for the topmost scan/join rel
+			 * once we know the final targetlist (see grouping_planner).
+			 */
+			if (lev < levels_needed)
+				generate_gather_paths(root, rel, false);

This change may cause a problem if the joinlist contains sub-joinlist
nodes, in which case 'lev == levels_needed' does not necessarily imply
that we are at the topmost scan/join rel of the whole query; it may only
be the topmost scan/join rel of a subproblem. In that case we would miss
the Gather paths for that subproblem. This can be illustrated with the
query below:

create table foo(i int, j int);
insert into foo select i, i from generate_series(1,5)i;
analyze foo;

set max_parallel_workers_per_gather to 4;
set parallel_setup_cost to 0;
set parallel_tuple_cost to 0;
set min_parallel_table_scan_size to 0;

# explain (costs off) select * from foo a join foo b on a.i = b.i full join
foo c on b.i = c.i;
                 QUERY PLAN
---------------------------------------------
 Hash Full Join
   Hash Cond: (b.i = c.i)
   ->  Hash Join
 Hash Cond: (a.i = b.i)
 ->  Gather
   Workers Planned: 4
   ->  Parallel Seq Scan on foo a
 ->  Hash
   ->  Gather
 Workers Planned: 4
 ->  Parallel Seq Scan on foo b
   ->  Hash
 ->  Gather
   Workers Planned: 4
   ->  Parallel Seq Scan on foo c
(15 rows)

Please note how we do the join for rel a and b. We run Gather above the
parallel scan and then do the join above the Gather.

These two base rels are grouped in a subproblem because of the FULL
JOIN. And due to the mentioned code change, we are unable to gather
partial paths for their joinrel.

If we can somehow fix this problem, then we would be able to do better
planning by running parallel join first and then doing Gather above the
join.

   ->  Gather
 Workers Planned: 4
 ->  Parallel Hash Join
   Hash Cond: (a.i = b.i)
   ->  Parallel Seq Scan on foo a
   ->  Parallel Hash
 ->  Parallel Seq Scan on foo b

To fix this problem, I'm thinking we can leverage 'root->all_baserels'
to tell if we are at the topmost scan/join rel, something like:

--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3041,7 +3041,7 @@ standard_join_search(PlannerInfo *root, int levels_needed, List *initial_rels)
 				 * partial paths.  We'll do the same for the topmost scan/join rel
 				 * once we know the final targetlist (see grouping_planner).
 				 */
-				if (lev < levels_needed)
+				if (!bms_equal(rel->relids, root->all_baserels))
 					generate_useful_gather_paths(root, rel, false);
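
The idea being that root->all_baserels contains every base relation in
the query, so the bms_equal() test succeeds only for the join rel that
covers all of them, i.e. the true topmost scan/join rel, no matter how
the joinlist was split into subproblems.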


Any thoughts?

Thanks
Richard


Re: proposal: possibility to read dumped table's name from file

2021-07-28 Thread Pavel Stehule
Hi

On Tue, 13 Jul 2021 at 1:16, Tom Lane  wrote:

> Alvaro Herrera  writes:
> > [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
> > plus lines starting with # are comments, seems plenty.  Any line not
> > following that format would cause an error to be thrown.
>
> I'd like to see some kind of keyword on each line, so that we could extend
> the command set by adding new keywords.  As this stands, I fear we'd end
> up using random punctuation characters in place of [+-], which seems
> pretty horrid from a readability standpoint.
>
> I think that this file format should be designed with an eye to allowing
> every, or at least most, pg_dump options to be written in the file rather
> than on the command line.  I don't say we have to *implement* that right
> now; but if the format spec is incapable of being extended to meet
> requests like that one, I think we'll regret it.  This line of thought
> suggests that the initial commands ought to match the existing
> include/exclude switches, at least approximately.
>
> Hence I suggest
>
> include table PATTERN
> exclude table PATTERN
>
> which ends up being the above but with words not [+-].
>

Here is an updated implementation of the filter file that implements the
syntax you proposed.

Regards

Pavel



> regards, tom lane
>
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 7682226b99..d0459b385e 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -789,6 +789,56 @@ PostgreSQL documentation
   
  
 
+ 
+  --filter=filename
+  
+   
+Read object filters from the specified file.
+If you use "-" as a filename, the filters are read from stdin.
+The lines of this file must have the following format:
+
+(include|exclude) (table|schema|foreign_data|data) objectname
+
+   
+
+   
+The first keyword specifies whether the object is to be included
+or excluded, and the second keyword specifies the type of object
+to be filtered:
+table (table),
+schema (schema),
+foreign_data (foreign server),
+data (table data).
+   
+
+   
+With the following filter file, the dump would include table
+mytable1 and data from foreign tables of
+some_foreign_server foreign server, but exclude data
+from table mytable2.
+
+include table mytable1
+include foreign_data some_foreign_server
+exclude table mytable2
+
+   
+
+   
+Lines starting with the symbol # are ignored.
+No whitespace (spaces, tabs) is allowed before the #. Such
+lines can be used for comments or notes. Empty lines are ignored too.
+   
+
+   
+The --filter option works just like the other
+options to include or exclude tables, schemas, table data, or foreign
+tables, and both forms may be combined.  Note that there are no options
+to exclude a specific foreign table or to include a specific table's
+data.
+   
+  
+ 
+
  
   --if-exists
   
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 90ac445bcd..ba4c425ee6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -55,10 +55,12 @@
 #include "catalog/pg_trigger_d.h"
 #include "catalog/pg_type_d.h"
 #include "common/connect.h"
+#include "common/string.h"
 #include "dumputils.h"
 #include "fe_utils/option_utils.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
+#include "lib/stringinfo.h"
 #include "libpq/libpq-fs.h"
 #include "parallel.h"
 #include "pg_backup_db.h"
@@ -308,7 +310,7 @@ static void appendReloptionsArrayAH(PQExpBuffer buffer, const char *reloptions,
 static char *get_synchronized_snapshot(Archive *fout);
 static void setupDumpWorker(Archive *AHX);
 static TableInfo *getRootTableInfo(const TableInfo *tbinfo);
-
+static void read_patterns_from_file(char *filename, DumpOptions *dopt);
 
 int
 main(int argc, char **argv)
@@ -380,6 +382,7 @@ main(int argc, char **argv)
 		{"enable-row-security", no_argument, &dopt.enable_row_security, 1},
 		{"exclude-table-data", required_argument, NULL, 4},
 		{"extra-float-digits", required_argument, NULL, 8},
+		{"filter", required_argument, NULL, 12},
 		{"if-exists", no_argument, &dopt.if_exists, 1},
 		{"inserts", no_argument, NULL, 9},
 		{"lock-wait-timeout", required_argument, NULL, 2},
@@ -613,6 +616,10 @@ main(int argc, char **argv)
 		  optarg);
 break;
 
+			case 12:			/* filter implementation */
+read_patterns_from_file(optarg, &dopt);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -1038,6 +1045,8 @@ help(const char *progname)
 			 "   access to)\n"));
 	printf(_("  --exclude-table-data=PATTERN do NOT dump data for the specified table(s)\n"));
 	printf(_(" 

Re: Fix around conn_duration in pgbench

2021-07-28 Thread Yugo NAGATA
Hello Fujii-san,

On Wed, 28 Jul 2021 00:20:21 +0900
Fujii Masao  wrote:

> 
> 
> On 2021/07/27 11:02, Yugo NAGATA wrote:
> > Hello Fujii-san,
> > 
> > Thank you for looking at it.
> > 
> > On Tue, 27 Jul 2021 03:04:35 +0900
> > Fujii Masao  wrote:
 
> +	 * Per-thread last disconnection time is not measured because it
> +	 * is already done when the transaction successfully finished.
> +	 * Also, we don't need it when the thread is aborted because we
> +	 * can't report complete results anyway in such cases.
> 
> What about commenting a bit more explicitly like the following?
> 
> 
> In CSTATE_FINISHED state, this disconnect_all() is a no-op under
> -C/--connect because all the connections that this thread established
> should have already been closed at the end of transactions. So we don't
> need to measure the disconnection delays here.
> 
> In CSTATE_ABORTED state, the measurement is no longer necessary because
> we cannot report complete results anyway in this case.
> 

Thank you for the suggestion. I updated the comment. 
 
> >   
> >> -  /* no connection delay to record */
> >> -  thread->conn_duration = 0;
> >> +  /* connection delay is measured globally between the barriers */
> >>
> >> This comment is really correct? I was thinking that the measurement is not 
> >> necessary here because this is the case where -C option is not specified.
> > 
> > This comment means that, when -C is not specified, the connection delay is
> > measured between the barrier point where the benchmark starts
> > 
> >   /* READY */
> >   THREAD_BARRIER_WAIT(&barrier);
> > 
> > and the barrier point where all the thread finish making initial 
> > connections.
> > 
> >   /* GO */
> >   THREAD_BARRIER_WAIT(&barrier);
> 
> Ok, so you're commenting about the initial connection delay that's
> measured when -C is not specified. But I'm not sure if this comment
> here is really helpful. Seem rather confusing??

Ok. I removed this comment.


> I found another disconnect_all().
> 
>   /* XXX should this be connection time? */
>   disconnect_all(state, nclients);
> 
> The measurement is also not necessary here.
> So the above comment should be removed or updated?

I think this disconnect_all() will be a no-op because all connections
should already be closed in threadRun(), but I left it in just to be
sure that connections are all cleaned up. I updated the comment to
explain the above.

I attached the updated patch. Could you please look at this?

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 364b5a2e47..bf9649251b 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3545,8 +3545,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 if (is_connect)
 {
+	pg_time_usec_t start = now;
+
+	pg_time_now_lazy(&start);
 	finishCon(st);
-	now = 0;
+	now = pg_time_now();
+	thread->conn_duration += now - start;
 }
 
 if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3570,6 +3574,17 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
  */
 			case CSTATE_ABORTED:
 			case CSTATE_FINISHED:
+/*
+ * In CSTATE_FINISHED state, this finishCon() is a no-op
+ * under -C/--connect because all the connections that this
+ * thread established should have already been closed at the end
+ * of transactions. So we don't need to measure the disconnection
+ * delays here.
+ *
+ * In CSTATE_ABORTED state, the measurement is no longer
+ * necessary because we cannot report complete results anyway
+ * in this case.
+ */
 finishCon(st);
 return;
 		}
@@ -6550,7 +6565,11 @@ main(int argc, char **argv)
 			bench_start = thread->bench_start;
 	}
 
-	/* XXX should this be connection time? */
+	/*
+	 * All connections should be already closed in threadRun(), so this
+	 * disconnect_all() will be a no-op, but clean up the connections just
+	 * to be sure. We don't need to measure the disconnection delays here.
+	 */
 	disconnect_all(state, nclients);
 
 	/*
@@ -6615,6 +6634,7 @@ threadRun(void *arg)
 
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
+	thread->conn_duration = 0;
 	last_report = thread_start;
 	next_report = last_report + (int64) 100 * progress;
 
@@ -6638,14 +6658,6 @@ threadRun(void *arg)
 goto done;
 			}
 		}
-
-		/* compute connection delay */
-		thread->conn_duration = pg_time_now() - thread->started_time;
-	}
-	else
-	{
-		/* no connection delay to record */
-		thread->conn_duration = 0;
 	}
 
 	/* GO */
@@ -6846,9 +6858,7 @@ threadRun(void *arg)
 	}
 
 done:
-	start = pg_time_now();
 	disconnect_all(state, nstate);
-	thread->conn_dur