Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Tom Lane
Amul Sul  writes:
> On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
>> static inline struct SMgrRelationData *
>> RelationGetSmgr(Relation rel)
>> {
>> if (unlikely(rel->rd_smgr == NULL))
>> RelationOpenSmgr(rel);
>> return rel->rd_smgr;
>> }

> A quick question: Can't it be a macro instead of an inline function
> like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue.  We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.
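
For illustration, a macro version would have to look roughly like the sketch
below (not proposed code).  Since RelationOpenSmgr() expands to a
do { ... } while (0) statement, its body has to be open-coded into the
expression, and "rel" ends up being evaluated several times:

    /* Sketch only: shows the hazards of a macro version. */
    #define RelationGetSmgr(rel) \
        (((rel)->rd_smgr == NULL ? \
          smgrsetowner(&(rel)->rd_smgr, \
                       smgropen((rel)->rd_node, (rel)->rd_backend)) : \
          (void) 0), \
         (rel)->rd_smgr)

A call such as RelationGetSmgr(some_function(...)) would then evaluate its
argument more than once, which is exactly the multiple-evaluation hazard.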

regards, tom lane




Re: Proposal: Save user's original authenticated identity for logging

2021-03-24 Thread Michael Paquier
On Wed, Mar 24, 2021 at 04:45:35PM +, Jacob Champion wrote:
> With low-coverage test suites, I think it's useful to allow as little
> strange behavior as possible -- in this case, printing authorization
> before authentication could signal a serious bug -- but I don't feel
> too strongly about it.

I got to look at the DN patch yesterday, so now's the turn of this
one.  Nice work.

+ * Sets the authenticated identity for the current user. The provided string
+ * will be copied into the TopMemoryContext. The ID will be logged if
+ * log_connections is enabled.
[...]
+   port->authn_id = MemoryContextStrdup(TopMemoryContext, id);
It may not be obvious that the field is copied into TopMemoryContext
because the Port requires that.
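
Something along these lines in a comment would make it explicit (the wording
is only a suggestion, not from the patch):

    /*
     * The string is duplicated into TopMemoryContext because the Port, and
     * hence this field, must survive until the end of the session.
     */
    port->authn_id = MemoryContextStrdup(TopMemoryContext, id);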

+$node->stop('fast');
+my $log_contents = slurp_file($log);
Like 11e1577, let's just truncate the log files in all those tests.

+   if (auth_method < 0 || USER_AUTH_LAST < auth_method)
+   {
+   Assert((0 <= auth_method) && (auth_method <= USER_AUTH_LAST));
What's the point of having the check and the assertion?  NULL does not
really seem like a good default here as this should never really
happen.  Wouldn't a FATAL actually be safer?
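
As a purely illustrative sketch of that suggestion (not part of the posted
patch), the out-of-range case could be reported outright instead of being
asserted and silently defaulted:

    /* Illustrative only; names follow the quoted hunk. */
    if (auth_method < 0 || auth_method > USER_AUTH_LAST)
        ereport(FATAL,
                (errmsg_internal("unrecognized authentication method: %d",
                                 auth_method)));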

+like(
+   $log_contents,
+   qr/connection authenticated: identity="ssltestuser"
method=scram-sha-256/,
+   "Basic SCRAM sets the username as the authenticated identity");
+
+$node->start;
It looks wrong to me to include in the SSL tests some checks related
to SCRAM authentication.  This should remain in 001_password.pl, as of
src/test/authentication/.

port->gss->princ = MemoryContextStrdup(TopMemoryContext, port->gbuf.value);
+   set_authn_id(port, gbuf.value);
I don't think that this position is right for GSSAPI.  Shouldn't this
be placed at the end of pg_GSS_checkauth() and only if the status is
OK?

-   ret = check_usermap(port->hba->usermap, port->user_name, peer_user, false);
-
-   pfree(peer_user);
+   ret = check_usermap(port->hba->usermap, port->user_name, port->authn_id, 
false);
I would also put this one after checking the usermap for peer.

+   /*
+* We have all of the information necessary to construct the authenticated
+* identity.
+*/
+   if (port->hba->compat_realm)
+   {
+   /* SAM-compatible format. */
+   authn_id = psprintf("%s\\%s", domainname, accountname);
+   }
+   else
+   {
+   /* Kerberos principal format. */
+   authn_id = psprintf("%s@%s", accountname, domainname);
+   }
+
+   set_authn_id(port, authn_id);
+   pfree(authn_id);
For SSPI, I think that this should be moved down once we are sure that
there is no error and that pg_SSPI_recvauth() reports STATUS_OK to the
caller.  There is a similar issue with CheckCertAuth(), and
set_authn_id() is documented such that it should be called only when we
are sure that authentication succeeded.

Reading through the thread, the consensus is to add the identity
information with log_connections.  One question I have is that if we
just log the information with log_connections, there is no real reason
to add this information to the Port, except the potential addition of
some system function, a superuser-only column in pg_stat_activity or
to allow extensions to access this information.  I am actually in
favor of keeping this information in the Port for those pluggability
reasons.  How do others feel about that?
--
Michael


signature.asc
Description: PGP signature


Decoding speculative insert with toast leaks memory

2021-03-24 Thread Ashutosh Bapat
Hi All,
We saw OOM in a system where the WAL sender consumed gigabytes of memory
which was never released. Upon investigation, we found out that there
were many ReorderBufferToastHash memory contexts linked to
ReorderBuffer context, together consuming gigs of memory. They were
running INSERT ... ON CONFLICT ... among other things. A similar report is
at [1].

I could reproduce a memory leak in the WAL sender using the following steps:
Session 1
postgres=# create table t_toast (a int primary key, b text);
postgres=# CREATE PUBLICATION dbz_minimal_publication FOR TABLE public.t_toast;

Terminal 4
$ pg_recvlogical -d postgres --slot pgoutput_minimal_test_slot
--create-slot -P pgoutput
$ pg_recvlogical -d postgres --slot pgoutput_minimal_test_slot --start
-o proto_version=1 -o publication_names='dbz_minimal_publication' -f
/dev/null

Session 1
postgres=# select * from pg_replication_slots ;
 slot_name           | pgoutput_minimal_test_slot
 plugin              | pgoutput
 slot_type           | logical
 datoid              | 12402
 database            | postgres
 temporary           | f
 active              | f
 active_pid          |
 xmin                |
 catalog_xmin        | 570
 restart_lsn         | 0/15FFFD0
 confirmed_flush_lsn | 0/1600020

postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 10)) ON CONFLICT (a) DO NOTHING;
INSERT 0 1

Session 2 (xid = 571)
postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 10)) ON CONFLICT (a) DO NOTHING;

Session 3 (xid = 572)
postgres=# begin;
postgres=# insert into t_toast values (500, repeat('something' ||
txid_current()::text, 10)) ON CONFLICT (a) DO NOTHING;

Session 1 (this session doesn't modify the table but is essential for
speculative insert to happen.)
postgres=# rollback;

Session 2 and 3 in the order in which you get control back (in my case
session 2 with xid = 571)
INSERT 0 1
postgres=# commit;
COMMIT

other session (in my case session 3 with xid = 572)
INSERT 0 0
postgres=# commit;
COMMIT

With the attached patch, we see the following in the server logs:
2021-03-25 09:57:20.469 IST [12424] LOG:  starting logical decoding
for slot "pgoutput_minimal_test_slot"
2021-03-25 09:57:20.469 IST [12424] DETAIL:  Streaming transactions
committing after 0/1600020, reading WAL from 0/15FFFD0.
2021-03-25 09:57:20.469 IST [12424] LOG:  logical decoding found
consistent point at 0/15FFFD0
2021-03-25 09:57:20.469 IST [12424] DETAIL:  There are no running transactions.
2021-03-25 09:59:45.494 IST [12424] LOG:  initializing hash table for
transaction 571
2021-03-25 09:59:45.494 IST [12424] LOG:  speculative insert
encountered in transaction 571
2021-03-25 09:59:45.494 IST [12424] LOG:  speculative insert confirmed
in transaction 571
2021-03-25 09:59:45.504 IST [12424] LOG:  destroying toast hash for
transaction 571
2021-03-25 09:59:50.806 IST [12424] LOG:  initializing hash table for
transaction 572
2021-03-25 09:59:50.806 IST [12424] LOG:  speculative insert
encountered in transaction 572
2021-03-25 09:59:50.806 IST [12424] LOG:  toast hash for transaction
572 is not cleared

Observe that the toast hash was cleaned up for transaction 571, which
successfully inserted the row, but was not cleaned up for transaction
572, which performed DO NOTHING instead of an INSERT.

Here's the sequence of events which leads to the memory leak in the WAL sender:
1. Transaction 571 performs a speculative INSERT, which is decoded as a
toast insert followed by a speculative insert of the row.
2. Decoding the toast tuple causes the toast hash to be created.
3. The speculative insert is ignored while decoding.
4. The speculative insert is confirmed and decoded as a normal INSERT, also
destroying the toast hash.
5. Transaction 572 performs a speculative insert, which is decoded as a
toast insert followed by a speculative insert of the row.
6. Decoding the toast tuple causes the toast hash to be created.
7. The speculative insert is ignored while decoding.
... The speculative INSERT is never confirmed, and thus the toast hash is
never destroyed, leaking memory.
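
For illustration only (this is not the attached patch), the missing cleanup is
of this kind: when a transaction's changes have all been replayed but the
speculative insert was never confirmed, the toast hash still has to be
destroyed explicitly.  Field and function names below follow reorderbuffer.c:

    /* Sketch: free a leftover toast hash once the transaction is finished. */
    if (txn->toast_hash != NULL)
        ReorderBufferToastReset(rb, txn);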

In the memory context dump we see as many ReorderBufferToastHash contexts as
the number of times the above sequence is repeated.
TopMemoryContext: 1279640 total in 7 blocks; 23304 free (17 chunks);
1256336 used
...
Replication command context: 32768 total in 3 blocks; 10952 free
(9 chunks); 21816 used
 ...
ReorderBuffer: 8192 total in 1 blocks; 7656 free (7 chunks); 536 used
  ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used
  ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used
  ReorderBufferToastHash: 8192 total in 1 blocks; 2056 free (0
chunks); 6136 used


The relevant code is all in ReorderBufferCommit() in cases
REORDER_BUFFER_CHANGE_INSERT,
REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT and

Re: In-placre persistance change of a relation

2021-03-24 Thread Kyotaro Horiguchi
At Thu, 25 Mar 2021 14:08:05 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> (I'm not sure when the subject was broken..)
> 
> At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Commit bea449c635 conflicts with this on the change of the definition
> > of DropRelFileNodeBuffers. The change simplified this patch by a bit:p
> 
> In this version, I got rid of the "CLEANUP FORK"s, and added a new
> system "Smgr marks".  The mark files have the name of the
> corresponding fork file followed by ".u" (which means "Uncommitted").
> An "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM
> and an uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
> in the previous version.
> 
> I noticed that the previous version of the patch still leaves an
> orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
> before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
> removed at rollback.  In this version the responsibility to remove the
> mark files is moved to SyncPostCheckpoint, where main fork files are
> actually removed.

For the record, I noticed that basebackup could be confused by the
mark files, but I haven't looked into that yet.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: In-placre persistance change of a relation

2021-03-24 Thread Kyotaro Horiguchi
(I'm not sure when the subject was broken..)

At Thu, 14 Jan 2021 17:32:17 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Commit bea449c635 conflicts with this on the change of the definition
> of DropRelFileNodeBuffers. The change simplified this patch by a bit:p

In this version, I got rid of the "CLEANUP FORK"s, and added a new
system "Smgr marks".  The mark files have the name of the
corresponding fork file followed by ".u" (which means "Uncommitted").
An "Uncommitted"-marked main fork means the same as the CLEANUP2_FORKNUM
and an uncommitted-marked init fork means the same as the CLEANUP_FORKNUM
in the previous version.

I noticed that the previous version of the patch still leaves an
orphan main fork file after "BEGIN; CREATE TABLE x; ROLLBACK; (crash
before checkpoint)" since the "mark" file (or CLEANUP2_FORKNUM) is
removed at rollback.  In this version the responsibility to remove the
mark files is moved to SyncPostCheckpoint, where main fork files are
actually removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 27ea96d84dfc2f3e0d62c4b8f7d20cc30771cf86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v6 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, currently it runs a heap rewrite, which causes a large amount of
file I/O.  This patch makes the command run without a heap rewrite.
In addition to that, SET LOGGED while wal_level > minimal emits WAL using
XLOG_FPI instead of a massive number of HEAP_INSERT records, which should be
smaller.

This also allows for the cleanup of files left behind by a crash of
the transaction that created them.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  52 +++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 520 +++--
 src/backend/commands/tablecmds.c   | 246 ++--
 src/backend/replication/basebackup.c   |   3 +-
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/fd.c  |   4 +-
 src/backend/storage/file/reinit.c  | 346 +++-
 src/backend/storage/smgr/md.c  |  92 -
 src/backend/storage/smgr/smgr.c|  32 ++
 src/backend/storage/sync/sync.c|  20 +-
 src/bin/pg_rewind/parsexlog.c  |  24 ++
 src/common/relpath.c   |  47 ++-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  42 +-
 src/include/common/relpath.h   |   9 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/fd.h   |   1 +
 src/include/storage/md.h   |   8 +-
 src/include/storage/reinit.h   |  10 +-
 src/include/storage/smgr.h |  17 +
 22 files changed, 1384 insertions(+), 206 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..d251f22207 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,49 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_MARK)
+	{
+		xl_smgr_mark *xlrec = (xl_smgr_mark *) rec;
+		char	   *path = GetRelationPath(xlrec->rnode.dbNode,
+		   xlrec->rnode.spcNode,
+		   xlrec->rnode.relNode,
+		   InvalidBackendId,
+		   xlrec->forkNum, xlrec->mark);
+		char	   *action;
+
+		switch (xlrec->action)
+		{
+			case XLOG_SMGR_MARK_CREATE:
+action = "CREATE";
+break;
+			case XLOG_SMGR_MARK_UNLINK:
+action = "DELETE";
+break;
+			default:
+action = "";
+break;
+		}
+
+		appendStringInfo(buf, "%s %s", action, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +98,15 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_MARK:
+			id = "MARK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..7cf77e4a02 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is 

RE: Global snapshots

2021-03-24 Thread tsunakawa.ta...@fujitsu.com
From: Andrey V. Lepikhov 
> Current state of the patch set rebased on master, 5aed6a1fc2.
> 
> It is a development version. Here some problems with visibility are still
> detected in two tests:
> 1. CSN Snapshot module - TAP test on time skew.
> 2. Clock SI implementation - TAP test on emulation of bank transaction.

I'm sorry to be late to respond.  Thank you for the update.

As discussed at the HighGo meeting, what do you think we should do about this 
patch set, now that we agreed that Clock-SI is covered by Microsoft's patent?  
I'd appreciate it if you could share some ideas on how to change part of the
algorithm and circumvent the patent.

Otherwise, why don't we discuss alternatives, such as the Commitment Ordering?

I have a hunch that YugabyteDB's method is promising; I wrote about it in the
following wiki.  Of course, we should make an effort to see whether it's
patented before diving deeper into the design or implementation.

Scaleout Design - PostgreSQL wiki
https://wiki.postgresql.org/wiki/Scaleout_Design


Regards
Takayuki Tsunakawa




Re: shared-memory based stats collector

2021-03-24 Thread Kyotaro Horiguchi
Thank you for all the help!


At Mon, 22 Mar 2021 16:17:37 -0700, Andres Freund  wrote in 
> Thanks! That change shouldn't be necessary on my branch - I did
> something to fix this kind of problem too. I decided that there's no
> point in doing hash table lookups for the database: It's not going to
> change in the life of a backend. So there's now two static "pending"

Right.

> entries: One for the current DB, one for the shared DB.  There's only
> one place that needed to change,
> pgstat_report_checksum_failures_in_db(), which now reports the changes
> directly instead of going via pending.
> I suspect we should actually do that with a number of other DB specific
> functions. Things like recovery conflicts, deadlocks, checksum failures
> imo should really not be delayed till later. And you should never have
> enough of them to make contention a concern.

Sounds reasonable.

> You can see a somewhat sensible list of changes from your v52 at
> https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22
> (I did fix some of damage from rebase in a non-incremental way, of course)
> 
> My branch: https://github.com/anarazel/postgres/tree/shmstat
> 
> It would be cool if you could check if there any relevant things between
> v52 and v56 that I should include.
> 
> 
> I think a lot of the concerns I had with the patch are addressed at the
> end of my series of changes.  Please let me know what you think.

I like the name "stats subsystem".

https://github.com/anarazel/postgres/commit/f28463601e93c68f4dd50fe930d29a54509cffc7

I'm impressed by the way you resolved "who should load stats". Using a
static shared memory area to hold the pointer to the existing DSA memory
resolves the "first attacher" problem.  I'm somewhat doubtful
about "who should write the stats file", but I think it is reasonable
in general.

But the current place of calling pgstat_write_stats() is a bit too
early.  The checkpointer reports some stats *after* calling
ShutdownXLOG().  Perhaps we need to move it after the pg_stat_report_*()
calls in HandleCheckpointerInterrupts().


Separating pgstat_backend_initialize() from pgstat_initialize()
allows us to initialize the stats subsystem earlier in autovacuum workers,
which looks nice.


https://github.com/anarazel/postgres/commit/3304ee1344f348e079b5eb208d76a2f1553e721c

>* Whenever the for a dropped stats entry could not be freed (because
>* backends still have references), this is incremented, causing 
> backends
>* to run pgstat_lookup_cache_gc(), allowing that memory to be 
> reclaimed.

"Whenever the  for a "

gc_count is incremented whenever *some stats hash entries are
removed*. Some of the delinked shared stats area might not be freed
due to references.

When a backend finds that gc_count has been incremented, it removes the
local hash entries corresponding to the delinked shared entries. If
the backend was the last referrer, it frees the shared area.
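
A minimal sketch of that protocol, with assumed variable names (only
pgstat_lookup_cache_gc() comes from the quoted comment):

    /* Sketch: backend-side check against the shared gc counter. */
    if (shared_gc_count != my_last_seen_gc_count)
    {
        pgstat_lookup_cache_gc();   /* drop local refs to delinked entries */
        my_last_seen_gc_count = shared_gc_count;
    }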


https://github.com/anarazel/postgres/commit/88ffb289860c7011e729cd0a1a01cda1899e6209

Ah, it sounds nice that refcount == 1 means it is to be dropped and no
one is referring to it. Thanks!

https://github.com/anarazel/postgres/commit/03824a236597c87c99d07aa14b9af9d6fe04dd37

+* XXX: Why is this a good place to do this?

Agreed. We don't need to be in such haste to clean up stats entries.  We could
run that in pgstat_report_stat()?


flush_walstat()

I found a mistake in an existing comment.

- * If nowait is true, this function returns false on lock failure. Otherwise
- * this function always returns true.
+ * If nowait is true, this function returns true on lock failure. Otherwise
+ * this function always returns false.


https://github.com/anarazel/postgres/commit/7bde068d8a512d918f76cfc88c1c10f1db8fe553
(pgstat_reset_replslot_counter())
+* AFIXME: pgstats has business no looking into slot.c structures at
+* this level of detail.

Does just moving the name resolution part to pgstatfuncs.c resolve it?
pgstat_report_replslot_drop() has been fixed in a similar way.


https://github.com/anarazel/postgres/commit/ded2198d93ce5944fc9d68031d86dd84944053f8

Yeah, I forcefully consolidated the replslot stats into the stats hash, but
I agree that it would be more natural for the replslot stats to be
fixed-size stats.

https://github.com/anarazel/postgres/commit/e2ef1931fb51da56a6ba483c960e034e52f90430

Agreed that it's better to move database stat entries to fixed pointers.

> My next step is going to be to squash all my changes into the base
> patch, and try to extract all the things that I think can be
> independently committed, and to reduce unnecessary diff noise.  Once
> that's done I plan to post that series to the list.
> 
> 
> TODO:
> 
> - explain the design at the top of pgstat.c
> 
> - figure out a way to deal with the different demands on stats
>   consistency / efficiency
> 
> - see how hard it'd be to not need collect_oids()
> 
> - split pgstat.c
> 
> - consider removing PgStatTypes and 

Re: Tying an object's ownership to datdba

2021-03-24 Thread Noah Misch
On Wed, Mar 24, 2021 at 11:57:37AM -0400, John Naylor wrote:
> On Mon, Dec 28, 2020 at 12:32 AM Noah Misch  wrote:
> > [v2]
> 
> Hi Noah,
> 
> In the refactoring patch, there is a lingering comment reference to 
> roles_has_privs_of(). Aside from that, it looks good to me. A possible thing 
> to consider is an assert that is_admin is not null where we expect that.

Thanks.  The next version will contain the comment fix and
"Assert(OidIsValid(admin_of) == PointerIsValid(is_admin));".

> The database owner role patch looks good as well.

Do you plan to put the CF entry into Ready for Committer, or should the
patches wait for another review?

> > I ended up blocking DDL that creates role memberships involving the new 
> > role;
> > see reasons in user.c comments.  Lifting those restrictions looked feasible,
> > but it was inessential to the mission, and avoiding unintended consequences
> > would have been tricky.  Views "information_schema.enabled_roles" and
> > "information_schema.applicable_roles" list any implicit membership in
> > pg_database_owner, but pg_catalog.pg_group and psql \dgS do not.  If this
> > leads any reviewer to look closely at applicable_roles, note that its 
> > behavior
> > doesn't match its documentation
> > (https://postgr.es/m/flat/20060728170615.gy20...@kenobi.snowman.net).
> 
> Is this something that needs fixing separately?

It is a bug, but I think fixing it is not very urgent and should happen
separately, if at all.




Re: wal stats questions

2021-03-24 Thread Andres Freund
Hi,

On 2021-03-25 10:51:56 +0900, Masahiro Ikeda wrote:
> On 2021/03/25 8:22, Andres Freund wrote:
> > 1) What is the motivation to have both prevWalUsage and pgWalUsage,
> >instead of just accumulating the stats since the last submission?
> >There doesn't seem to be any comment explaining it? Computing
> >differences to previous values, and copying to prev*, isn't free. I
> >assume this is out of a fear that the stats could get reset before
> >they're used for instrument.c purposes - but who knows?
>
> Is your point that it's better to call pgWalUsage = 0; after sending the
> stats?

Yes. At least there should be a comment explaining why it's done the way
it is.



> > 3) Doing if (memcmp(, _zeroes, sizeof(PgStat_MsgWal)) == 0)
> >just to figure out if there's been any changes isn't all that
> >cheap. This is regularly exercised in read-only workloads too. Seems
> >adding a boolean WalStats.have_pending = true or such would be
> >better.
>
> I understood that for backends this may lead to performance degradation, and
> this problem is not only for the WalStats but also the SLRUStats.
>
> I wonder whether the degradation is that big, because pgstat_report_stat()
> checks whether at least PGSTAT_STAT_INTERVAL msec has passed since it last
> sent stats before checking the diff. If my understanding is correct, getting
> a timestamp is more expensive.

Getting a timestamp is expensive, yes. But I think we need to check for
the no-pending-wal-stats *before* the clock check. See the below:


> > 4) For plain backends pgstat_report_wal() is called by
> >pgstat_report_stat() - but it is not checked as part of the "Don't
> >expend a clock check if nothing to do" check at the top.  It's
> >probably rare to have pending wal stats without also passing one of
> >the other conditions, but ...
>
> (I'm not confident my understanding of your comment is right.)
> Do you mean that we need to expend a clock check in pgstat_report_wal()?
> I think it's unnecessary because pgstat_report_stat() has already checked it.

No, I mean that right now, for normal backends in some workloads, we can
erroneously return early without ever reaching pgstat_report_wal():

void
pgstat_report_stat(bool disconnect)
...
/* Don't expend a clock check if nothing to do */
if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
!have_function_stats && !disconnect)
return;

might return even if there is only pending WAL activity. This needs to check
whether there are pending WAL stats, which in turn means that the check
should be fast.
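
A minimal sketch of that, assuming a new boolean flag (the name
WalStats.have_pending is made up here, not from any posted patch):

    /* Sketch: make pending WAL activity part of the cheap fast-path check. */
    if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
        pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
        !have_function_stats && !WalStats.have_pending && !disconnect)
        return;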

Greetings,

Andres Freund




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra
On 3/25/21 1:05 AM, Tomas Vondra wrote:
> ...
>
> 0002 is an attempt to fix an issue I noticed today - we need to handle
> type changes. Until now we did not have problems with that, because we
> only had attnums - so we just reset the statistics (with the exception
> of functional dependencies, on the assumption that those remain valid).
> 
> With expressions it's a bit more complicated, though.
> 
> 1) we need to transform the expressions so that the Vars contain the
> right type info etc. Otherwise an analyze with the old pg_node_tree crashes
> 
> 2) we need to reset the pg_statistic[] data too, which however makes
> keeping the functional dependencies a bit less useful, because those
> rely on the expression stats :-(
> 
> So I'm wondering what to do about this. I looked into how ALTER TABLE
> handles indexes, and 0003 is a PoC to do the same thing for statistics.
> Of course, this is a bit unfortunate because it recreates the statistics
> (so we don't keep anything, not even functional dependencies).
> 
> I think we have two options:
> 
> a) Make UpdateStatisticsForTypeChange smarter to also transform and
> update the expression string, and reset pg_statistics[] data.
> 
> b) Just recreate the statistics, just like we do for indexes. Currently
> this does not force analyze, so it just resets all the stats. Maybe it
> should do analyze, though.
> 
> Any opinions? I need to think about this a bit more, but maybe (b) with
> the analyze is the right thing to do. Keeping just some of the stats
> always seemed a bit weird. (This is why the 0002 patch breaks one of the
> regression tests.)
> 

After thinking about this a bit more I think (b) is the right choice,
and the analyze is not necessary. The reason is fairly simple - we drop
the per-column statistics, because ATExecAlterColumnType does

RemoveStatistics(RelationGetRelid(rel), attnum);

so the user has to run analyze anyway, to get any reasonable estimates
(we keep the functional dependencies, but those still rely on per-column
statistics quite a bit). And we'll have to do the same thing with
per-expression stats too. It was a nice idea to keep at least the stats
that are not outright broken, but unfortunately it's not a very useful
optimization. It increases the instability of the system, because now we
have estimates with all statistics, no statistics, and something in
between after the partial reset. Not nice.

So my plan is to get rid of UpdateStatisticsForTypeChange, and just do
mostly what we do for indexes. It's not perfect (as demonstrated in the last
message), but that'd apply even to option (a).

Any better ideas?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Amul Sul
On Wed, Mar 24, 2021 at 8:09 PM Tom Lane  wrote:
>
> Amul Sul  writes:
> > On Tue, Mar 23, 2021 at 8:59 PM Tom Lane  wrote:
> >> On closer inspection, I believe the true culprit is c6b92041d,
> >> which did this:
> >> -   heap_sync(state->rs_new_rel);
> >> +   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
> >> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> > So we also need to make sure of the
> > RelationOpenSmgr() call before smgrimmedsync() as proposed previously.
>
> I wonder if we should try to get rid of this sort of bug by banning
> direct references to rd_smgr?  That is, write the above and all
> similar code like
>
> smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
>
> where we provide something like
>
> static inline struct SMgrRelationData *
> RelationGetSmgr(Relation rel)
> {
> if (unlikely(rel->rd_smgr == NULL))
> RelationOpenSmgr(rel);
> return rel->rd_smgr;
> }
>
> and then we could get rid of most or all other RelationOpenSmgr calls.
>

+1

> This might create more code bloat than it's really worth, but
> it'd be a simple and mechanically-checkable scheme.

I think that will be fine, one-time pain. If you want I will do those changes.

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

Regards,
Amul




Re: multi-install PostgresNode

2021-03-24 Thread Michael Paquier
On Wed, Mar 24, 2021 at 03:33:51PM -0300, Alvaro Herrera wrote:
> On 2021-Mar-24, Andrew Dunstan wrote:
>> +# The install path set in get_new_node needs to be a directory containing
>> +# bin and lib subdirectories as in a standard PostgreSQL installation, so 
>> this
>> +# can't be used with installations where the bin and lib directories don't 
>> have
>> +# a common parent directory.
> 
> I've never heard of an installation where that wasn't true.  If there
> was a need for that, seems like it'd be possible to set them with
> { bindir => ..., libdir => ...} but I doubt it'll ever be necessary.

This would imply an installation with some fancy --bindir or --libdir
specified in ./configure.  Never say never, but I also think that what
has been committed is fine.  And the result is simple, that's really
cool.  So now pg_upgrade's test.sh can be switched to a TAP test.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Move tablespace path re-creation from the makefiles to pg_regres

2021-03-24 Thread Noah Misch
On Thu, Mar 25, 2021 at 07:44:02AM +0900, Michael Paquier wrote:
> On Wed, Mar 24, 2021 at 10:50:50AM -0400, Robert Haas wrote:
> > On Wed, Mar 24, 2021 at 5:56 AM Christoph Berg  wrote:
> >> Maybe creating the tablespace directory from within the testsuite
> >> would suffice?
> >>
> >> CREATE TABLE foo (t text);
> >> COPY foo FROM PROGRAM 'mkdir @testtablespace@';
> >> CREATE TABLESPACE regress_tblspace LOCATION '@testtablespace@';
> > 
> > Would that work on Windows?

That would entail forbidding various shell metacharacters in @testtablespace@.
One could avoid imposing such a restriction by adding a mkdir() function to
regress.c and writing it like:

CREATE FUNCTION mkdir(text)
   RETURNS void AS '@libdir@/regress@DLSUFFIX@', 'regress_mkdir'
   LANGUAGE C STRICT\;
  REVOKE ALL ON FUNCTION mkdir FROM public;
SELECT mkdir('@testtablespace@');
CREATE TABLESPACE regress_tblspace LOCATION '@testtablespace@';
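
A minimal sketch of such a helper, under the assumption that it lives in
regress.c and uses the usual V1 calling conventions (this is not committed
code):

    #include "postgres.h"

    #include <sys/stat.h>

    #include "fmgr.h"
    #include "utils/builtins.h"

    PG_FUNCTION_INFO_V1(regress_mkdir);

    /* Create the directory named by the text argument, ignoring EEXIST. */
    Datum
    regress_mkdir(PG_FUNCTION_ARGS)
    {
        char       *path = text_to_cstring(PG_GETARG_TEXT_PP(0));

        if (mkdir(path, S_IRWXU) < 0 && errno != EEXIST)
            elog(ERROR, "could not create directory \"%s\": %m", path);

        PG_RETURN_VOID();
    }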

> I doubt that all the Windows environments would be able to get their
> hands on that.

> And I am not sure either that this would work when it
> comes to the CI case, again on Windows.

How might a CI provider break that?




Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-03-24 Thread Masahiro Ikeda



On 2021/03/23 16:10, Fujii Masao wrote:
> 
> 
> On 2021/03/22 20:25, ikedamsh wrote:
>> Agreed. Users can know whether the stats is for walreceiver or not. The
>> pg_stat_wal view in standby server shows for the walreceiver, and in primary
>> server it shows for the others. So, I updated the document.
>> (v20-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)
> 
> Thanks for updating the docs!
> 
> There was the discussion about when the stats collector is invoked, at [1].
> Currently during archive recovery or standby, the stats collector is
> invoked when the startup process reaches the consistent state, sends
> PMSIGNAL_BEGIN_HOT_STANDBY, and then the system is starting accepting
> read-only connections. But walreceiver can be invoked at earlier stage.
> This can cause walreceiver to generate and send the statistics about WAL
> writing even though the stats collector has not been running yet. This might
> be problematic? If so, maybe we need to ensure that the stats collector is
> invoked before walreceiver?
> 
> During recovery, the stats collector is not invoked if hot standby mode is
> disabled. But walreceiver can be running in this case. So probably we should
> change walreceiver so that it's invoked even when hot standby is disabled?
> Otherwise we cannot collect the statistics about WAL writing by walreceiver
> in that case.
> 
> [1]
> https://postgr.es/m/e5a982a5-8bb4-5a10-cf9a-40dd1921b...@oss.nttdata.com

Thanks for the comments! I didn't notice that.
As I mentioned in [1], if my understanding is right, this issue seems to be
not only for the wal receiver.

Since the shared memory stats thread already handles these issues, does this
patch, which collects the stats for the wal receiver and makes a common
function for writing WAL files, have to be committed after the patches for
shared memory stats are committed?  Or should we handle the issues in this
thread, because we don't know when the shared memory stats patches will be
committed?

I think the former is better, because collecting stats in shared memory is a
very useful feature for users and it makes a big change in design. So, I think
it's beneficial to make an effort to move the shared memory stats thread
forward (by reviewing or testing) instead of handling the issues in this
thread.

[1]
https://www.postgresql.org/message-id/9f4e19ad-518d-b91a-e500-25a666471c42%40oss.nttdata.com

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: [HACKERS] logical decoding of two-phase transactions

2021-03-24 Thread Peter Smith
On Wed, Mar 24, 2021 at 11:31 PM Amit Kapila  wrote:
>
> On Tue, Mar 23, 2021 at 3:31 PM Peter Smith  wrote:
> >
> > On Tue, Mar 23, 2021 at 10:44 AM Peter Smith  wrote:
> > >
> > > PSA patches to fix those.
> > >
> >
> > Hi Amit.
> >
> > PSA a patch to allow the ALTER SUBSCRIPTION ... REFRESH PUBLICATION to
> > work when two-phase tristate is PENDING.
> >
> > This is necessary for the pg_dump/pg_restore scenario, or for any
> > other use-case where the subscription might
> > start off having no tables.
> >
>
> + subrels = GetSubscriptionRelations(MySubscription->oid);
> +
> + /*
> + * If there are no tables then leave the state as PENDING, which
> + * allows ALTER SUBSCRIPTION ... REFRESH PUBLICATION to work.
> + */
> + become_two_phase_enabled = list_length(subrels) > 0;
>
> This code is similar at both the places it is used. Isn't it better to
> move this inside AllTablesyncsReady and if required then we can change
> the name of the function.

I agree. That way is better.

PSA a patch which changes the AllTableSyncsReady function to now
include the zero tables check.

(This patch is to be applied on top of all previous patches)

--
Kind Regards,
Peter Smith.
Fujitsu Australia.


v66-0005-Change-AllTablesyncsReady-to-return-false-when-0.patch
Description: Binary data


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-24 Thread Julien Rouhaud
On Wed, Mar 24, 2021 at 01:02:00PM -0300, Alvaro Herrera wrote:
> On 2021-Mar-24, Julien Rouhaud wrote:
> 
> > From e08c9d5fc86ba722844d97000798de868890aba3 Mon Sep 17 00:00:00 2001
> > From: Bruce Momjian 
> > Date: Mon, 22 Mar 2021 17:43:23 -0400
> > Subject: [PATCH v20 2/3] Expose queryid in pg_stat_activity and
> 
> >  src/backend/executor/execMain.c   |   9 ++
> >  src/backend/executor/execParallel.c   |  14 ++-
> >  src/backend/executor/nodeGather.c |   3 +-
> >  src/backend/executor/nodeGatherMerge.c|   4 +-
> 
> Hmm...
> 
> I find it odd that there's executor code that acquires the current query
> ID from pgstat, after having been put there by planner or ExecutorStart
> itself.  Seems like a modularity violation.  I wonder if it would make
> more sense to have the value maybe in struct EState (or perhaps there's
> a better place -- but I don't think they have a way to reach the
> QueryDesc anyhow), put there by ExecutorStart, so that places such as
> execParallel, nodeGather etc don't have to fetch it from pgstat but from
> EState.

The current queryid is already available in the EState, as the underlying
PlannedStmt contains it.  The problem is that we want to display the top-level
queryid, not the current query's, and the top-level queryid is held in
pgstat.




Proposal for col LIKE $1 with generic Plan

2021-03-24 Thread Andy Fan
Thanks to get_index_clause_from_support, we can use an index for the
WHERE a LIKE 'abc%' case. However, this currently only works with a custom
plan. Think about the case where the planning takes lots of time; a custom
plan will not give a good result, so I want to see if we should support this
for a generic plan as well.

The first step of this is that we need to find an operator expressing that
the prefix is just a literal, which means '%' is just '%', not matching any
characters. After trying the 'like' and '~' operators, I find that neither of
them can be used. For example:

PREPARE s AS SELECT * FROM t WHERE a LIKE ('^' || $1);
EXECUTE s('%abc');

'%' is still a special character matching any characters, and so it is for
'~'.  So I think we need to define a new operator like text(a) ~^ text(b),
which means a is prefixed with b literally. For example:

'abc' ~^ 'ab' -> true
'abc' ~^ 'ab%' -> false

so the above case can be written as:

PREPARE s AS SELECT * FROM t WHERE a ~^ $1;

The second step is that we need to define a greater string for $1, just like
make_greater_string does.  It looks like we have to know the exact value to
make the greater string, so we can define a new FuncExpr like this:

Index Cond: ((md5 >= $1::text) AND (md5 < make_greater_string_fn($1)).

This may not be able to handle the case where make_greater_string currently
returns NULL, so we may define another FuncExpr like text_less_than_or_null:

Index Cond: ((md5 >= $1::text) AND (text_less_than_or_null(md5,
make_greater_string_fn($1)))

Is this the right thing to do, and the right method?

Thanks
-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: making update/delete of inheritance trees scale better

2021-03-24 Thread Amit Langote
On Thu, Mar 25, 2021 at 4:22 AM Tom Lane  wrote:
> I wrote:
> > Yeah, it's on my to-do list for this CF, but I expect it's going to
> > take some concentrated study and other things keep intruding :-(.
>
> I've started to look at this seriously,

Thanks a lot.

> and I wanted to make a note
> about something that I think we should try to do, but there seems
> little hope of shoehorning it in for v14.
>
> That "something" is that ideally, the ModifyTable node should pass
> only the updated column values to the table AM or FDW, and let that
> lower-level code worry about reconstructing a full tuple by
> re-fetching the unmodified columns.  When I first proposed this
> concept, I'd imagined that maybe we could avoid the additional tuple
> read that this implementation requires by combining it with the
> tuple access that a heap UPDATE must do anyway to lock and outdate
> the target tuple.  Another example of a possible win is Andres'
> comment upthread

(Ah, I think you mean Heikki's.)

> that a columnar-storage AM would really rather
> deal only with the modified columns.  Also, the patch as it stands
> asks FDWs to supply all columns in a whole-row junk var, which is
> something that might become unnecessary.

That would indeed be nice.  I had considered taking on the project to
revise FDW local (non-direct) update APIs to deal with being passed
only the values of changed columns, but hit some problems when
implementing that in postgres_fdw that I don't quite remember the
details of.  As you say below, we can pick that up later.

> However, there are big stumbling blocks in the way.  ModifyTable
> is responsible for applying CHECK constraints, which may require
> looking at the values of not-modified columns.  An even bigger
> problem is that per-row triggers currently expect to be given
> the whole proposed row (and to be able to change columns not
> already marked for update).  We could imagine redefining the
> trigger APIs to reduce the overhead here, but that's certainly
> not going to happen in time for v14.

Yeah, at least the trigger concerns look like they will take work we
better not do in the 2 weeks left in the v14 cycle.

> So for now I think we have to stick with Amit's design of
> reconstructing the full updated tuple at the outset in
> ModifyTable, and then proceeding pretty much as updates
> have done in the past.  But there's more we can do later.

Agreed.

I'm addressing Robert's comments and will post an updated patch by tomorrow.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: wal stats questions

2021-03-24 Thread Masahiro Ikeda



On 2021/03/25 8:22, Andres Freund wrote:
> Hi,
> 
> I got a few questions about the wal stats while working on the shmem
> stats patch:

Thanks for your reviews.


> 1) What is the motivation to have both prevWalUsage and pgWalUsage,
>instead of just accumulating the stats since the last submission?
>There doesn't seem to be any comment explaining it? Computing
>differences to previous values, and copying to prev*, isn't free. I
>assume this is out of a fear that the stats could get reset before
>they're used for instrument.c purposes - but who knows?

Is your point that it's better to call pgWalUsage = 0; after sending the
stats? My understanding is the same as your assumption. For example,
pg_stat_statements.c uses pgWalUsage and calculates the diff.

But, because the stats may be sent after the transaction is finished, it
doesn't seem to lead to wrong stats if pgWalUsage = 0 is called. So, I agree
with your suggestion.

If an extension wants to know the WAL usage diff across two transactions,
that may lead to wrong stats, but I think it won't happen.


> 2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
>former being used by wal writer, the latter by most other processes?
>There again don't seem to be comments explaining this.

To control the transmission interval for the wal writer, because it may send
the stats too frequently, and to skip calculating the generated WAL stats
because it doesn't generate WAL records. But now I think it's better to merge
them.



> 3) Doing if (memcmp(, _zeroes, sizeof(PgStat_MsgWal)) == 0)
>just to figure out if there's been any changes isn't all that
>cheap. This is regularly exercised in read-only workloads too. Seems
>adding a boolean WalStats.have_pending = true or such would be
>better.

I understood that for backends this may lead to performance degradation, and
this problem is not only for the WalStats but also the SLRUStats.

I wonder whether the degradation is that big, because pgstat_report_stat()
checks whether at least PGSTAT_STAT_INTERVAL msec has passed since it last
sent stats before checking the diff. If my understanding is correct, getting
a timestamp is more expensive.



> 4) For plain backends pgstat_report_wal() is called by
>pgstat_report_stat() - but it is not checked as part of the "Don't
>expend a clock check if nothing to do" check at the top.  It's
>probably rare to have pending wal stats without also passing one of
>the other conditions, but ...

(I'm not confident my understanding of your comment is right.)
Do you mean that we need to expend a clock check in pgstat_report_wal()?
I think it's unnecessary because pgstat_report_stat() has already checked it.



> Generally the various patches seems to to have a lot of the boilerplate
> style comments (like "Prepare and send the message"), but very little in
> the way of explaining the design.

Sorry for that. I'll be careful.


Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-24 Thread Fujii Masao




On 2021/03/24 22:06, James Coleman wrote:

That looks good to me. Thanks for working on this.


Thanks! I pushed the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: MultiXact\SLRU buffers configuration

2021-03-24 Thread Thomas Munro
On Thu, Mar 25, 2021 at 10:31 AM Thomas Munro  wrote:
> We already know that increasing the number of CLOG buffers above the
> current number hurts as the linear search begins to dominate
> (according to the commit message for 5364b357), and it doesn't seem
> great to ship a new feature that melts your CPU when you turn it up.
> Perhaps, to ship this, we need to introduce a buffer mapping table?  I
> have attached a "one coffee" attempt at that, on top of your v10 patch
> (unmodified), for discussion.  It survives basic testing but I don't
> know how it performs.

Hrrr... Cfbot showed an assertion failure.  Here's the two coffee
version with a couple of silly mistakes fixed.
From 4817d16cfb6704d43a7bef12648e753d239c809c Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v11 1/2] Make all SLRU buffer sizes configurable

---
 doc/src/sgml/config.sgml  | 108 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 ++
 src/backend/utils/misc/guc.c  |  77 +
 src/backend/utils/misc/postgresql.conf.sample |  16 +++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   8 ++
 src/include/storage/predicate.h   |   4 -
 15 files changed, 240 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..0adcf0efaf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,114 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_slru_buffers (integer)
+  
+   multixact_offsets_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+are used to store information about offsets of multiple row lockers (caused by SELECT FOR UPDATE and others).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_slru_buffers (integer)
+  
+   multixact_members_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact members. MultiXact members
+are used to store information about XIDs of multiple row lockers. Typically multixact_members_slru_buffers
+is twice more than multixact_offsets_slru_buffers.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_slru_buffers (integer)
+  
+   subtrans_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for subtransactions.
+It defaults to 256 kilobytes (256KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  notify_slru_buffers (integer)
+  
+   notify_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for asynchronous notifications (NOTIFY, LISTEN).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  serial_slru_buffers (integer)
+  
+   serial_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for predicate locks.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  clog_slru_buffers (integer)
+  
+   clog_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for CLOG.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  commit_ts_slru_buffers (integer)
+  
+   commit_ts_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for commit timestamps.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
 
  
   max_stack_depth (integer)
diff --git a/src/backend/access/transam/clog.c 

Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-24 Thread Andres Freund
Hi,

On 2021-03-24 18:36:14 +0900, Fujii Masao wrote:
> Barring any objection, I'm thinking to commit this patch.

To which branches?  I am *strongly* opposed to backpatching this.


>  /*
> - * Simple signal handler for exiting quickly as if due to a crash.
> + * Simple signal handler for processes which have not yet touched or do not
> + * touch shared memory to exit quickly.
>   *
> - * Normally, this would be used for handling SIGQUIT.
> + * Note that if processes already touched shared memory, use
> + * SignalHandlerForCrashExit() instead and force the postmaster into
> + * a system reset cycle because shared memory may be corrupted.
> + */
> +void
> +SignalHandlerForNonCrashExit(SIGNAL_ARGS)
> +{
> + /*
> +  * Since we don't touch shared memory, we can just pull the plug and 
> exit
> +  * without running any atexit handlers.
> +  */
> + _exit(1);
> +}

This strikes me as quite a misleading function name. Outside of very
narrow circumstances a normal exit shouldn't use _exit(1). Neither 1 nor
_exit() (as opposed to exit()) seems appropriate. This seems like a bad
idea.

Also, won't this lead to postmaster now starting to complain about
pgstat having crashed in an immediate shutdown? Right now only exit
status 0 is expected:

if (pid == PgStatPID)
{
PgStatPID = 0;
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("statistics collector 
process"),
 pid, exitstatus);
if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
PgStatPID = pgstat_start();
continue;
}



> + * The statistics collector is started by the postmaster as soon as the
> + * startup subprocess finishes, or as soon as the postmaster is ready
> + * to accept read only connections during archive recovery.  It remains
> + * alive until the postmaster commands it to terminate.  Normal
> + * termination is by SIGUSR2 after the checkpointer must exit(0),
> + * which instructs the statistics collector to save the final statistics
> + * to reuse at next startup and then exit(0).

I can't parse "...after the checkpointer must exit(0), which instructs..."


> + * Emergency termination is by SIGQUIT; like any backend, the statistics
> + * collector will exit quickly without saving the final statistics.  It's
> + * ok because the startup process will remove all statistics at next
> + * startup after emergency termination.

Normally there won't be any stats to remove, no? The permanent stats
file has been removed when the stats collector started up.


Greetings,

Andres Freund




Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-24 Thread Masahiro Ikeda
On 2021/03/24 18:36, Fujii Masao wrote:
> On 2021/03/23 15:50, Fujii Masao wrote:
>> + * The statistics collector is started by the postmaster as soon as the
>> + * startup subprocess finishes.
>>
>> This comment needs to be updated? Because the stats collector can
>> be invoked when the startup process sends PMSIGNAL_BEGIN_HOT_STANDBY
>> signal.
> 
> I updated this comment in the patch.
> Attached is the updated version of the patch.
> 
> Barring any objection, I'm thinking to commit this patch.

Thanks for your comments and updating the patch!
I checked your patch and I don't have any comments.

Regards,

-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: make the stats collector shutdown without writing the statsfiles if the immediate shutdown is requested.

2021-03-24 Thread Masahiro Ikeda



On 2021/03/24 18:36, Fujii Masao wrote:
> 
> 
> On 2021/03/24 3:51, Andres Freund wrote:
>> Hi,
>>
>> On 2021-03-23 15:50:46 +0900, Fujii Masao wrote:
>>> This fact makes me wonder that if we collect the statistics about WAL 
>>> writing
>>> from walreceiver as we discussed in other thread, the stats collector should
>>> be invoked at more earlier stage. IIUC walreceiver can be invoked before
>>> PMSIGNAL_BEGIN_HOT_STANDBY is sent.
>>
>> FWIW, in the shared memory stats patch the stats subsystem is
>> initialized early on by the startup process.
> 
> This is good news!

Fujii-san, Andres-san,
Thanks for your comments!

I didn't think about the start order. From that point of view, I noticed that
the current source code has two other concerns.


1. This problem is not only for the wal receiver.

The problem that the wal receiver starts before the stats collector
is launched during archive recovery applies not only to the wal receiver but
also to the checkpointer and the bgwriter. Before starting redo, the startup
process sends the postmaster the "PMSIGNAL_RECOVERY_STARTED" signal to launch
the checkpointer and the bgwriter so that they can create restartpoints.

Although the socket for communication between the stats collector and the
other processes is created at an earlier stage via pgstat_init(), I agree
that making the stats collector start at an earlier stage is defensive. BTW,
in my environment (Linux, net.core.rmem_default = 212992), the socket can
buffer almost 300 WAL stats messages. This means that, as you said, if the
redo phase is too long, messages can easily be lost.


2. To make the stats clear in redo phase.

The statistics can be reset after the wal receiver, the checkpointer and
the wal writer are started in redo phase. So, it's not enough the stats
collector is invoked at more earlier stage. We need to fix it.



(I hope I am not missing something.)
Thanks to Andres-san's work ([1]), the above problems will be handled in the
shared memory stats patch. The first problem will be resolved since the stats
are collected in shared memory, so the stats collector process itself is
unnecessary. The second problem will be resolved by removing the reset code,
because the temporary stats file won't be generated, and if the permanent
stats file is corrupted, we can just recreate it.

[1]
https://github.com/anarazel/postgres/compare/master...shmstat-before-split-2021-03-22

Regards,
-- 
Masahiro Ikeda
NTT DATA CORPORATION




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Thu, Mar 25, 2021 at 01:05:37AM +0100, Tomas Vondra wrote:
> here's an updated patch. 0001 should address most of the today's review
> items regarding comments etc.

This is still an issue:

postgres=# CREATE STATISTICS xt ON a FROM t JOIN t ON true;
ERROR:  schema "i" does not exist

-- 
Justin




Re: popcount

2021-03-24 Thread David Fetter
On Tue, Mar 23, 2021 at 10:51:08AM +0100, Peter Eisentraut wrote:
> On 21.03.21 02:31, David Fetter wrote:
> > > I have now read the entire internet on what a suitable name for this
> > > function could be.  I think the emerging winner is BIT_COUNT(), which
> > > already exists in MySQL, and also in Python (int.bit_count()) and Java
> > > (Integer.bitCount()).
> > 
> > Thanks for doing this tedious work. Please find attached the next
> > version of the patch.
> 
> committing, with some polishing

Thanks!

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Jacob Champion
On Wed, 2021-03-24 at 14:10 -0400, Stephen Frost wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > I could see this being a problem if two client certificate nicknames
> > collide across multiple in-use databases, maybe?
> 
> Right, in such a case either cert might get returned and it's possible
> that the "wrong" one is returned and therefore the connection would end
> up failing, assuming that they aren't actually the same and just happen
> to be in both.
> 
> Seems like we could use SECMOD_OpenUserDB() and then pass the result
> from that into PK11_ListCertsInSlot() and scan through the certs in just
> the specified database to find the one we're looking for if we really
> feel compelled to try and address this risk.  I've reached out to the
> NSS folks to see if they have any thoughts about the best way to address
> this.

Some additional findings (NSS 3.63), please correct me if I've made any 
mistakes:

The very first NSSInitContext created is special. If it contains a database, 
that database will be considered part of the "internal" slot and its 
certificates can be referenced directly by nickname. If it doesn't have a 
database, the internal slot has no certificates, and it will continue to have 
zero certificates until NSS is completely shut down and reinitialized with a 
new "first" context.

Databases that are opened *after* the first one are given their own separate 
slots. Any certificates that are part of those databases seemingly can't be 
referenced directly by nickname. They have to be prefixed by their token name 
-- a name which you don't have if you used NSS_InitContext() to create the 
database. You have to use SECMOD_OpenUserDB() instead. This explains some 
strange failures I was seeing in local testing, where the order of InitContext 
determined whether our client certificate selection succeeded or failed.

If you SECMOD_OpenUserDB() a database that is identical to the first (internal) 
database, NSS deduplicates for you and just returns the internal slot. Which 
seems like it's helpful, except you're not allowed to close that database, and 
you have to know not to close it by checking to see whether that slot is the 
"internal key slot". It appears to remain open until NSS is shut down entirely.
But if you open a database that is *not* the magic internal database,
and then open a duplicate of that one, NSS creates yet another new slot
for the duplicate. So SECMOD_OpenUserDB() may or may not be a resource
hog, depending on the global state of the process at the time libpq
opens its first connection. We won't be able to control what the parent
application will do before loading us up.

It also doesn't look like any of the SECMOD_* machinery that we're
looking at is thread-safe, but I'd really like to be wrong...

--Jacob


Re: Refactor SSL test framework to support multiple TLS libraries

2021-03-24 Thread Daniel Gustafsson
> On 25 Mar 2021, at 00:26, Alvaro Herrera  wrote:
> 
> On 2021-Mar-25, Daniel Gustafsson wrote:
> 
>> Attached is a v2 which addresses the comments raised on the main NSS thread, as
>> well as introduces named parameters for the server cert function to make the
>> test code easier to read.
> 
> I don't like this patch.  I think your SSL::Server::OpenSSL and
> SSL::Server::NSS packages should be doing "use parent SSL::Server";
> having SSL::Server grow a line to "use" its subclass
> SSL::Server::OpenSSL and import its get_new_openssl_backend() method
> seems to go against the grain.

I'm far from skilled at Perl module inheritance but that makes sense, I'll take
a stab at that after some sleep and coffee.

--
Daniel Gustafsson   https://vmware.com/





Re: Refactor SSL test framework to support multiple TLS libraries

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-25, Daniel Gustafsson wrote:

> Attached is a v2 which addresses the comments raised on the main NSS thread, as
> well as introduces named parameters for the server cert function to make the
> test code easier to read.

I don't like this patch.  I think your SSL::Server::OpenSSL and
SSL::Server::NSS packages should be doing "use parent SSL::Server";
having SSL::Server grow a line to "use" its subclass
SSL::Server::OpenSSL and import its get_new_openssl_backend() method
seems to go against the grain.

-- 
Álvaro Herrera39°49'30"S 73°17'W




wal stats questions

2021-03-24 Thread Andres Freund
Hi,

I got a few questions about the wal stats while working on the shmem
stats patch:

1) What is the motivation to have both prevWalUsage and pgWalUsage,
   instead of just accumulating the stats since the last submission?
   There doesn't seem to be any comment explaining it? Computing
   differences to previous values, and copying to prev*, isn't free. I
   assume this is out of a fear that the stats could get reset before
   they're used for instrument.c purposes - but who knows?

2) Why is there both pgstat_send_wal() and pgstat_report_wal()? With the
   former being used by wal writer, the latter by most other processes?
   There again don't seem to be comments explaining this.

3) Doing if (memcmp(&WalStats, &all_zeroes, sizeof(PgStat_MsgWal)) == 0)
   just to figure out if there's been any changes isn't all that
   cheap. This is regularly exercised in read-only workloads too. Seems
   adding a boolean WalStats.have_pending = true or such would be
   better.

4) For plain backends pgstat_report_wal() is called by
   pgstat_report_stat() - but it is not checked as part of the "Don't
   expend a clock check if nothing to do" check at the top.  It's
   probably rare to have pending wal stats without also passing one of
   the other conditions, but ...
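
To make the suggestion in point 3 concrete, here is a minimal sketch of the
boolean-flag idea; the flag name, the counter being bumped, and the function
layout are illustrative only, not taken from any posted patch:

/* Illustrative sketch: track whether WalStats has anything to send. */
static bool walstats_pending = false;

static inline void
count_wal_record(void)
{
	WalStats.m_wal_records++;	/* whichever counter is being bumped */
	walstats_pending = true;	/* cheap "something changed" marker */
}

void
pgstat_send_wal(void)
{
	if (!walstats_pending)
		return;		/* nothing changed: skip the memcmp and the message */

	/* ... fill in and send the PgStat_MsgWal message as today ... */

	memset(&WalStats, 0, sizeof(WalStats));
	walstats_pending = false;
}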

Generally the various patches seem to have a lot of boilerplate-style
comments (like "Prepare and send the message"), but very little in
the way of explaining the design.

Greetings,

Andres Freund




Re: Refactor SSL test framework to support multiple TLS libraries

2021-03-24 Thread Daniel Gustafsson
Attached is a v2 which addresses the comments raised on the main NSS thread, as
well as introduces named parameters for the server cert function to make the
test code easier to read.

--
Daniel Gustafsson   https://vmware.com/



v2-0001-Refactor-SSL-testharness-for-multiple-library.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2021-03-24 Thread Fabrízio de Royes Mello
On Wed, Mar 24, 2021 at 3:57 AM Drouvot, Bertrand 
wrote:
>
> Thanks for pointing out, fixed in v14 attached.
>

Thanks... now everything is working as expected... changed the status to
Ready for Committer:
https://commitfest.postgresql.org/32/2968/

Regards,

-- 
   Fabrízio de Royes Mello
   PostgreSQL Developer at OnGres Inc. - https://ongres.com


Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Daniel Gustafsson
> On 24 Mar 2021, at 04:54, Michael Paquier  wrote:
> 
> On Tue, Mar 23, 2021 at 12:38:50AM +0100, Daniel Gustafsson wrote:
>> Thanks again for reviewing, another version which addresses the remaining
>> issues will be posted soon but I wanted to get this out to give further 
>> reviews
>> something that properly works.
> 
> I have been looking at the infrastructure of the tests, patches 0002
> (some refactoring) and 0003 (more refactoring with tests for NSS), and
> I am a bit confused by its state.
> 
> First, I think that the split is not completely clear.  For example,
> patch 0003 has changes for OpenSSL.pm and Server.pm, but wouldn't it
> be better to have all the refactoring infrastructure only in 0002,
> with 0003 introducing only the NSS pieces for its internal data and
> NSS.pm?

Yes.  Juggling a patchset of this size is error-prone.  This is why I opened the
separate thread for this where the patch can be held apart cleaner, so let's
take this discussion over there.  I will post an updated patch there shortly.

> +   keyfile => 'server-password',
> +   nssdatabase => 'server-cn-only.crt__server-password.key.db',
> +   passphrase_cmd => 'echo secret1',
> 001_ssltests.pl and 002_scram.pl have NSS-related parameters, which
> does not look like a clean separation to me as there are OpenSSL tests
> that use some NSS parts, and the main scripts should remain neutral in
> terms setting contents, including only variables and callbacks that
> should be filled specifically for each SSL implementation, no?  Aren't
> we missing a second piece here with a set of callbacks for the
> per-library test paths then?

Well, then again, keyfile is an OpenSSL specific parameter, it just happens to
be named quite neutrally.  I'm not sure how to best express the certificate and
key requirements of a test since the testcase is the source of truth in terms
of what it requires.  If we introduce a standard set of cert/keys which all
backends are required to supply, we could refer to those.  Tests that need
something more specific can then go into 00X_.pl.  There is a balance
to strike though: there is a single backend now, with at most one more on the
horizon which is yet to be decided upon, so making this too generic may end up
overcomplicating test writing. Do you have any concrete ideas?

> +   if (defined($openssl))
> +   {
> +   copy_files("ssl/server-*.crt", $pgdata);
> +   copy_files("ssl/server-*.key", $pgdata);
> +   chmod(0600, glob "$pgdata/server-*.key") or die $!;
> +   copy_files("ssl/root+client_ca.crt", $pgdata);
> +   copy_files("ssl/root_ca.crt",$pgdata);
> +   copy_files("ssl/root+client.crl",$pgdata);
> +   mkdir("$pgdata/root+client-crldir");
> +   copy_files("ssl/root+client-crldir/*",
> "$pgdata/root+client-crldir/");
> +   }
> +   elsif (defined($nss))
> +   {
> +   RecursiveCopy::copypath("ssl/nss", $pgdata . "/nss") if -e
> "ssl/nss";
> +   }
> This had better be in its own callback, for example.

Yes, this one is a clearer case, fixed in the v2 patch which will be posted on
the separate thread.

--
Daniel Gustafsson   https://vmware.com/





Re: pgsql: Move tablespace path re-creation from the makefiles to pg_regres

2021-03-24 Thread Michael Paquier
On Wed, Mar 24, 2021 at 10:50:50AM -0400, Robert Haas wrote:
> On Wed, Mar 24, 2021 at 5:56 AM Christoph Berg  wrote:
>> Maybe creating the tablespace directory from within the testsuite
>> would suffice?
>>
>> CREATE TABLE foo (t text);
>> COPY foo FROM PROGRAM 'mkdir @testtablespace@';
>> CREATE TABLESPACE regress_tblspace LOCATION '@testtablespace@';
> 
> Would that work on Windows?

I doubt that all the Windows environments would be able to get their
hands on that.  And I am not sure either that this would work when it
comes to the CI case, again on Windows.
--
Michael


signature.asc
Description: PGP signature


Re: MultiXact\SLRU buffers configuration

2021-03-24 Thread Thomas Munro
Hi Andrey,

On Sat, Mar 13, 2021 at 1:44 AM Andrey Borodin  wrote:
> [v10]

+intmultixact_offsets_slru_buffers = 8;
+intmultixact_members_slru_buffers = 16;
+intsubtrans_slru_buffers = 32;
+intnotify_slru_buffers = 8;
+intserial_slru_buffers = 16;
+intclog_slru_buffers = 0;
+intcommit_ts_slru_buffers = 0;

I don't think we should put "slru" (the name of the buffer replacement
algorithm, implementation detail) in the GUC names.

+It defaults to 0, in this case CLOG size is taken as
shared_buffers / 512.

We already know that increasing the number of CLOG buffers above the
current number hurts as the linear search begins to dominate
(according to the commit message for 5364b357), and it doesn't seem
great to ship a new feature that melts your CPU when you turn it up.
Perhaps, to ship this, we need to introduce a buffer mapping table?  I
have attached a "one coffee" attempt at that, on top of your v10 patch
(unmodified), for discussion.  It survives basic testing but I don't
know how it performs.
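
For anyone not familiar with the term, the point of a buffer mapping table is
to replace the linear scan over all buffer slots with a hash lookup keyed by
page number.  A minimal sketch of that general technique using dynahash
(utils/hsearch.h) might look as follows; this is only an illustration, not
code taken from the attached patch:

#include <string.h>
#include "utils/hsearch.h"

typedef struct SlruMappingEntry
{
	int			pageno;		/* hash key: SLRU page number */
	int			slotno;		/* buffer slot currently holding that page */
} SlruMappingEntry;

static HTAB *SlruMappingTable;

static void
slru_mapping_init(int nslots)
{
	HASHCTL		hctl;

	memset(&hctl, 0, sizeof(hctl));
	hctl.keysize = sizeof(int);
	hctl.entrysize = sizeof(SlruMappingEntry);
	SlruMappingTable = hash_create("SLRU page mapping", nslots,
								   &hctl, HASH_ELEM | HASH_BLOBS);
}

static int
slru_mapping_lookup(int pageno)
{
	SlruMappingEntry *entry = hash_search(SlruMappingTable, &pageno,
										  HASH_FIND, NULL);

	return entry ? entry->slotno : -1;	/* -1: not cached, caller does the I/O */
}

The hard part is of course keeping such a table in sync with victim selection
and eviction, which is where most of the complexity (and locking) would live.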
From 4817d16cfb6704d43a7bef12648e753d239c809c Mon Sep 17 00:00:00 2001
From: Andrey Borodin 
Date: Mon, 15 Feb 2021 21:51:56 +0500
Subject: [PATCH v10 1/2] Make all SLRU buffer sizes configurable

---
 doc/src/sgml/config.sgml  | 108 ++
 src/backend/access/transam/clog.c |   6 +
 src/backend/access/transam/commit_ts.c|   5 +-
 src/backend/access/transam/multixact.c|   8 +-
 src/backend/access/transam/subtrans.c |   5 +-
 src/backend/commands/async.c  |   8 +-
 src/backend/storage/lmgr/predicate.c  |   4 +-
 src/backend/utils/init/globals.c  |   8 ++
 src/backend/utils/misc/guc.c  |  77 +
 src/backend/utils/misc/postgresql.conf.sample |  16 +++
 src/include/access/multixact.h|   4 -
 src/include/access/subtrans.h |   3 -
 src/include/commands/async.h  |   5 -
 src/include/miscadmin.h   |   8 ++
 src/include/storage/predicate.h   |   4 -
 15 files changed, 240 insertions(+), 29 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ddc6d789d8..0adcf0efaf 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1886,6 +1886,114 @@ include_dir 'conf.d'

   
  
+ 
+
+  multixact_offsets_slru_buffers (integer)
+  
+   multixact_offsets_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact offsets. MultiXact offsets
+are used to store information about offsets of multiple row lockers (caused by SELECT FOR UPDATE and others).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+
+
+  multixact_members_slru_buffers (integer)
+  
+   multixact_members_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for MultiXact members. MultiXact members
+are used to store information about XIDs of multiple row lockers. Typically multixact_members_slru_buffers
+is twice more than multixact_offsets_slru_buffers.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  subtrans_slru_buffers (integer)
+  
+   subtrans_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for subtransactions.
+It defaults to 256 kilobytes (256KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  notify_slru_buffers (integer)
+  
+   notify_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for asyncronous notifications (NOTIFY, LISTEN).
+It defaults to 64 kilobytes (64KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  serial_slru_buffers (integer)
+  
+   serial_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for predicate locks.
+It defaults to 128 kilobytes (128KB).
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  clog_slru_buffers (integer)
+  
+   clog_slru_buffers configuration parameter
+  
+  
+  
+   
+Specifies the amount of memory to be used for CLOG.
+It defaults to 0, in this case CLOG size is taken as shared_buffers / 512.
+This parameter can only be set at server start.
+   
+  
+ 
+ 
+
+  commit_ts_slru_buffers (integer)
+  
+

Re: pg_amcheck contrib application

2021-03-24 Thread Robert Haas
Mark,

Here's a quick and very dirty sketch of what I think perhaps this
logic could look like. This is pretty much untested and it might be
buggy, but at least you can see whether we're thinking at all in the
same direction.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


very-rough-visibility-ideas.patch
Description: Binary data


Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread Peter Eisentraut

On 24.03.21 18:58, John Naylor wrote:
>  > As a potential follow-up, should we perhaps add named arguments?  That
>  > might make the invocations easier to read, depending on taste.
>
> I think it's quite possible some users will prefer that. All we need is
> to add something like
>
> proargnames => '{bin_width,input,origin}'
>
> to the catalog, right?

right, plus some documentation adjustments perhaps

> Also, I noticed that I put in double semicolons in the new functions
> somehow. I'll fix that as well.

I have fixed that.




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-24, Justin Pryzby wrote:

> On Wed, Mar 24, 2021 at 05:15:46PM +, Dean Rasheed wrote:

> > Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
> > when used to describe statistics. You could use "single-column" and
> > "multi-column", but then "column" isn't really right anymore, since it
> > might be a column or an expression. I can't think of any other terms
> > that fit.

Agreed.  If we need to define the term, we can spend a sentence or two
in that.

> We already use "multivariate", just not in create-statistics.sgml

> So I think the answer is for create-statistics to expose that word in a
> user-facing way in its reference to multivariate-statistics-examples.

+1

-- 
Álvaro Herrera   Valdivia, Chile




Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread Peter Eisentraut

On 24.03.21 18:25, Erik Rijkers wrote:
> > On 2021.03.24. 16:38 Peter Eisentraut  wrote:
> >
> > Committed.
>
> 'In cases full units' seems strange.

fixed, thanks




Re: making update/delete of inheritance trees scale better

2021-03-24 Thread Tom Lane
I wrote:
> Yeah, it's on my to-do list for this CF, but I expect it's going to
> take some concentrated study and other things keep intruding :-(.

I've started to look at this seriously, and I wanted to make a note
about something that I think we should try to do, but there seems
little hope of shoehorning it in for v14.

That "something" is that ideally, the ModifyTable node should pass
only the updated column values to the table AM or FDW, and let that
lower-level code worry about reconstructing a full tuple by
re-fetching the unmodified columns.  When I first proposed this
concept, I'd imagined that maybe we could avoid the additional tuple
read that this implementation requires by combining it with the
tuple access that a heap UPDATE must do anyway to lock and outdate
the target tuple.  Another example of a possible win is Andres'
comment upthread that a columnar-storage AM would really rather
deal only with the modified columns.  Also, the patch as it stands
asks FDWs to supply all columns in a whole-row junk var, which is
something that might become unnecessary.

However, there are big stumbling blocks in the way.  ModifyTable
is responsible for applying CHECK constraints, which may require
looking at the values of not-modified columns.  An even bigger
problem is that per-row triggers currently expect to be given
the whole proposed row (and to be able to change columns not
already marked for update).  We could imagine redefining the
trigger APIs to reduce the overhead here, but that's certainly
not going to happen in time for v14.

So for now I think we have to stick with Amit's design of
reconstructing the full updated tuple at the outset in
ModifyTable, and then proceeding pretty much as updates
have done in the past.  But there's more we can do later.

regards, tom lane




Re: TRUNCATE on foreign table

2021-03-24 Thread Fujii Masao




On 2021/03/13 18:57, Kazutaka Onishi wrote:

I have fixed the patch to pass check-world test. :D


Thanks for updating the patch! Here are some review comments from me.


  By default all foreign tables using postgres_fdw are assumed
  to be updatable.  This may be overridden using the following option:

In postgres-fdw.sgml, shouldn't "and truncatable" be appended to
the above first description? Also, shouldn't "option" in the second
description be the plural form "options"?


 TRUNCATE is not currently supported for foreign tables.
 This implies that if a specified table has any descendant tables that are
 foreign, the command will fail.

truncate.sgml should be updated because, for example, it contains
the above descriptions.


+ frels_extra is same length with
+ frels_list, that delivers extra information of
+ the context where the foreign-tables are truncated.
+

Don't we need to document the detail information about frels_extra?
Otherwise the developers of FDW would fail to understand how to
handle the frels_extra when trying to make their FDWs support TRUNCATE.


+   relids_extra = lappend_int(relids_extra, (recurse ? 0 : 1));
+   relids_extra = lappend_int(relids_extra, -1);

postgres_fdw determines whether to specify ONLY or not by checking
whether the passed extra value is zero or not. That is, for example,
using only 0 and 1 for extra values is enough for the purpose. But
ExecuteTruncate() sets three values 0, -1 and 1 as extra ones. Why are
these three values necessary?


With the patch, if both local and foreign tables are specified as
the target tables to truncate, the TRUNCATE command tries to truncate
the foreign tables after truncating the local ones. That is, if the
"truncatable" option is set to false or sufficient permission to truncate
has not been granted yet on the foreign server, an error will be thrown
after the local tables have already been truncated. I don't think this is
a good order of processing. IMO, instead, we should check whether the
foreign tables can be truncated before performing any actual truncation.
For example, we could easily do that by truncating the foreign tables
before the local ones. Thoughts?


XLOG_HEAP_TRUNCATE record is written even for the truncation of
a foreign table. Why is this necessary?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pl/pgsql feature request: shorthand for argument and local variable references

2021-03-24 Thread Pavel Stehule
On Wed, 24 Mar 2021 at 8:42, Erik Rijkers  wrote:

> > On 2021.03.24. 08:09 Pavel Stehule  wrote:
> > [...]
> > [plpgsql-routine-label-20210324.patch]
>
>
> Hi,
>
> In that sgml
>
> "the functions' arguments"
>
> should be
>
> "the function's arguments"
>

fixed,

thank you for checking



>
> Erik
>
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 52f60c827c..3643521091 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -292,7 +292,9 @@ $$ LANGUAGE plpgsql;
   special variables such as FOUND (see
   ).  The outer block is
   labeled with the function's name, meaning that parameters and special
-  variables can be qualified with the function's name.
+  variables can be qualified with the function's name. The name of this outer
+  block can be changed by inserting special command 
+  #routine_label new_name at the start of the function.
  
 
 
@@ -435,6 +437,31 @@ $$ LANGUAGE plpgsql;
  
 
 
+
+ The function's argument can be qualified with the function name:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+BEGIN
+RETURN sales_tax.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
+
+ Sometimes the function name is too long and it can be more practical to use
+ some shorter label. The top namespace label can be changed (along with
+ the function's arguments) using the option routine_label:
+
+CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$
+#routine_label s
+BEGIN
+RETURN s.subtotal * 0.06;
+END;
+$$ LANGUAGE plpgsql;
+
+
+
  
   Some more examples:
 
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index ce8d97447d..d32e050c32 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo,
 	 */
 	plpgsql_ns_init();
 	plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK);
+
+	/* save top ns for possibility to alter top label */
+	function->root_ns = plpgsql_ns_top();
+
 	plpgsql_DumpExecTree = false;
 	plpgsql_start_datums();
 
diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index 919b968826..7132da35d1 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -101,6 +101,7 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name)
 	nse->itemtype = itemtype;
 	nse->itemno = itemno;
 	nse->prev = ns_top;
+	nse->alias = NULL;
 	strcpy(nse->name, name);
 	ns_top = nse;
 }
@@ -141,7 +142,7 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode,
 			 nsitem->itemtype != PLPGSQL_NSTYPE_LABEL;
 			 nsitem = nsitem->prev)
 		{
-			if (strcmp(nsitem->name, name1) == 0)
+			if (strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0)
 			{
 if (name2 == NULL ||
 	nsitem->itemtype != PLPGSQL_NSTYPE_VAR)
@@ -155,13 +156,13 @@ plpgsql_ns_lookup(PLpgSQL_nsitem *ns_cur, bool localmode,
 
 		/* Check this level for qualified match to variable name */
 		if (name2 != NULL &&
-			strcmp(nsitem->name, name1) == 0)
+			strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name1) == 0)
 		{
 			for (nsitem = ns_cur;
  nsitem->itemtype != PLPGSQL_NSTYPE_LABEL;
  nsitem = nsitem->prev)
 			{
-if (strcmp(nsitem->name, name2) == 0)
+if (strcmp(nsitem->alias ? nsitem->alias : nsitem->name, name2) == 0)
 {
 	if (name3 == NULL ||
 		nsitem->itemtype != PLPGSQL_NSTYPE_VAR)
@@ -197,7 +198,7 @@ plpgsql_ns_lookup_label(PLpgSQL_nsitem *ns_cur, const char *name)
 	while (ns_cur != NULL)
 	{
 		if (ns_cur->itemtype == PLPGSQL_NSTYPE_LABEL &&
-			strcmp(ns_cur->name, name) == 0)
+			strcmp(ns_cur->alias ? ns_cur->alias : ns_cur->name, name) == 0)
 			return ns_cur;
 		ns_cur = ns_cur->prev;
 	}
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 34e0520719..f3536d75ae 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -333,6 +333,7 @@ static	void			check_raise_parameters(PLpgSQL_stmt_raise *stmt);
 %token 	K_RETURNED_SQLSTATE
 %token 	K_REVERSE
 %token 	K_ROLLBACK
+%token 	K_ROUTINE_LABEL
 %token 	K_ROW_COUNT
 %token 	K_ROWTYPE
 %token 	K_SCHEMA
@@ -393,6 +394,24 @@ comp_option		: '#' K_OPTION K_DUMP
 	{
 		plpgsql_curr_compile->resolve_option = PLPGSQL_RESOLVE_COLUMN;
 	}
+| '#' K_ROUTINE_LABEL any_identifier
+	{
+		if (!plpgsql_curr_compile->root_ns)
+			ereport(ERROR,
+	(errcode(ERRCODE_SYNTAX_ERROR),
+	 errmsg("The command \"routine_label\" cannot be used in inline block"),
+	 parser_errposition(@1)));
+
+		/* Don't allow more top labels in routine */
+		if (plpgsql_curr_compile->root_ns->alias

Re: multi-install PostgresNode

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-24, Andrew Dunstan wrote:

> OK, like this?

Yeah, looks good!

> +# Private routine to return a copy of the environment with the PATH and 
> (DY)LD_LIBRARY_PATH
> +# correctly set when there is an install path set for the node.
> +# Routines that call Postgres binaries need to call this routine like this:
> +#
> +#local %ENV = $self->_get_install_env{[%extra_settings]);
> +#
> +# A copy of the environmnt is taken and node's host and port settings are 
> added
> +# as PGHOST and PGPORT, Then the extra settings (if any) are applied. Any 
> setting
> +# in %extra_settings with a value that is undefined is deleted; the 
> remainder are
> +# set. Then the PATH and (DY)LD_LIBRARY_PATH are adjusted if the node's 
> install path
> +# is set, and the copy environment is returned.

There's a typo "environmnt" here, and a couple of lines appear
overlength.

> +sub _get_install_env

I'd use a name that doesn't have "install" in it -- maybe _get_env or
_get_postgres_env or _get_PostgresNode_env -- but don't really care too
much about it.


> +# The install path set in get_new_node needs to be a directory containing
> +# bin and lib subdirectories as in a standard PostgreSQL installation, so 
> this
> +# can't be used with installations where the bin and lib directories don't 
> have
> +# a common parent directory.

I've never heard of an installation where that wasn't true.  If there
was a need for that, seems like it'd be possible to set them with
{ bindir => ..., libdir => ...} but I doubt it'll ever be necessary.

-- 
Álvaro Herrera   Valdivia, Chile
"We're here to devour each other alive"(Hobbes)




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Tom Lane
Justin Pryzby  writes:
> On Wed, Mar 24, 2021 at 01:30:26PM -0400, Robert Haas wrote:
>> ... So --no-toast-compression is just fine for people who are
>> dumping and restoring, but it's no help at all if you want to switch
>> TOAST compression methods while doing a pg_upgrade. However, what does
>> help with that is sticking with what Tom committed before rather than
>> changing to what he's proposing now.

> I don't know what/any other cases support using pg_upgrade to change stuff like
> the example (changing to lz4).  The way to do it is to make the changes either
> before or after.  It seems weird to think that pg_upgrade would handle that.

Yeah; I think the charter of pg_upgrade is to reproduce the old database
state.  If you try to twiddle the process to incorporate some changes
in that state, maybe it will work, but if it breaks you get to keep both
pieces.  I surely don't wish to consider such shenanigans as supported.

But let's ignore the case of pg_upgrade and just consider a dump/restore.
I'd still say that unless you give --no-toast-compression then I would
expect the dump/restore to preserve the tables' old compression behavior.
Robert's argument that the pre-v14 database had no particular compression
behavior seems nonsensical to me.  We know exactly which compression
behavior it has.

regards, tom lane




Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Stephen Frost
Greetings,

* Jacob Champion (pchamp...@vmware.com) wrote:
> On Wed, 2021-03-24 at 13:00 -0400, Stephen Frost wrote:
> > * Jacob Champion (pchamp...@vmware.com) wrote:
> > > Right, but to clarify -- I was asking if *NSS* supports loading and
> > > using separate certificate databases as part of its API. It seems like
> > > the internals make it possible, but I don't see the public interfaces
> > > to actually use those internals.
> > 
> > Yes, this is done using SECMOD_OpenUserDB, see:
> > 
> > https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/PKCS11_Functions#SECMOD_OpenUserDB
> 
> Ah, I had assumed that the DB-specific InitContext was using this
> behind the scenes; apparently not. I will give that a try, thanks!
> 
> > also there's info here:
> > 
> > https://groups.google.com/g/mozilla.dev.tech.crypto/c/Xz6Emfcue0E
> > 
> > We should document that, as mentioned in the link above, the NSS find
> > functions will find certs in all the opened databases.  As this would
> > all be under one application which is linked against libpq and passing
> > in different values for ssl_database for different connections, this
> > doesn't seem like it's really that much of an issue.
> 
> I could see this being a problem if two client certificate nicknames
> collide across multiple in-use databases, maybe?

Right, in such a case either cert might get returned and it's possible
that the "wrong" one is returned and therefore the connection would end
up failing, assuming that they aren't actually the same and just happen
to be in both.

Seems like we could use SECMOD_OpenUserDB() and then pass the result
from that into PK11_ListCertsInSlot() and scan through the certs in just
the specified database to find the one we're looking for if we really
feel compelled to try and address this risk.  I've reached out to the
NSS folks to see if they have any thoughts about the best way to address
this.
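
For illustration, a rough sketch of that lookup against the public NSS API;
the module-spec string, the error handling, and the wrapper function itself
are assumptions made for the sake of the example, not code from the patch:

#include <string.h>
#include <prprf.h>
#include <pk11pub.h>
#include <secmod.h>
#include <cert.h>

/* Assumes NSS itself has already been initialized by the caller. */
static CERTCertificate *
find_cert_in_db(const char *configdir, const char *nickname)
{
	char	   *spec;
	PK11SlotInfo *slot;
	CERTCertList *certs;
	CERTCertListNode *node;
	CERTCertificate *found = NULL;

	spec = PR_smprintf("configDir='%s' flags=readOnly", configdir);
	slot = SECMOD_OpenUserDB(spec);
	PR_smprintf_free(spec);
	if (slot == NULL)
		return NULL;

	certs = PK11_ListCertsInSlot(slot);
	if (certs != NULL)
	{
		for (node = CERT_LIST_HEAD(certs);
			 !CERT_LIST_END(node, certs);
			 node = CERT_LIST_NEXT(node))
		{
			if (node->cert->nickname &&
				strcmp(node->cert->nickname, nickname) == 0)
			{
				found = CERT_DupCertificate(node->cert);
				break;
			}
		}
		CERT_DestroyCertList(certs);
	}

	/* Don't close the slot if NSS deduplicated it onto the internal one. */
	if (!PK11_IsInternalKeySlot(slot))
		SECMOD_CloseUserDB(slot);

	return found;
}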

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread John Naylor
On Wed, Mar 24, 2021 at 1:25 PM Erik Rijkers  wrote:
>
> 'In cases full units' seems strange.
>
> Not a native speaker but maybe the attached changes are improvements?

-In cases full units (1 minute, 1 hour, etc.), it gives the same result as
+In case of full units (1 minute, 1 hour, etc.), it gives the same result as
 the analogous date_trunc call, but the difference is
 that date_bin can truncate to an arbitrary interval.


I would say "In the case of"


-The stride interval cannot contain units of month
+The stride interval cannot contain units of a month
 or larger.

The original seems fine to me here.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread John Naylor
On Wed, Mar 24, 2021 at 11:38 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> Committed.
>
> I noticed that some of the documentation disappeared between v9 and v10.
>   So I put that back and updated it appropriately.  I also added a few
> more test cases to cover some things that have been discussed during the
> course of this thread.

Thanks! I put off updating the documentation in case the latest approach
was not feature-rich enough.

> As a potential follow-up, should we perhaps add named arguments?  That
> might make the invocations easier to read, depending on taste.

I think it's quite possible some users will prefer that. All we need is to
add something like

proargnames => '{bin_width,input,origin}'

to the catalog, right?

Also, I noticed that I put in double semicolons in the new functions
somehow. I'll fix that as well.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [HACKERS] Custom compression methods

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 01:30:26PM -0400, Robert Haas wrote:
> On Wed, Mar 24, 2021 at 1:24 PM Justin Pryzby  wrote:
> > I think it's not specific to pg_upgrade, but any pg_dump |pg_restore.
> >
> > The analogy with tablespaces is restoring from a cluster where the tablespace
> > is named "vast" to one where it's named "huge".  I do this by running
> > PGOPTIONS=-cdefault_tablespace=huge pg_restore --no-tablespaces
> >
> > So I think as long as --no-toast-compression does the corresponding thing, the
> > "restore with alternate compression" case is handled fine.
> 
> I think you might be missing the point. If you're using pg_dump and
> pg_restore, you can pass --no-toast-compression if you want. But if

Actually, I forgot that pg_restore doesn't (can't) have --no-toast-compression.
So my analogy is broken.

> you're using pg_upgrade, and it's internally calling pg_dump
> --binary-upgrade, then you don't have control over what options get
> passed. So --no-toast-compression is just fine for people who are
> dumping and restoring, but it's no help at all if you want to switch
> TOAST compression methods while doing a pg_upgrade. However, what does
> help with that is sticking with what Tom committed before rather than
> changing to what he's proposing now.

I don't know what/any other cases support using pg_upgrade to change stuff like
the example (changing to lz4).  The way to do it is to make the changes either
before or after.  It seems weird to think that pg_upgrade would handle that.

I'm going to risk making the analogy that it's not supported to pass
--no-tablespaces from pg_upgrade to pg_dump/restore.  It certainly can't work
for --link (except in the weird case that the dirs are on the same filesystem). 
 

-- 
Justin




Re: multi-install PostgresNode

2021-03-24 Thread Andrew Dunstan

On 3/24/21 11:41 AM, Alvaro Herrera wrote:
> On 2021-Mar-24, Andrew Dunstan wrote:
>
>> On 3/24/21 9:23 AM, Andrew Dunstan wrote:
>>> On 3/24/21 8:29 AM, Alvaro Herrera wrote:
>>> If we're going to do that we probably shouldn't special case any
>>> particular settings, but simply take any extra arguments as extra env
>>> settings. And if any setting has undef (e.g. PGAPPNAME) we should unset it.
>> like this.
> Hmm, I like that PGAPPNAME handling has resulted in an overall
> simplification.  I'm not sure why you prefer to keep PGHOST and PGPORT
> handled individually at each callsite however; why not do it like
> _install, and add them to the environment always?  I doubt there's
> anything that requires them *not* to be set; and if there is, it's easy
> to make the claim that that's broken and should be fixed.
>
> I'm just saying that cluttering _get_install_env() with those two
> settings would result in less clutter overall.
>



OK, like this?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 97e05993be..a1473d220f 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -355,6 +355,7 @@ sub info
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
+	print $fh "Install Path: " , $self->{_install_path} . "\n" if $self->{_install_path};
 	close $fh or die;
 	return $_info;
 }
@@ -428,6 +429,8 @@ sub init
 	my $pgdata = $self->data_dir;
 	my $host   = $self->host;
 
+	local %ENV = $self->_get_install_env();
+
 	$params{allows_streaming} = 0 unless defined $params{allows_streaming};
 	$params{has_archiving}= 0 unless defined $params{has_archiving};
 
@@ -555,6 +558,8 @@ sub backup
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name= $self->name;
 
+	local %ENV = $self->_get_install_env();
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail(
 		'pg_basebackup', '-D', $backup_path, '-h',
@@ -784,18 +789,15 @@ sub start
 
 	print("### Starting node \"$name\"\n");
 
-	{
-		# Temporarily unset PGAPPNAME so that the server doesn't
-		# inherit it.  Otherwise this could affect libpqwalreceiver
-		# connections in confusing ways.
-		local %ENV = %ENV;
-		delete $ENV{PGAPPNAME};
-
-		# Note: We set the cluster_name here, not in postgresql.conf (in
-		# sub init) so that it does not get copied to standbys.
-		$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
+	# Temporarily unset PGAPPNAME so that the server doesn't
+	# inherit it.  Otherwise this could affect libpqwalreceiver
+	# connections in confusing ways.
+	local %ENV = $self->_get_install_env(PGAPPNAME => undef);
+
+	# Note: We set the cluster_name here, not in postgresql.conf (in
+	# sub init) so that it does not get copied to standbys.
+	$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
 			$self->logfile, '-o', "--cluster-name=$name", 'start');
-	}
 
 	if ($ret != 0)
 	{
@@ -826,6 +828,9 @@ sub kill9
 	my ($self) = @_;
 	my $name = $self->name;
 	return unless defined $self->{_pid};
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Killing node \"$name\" using signal 9\n";
 	# kill(9, ...) fails under msys Perl 5.8.8, so fall back on pg_ctl.
 	kill(9, $self->{_pid})
@@ -852,6 +857,9 @@ sub stop
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $name   = $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	$mode = 'fast' unless defined $mode;
 	return unless defined $self->{_pid};
 	print "### Stopping node \"$name\" using mode $mode\n";
@@ -874,6 +882,9 @@ sub reload
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $name   = $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Reloading node \"$name\"\n";
 	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, 'reload');
 	return;
@@ -895,15 +906,12 @@ sub restart
 	my $logfile = $self->logfile;
 	my $name= $self->name;
 
-	print "### Restarting node \"$name\"\n";
+	local %ENV = $self->_get_install_env(PGAPPNAME => undef);
 
-	{
-		local %ENV = %ENV;
-		delete $ENV{PGAPPNAME};
+	print "### Restarting node \"$name\"\n";
 
-		TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
 			'restart');
-	}
 
 	$self->_update_pid(1);
 	return;
@@ -924,6 +932,9 @@ sub promote
 	my $pgdata  = $self->data_dir;
 	my $logfile = $self->logfile;
 	my $name= $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Promoting node \"$name\"\n";
 	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
 		'promote');
@@ -945,6 +956,9 @@ sub logrotate
 	my $pgdata  = $self->data_dir;
 	my $logfile = $self->logfile;
 	my $name= $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Rotating 

Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Jacob Champion
On Wed, 2021-03-24 at 13:00 -0400, Stephen Frost wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > Right, but to clarify -- I was asking if *NSS* supports loading and
> > using separate certificate databases as part of its API. It seems like
> > the internals make it possible, but I don't see the public interfaces
> > to actually use those internals.
> 
> Yes, this is done using SECMOD_OpenUserDB, see:
> 
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/PKCS11_Functions#SECMOD_OpenUserDB

Ah, I had assumed that the DB-specific InitContext was using this
behind the scenes; apparently not. I will give that a try, thanks!

> also there's info here:
> 
> https://groups.google.com/g/mozilla.dev.tech.crypto/c/Xz6Emfcue0E
> 
> We should document that, as mentioned in the link above, the NSS find
> functions will find certs in all the opened databases.  As this would
> all be under one application which is linked against libpq and passing
> in different values for ssl_database for different connections, this
> doesn't seem like it's really that much of an issue.

I could see this being a problem if two client certificate nicknames
collide across multiple in-use databases, maybe?

--Jacob


Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 1:24 PM Justin Pryzby  wrote:
> I think it's not specific to pg_upgrade, but any pg_dump |pg_restore.
>
> The analogy with tablespaces is restoring from a cluster where the tablespace
> is named "vast" to one where it's named "huge".  I do this by running
> PGOPTIONS=-cdefault_tablespace=huge pg_restore --no-tablespaces
>
> So I think as long as --no-toast-compression does the corresponding thing, the
> "restore with alternate compression" case is handled fine.

I think you might be missing the point. If you're using pg_dump and
pg_restore, you can pass --no-toast-compression if you want. But if
you're using pg_upgrade, and it's internally calling pg_dump
--binary-upgrade, then you don't have control over what options get
passed. So --no-toast-compression is just fine for people who are
dumping and restoring, but it's no help at all if you want to switch
TOAST compression methods while doing a pg_upgrade. However, what does
help with that is sticking with what Tom committed before rather than
changing to what he's proposing now.

If you like his current proposal, that's fine with me, as long as
we're on the same page about what happens if we adopt it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 12:14 PM Dilip Kumar  wrote:
> Actually, we are already doing this, I mean ALTER TABLE .. ALTER
> COLUMN .. SET COMPRESSION is already updating the compression method
> of the index attribute.  So 0003 doesn't make sense, sorry for the
> noise.  However, 0001 and 0002 are still valid, or do you think that
> we don't want 0001 also?  If we don't need 0001 also then we need to
> update the test output for 0002 slightly.

It seems to me that 0002 is still not right. We can't fix the
attcompression to whatever the default is at the time the index is
created, because the default can be changed later, and there's no way
to fix index afterward. I mean, it would be fine to do it that way if
we were going to go with the other model, where the index state is
separate from the table state, either can be changed independently,
and it all gets dumped and restored. But, as it is, I think we should
be deciding how to compress new values for an expression column based
on the default_toast_compression setting at the time of compression,
not the time of index creation.
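
For what it's worth, a minimal sketch of that "decide at compression time"
behavior, assuming the existing CompressionMethodIsValid() test, the
default_toast_compression GUC, and toast_compress_datum(); the wrapper
function itself is purely illustrative:

/* Illustrative only: leave attcompression unset and resolve it lazily. */
static Datum
compress_value_for_column(Datum value, char attcompression)
{
	char		cmethod = attcompression;

	if (!CompressionMethodIsValid(cmethod))
		cmethod = default_toast_compression;	/* GUC value right now */

	return toast_compress_datum(value, cmethod);
}

That way a column whose compression was never set explicitly keeps following
the GUC, rather than freezing whatever the default happened to be at the time
the index was created.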

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread Erik Rijkers
> On 2021.03.24. 16:38 Peter Eisentraut  
> wrote:

> 
> Committed.
> 

'In cases full units' seems strange.

Not a native speaker but maybe the attached changes are improvements?


Erik Rijkers

--- ./doc/src/sgml/func.sgml.orig	2021-03-24 18:16:01.269515354 +0100
+++ ./doc/src/sgml/func.sgml	2021-03-24 18:18:31.695819520 +0100
@@ -9907,13 +9907,13 @@

 

-In cases full units (1 minute, 1 hour, etc.), it gives the same result as
+In case of full units (1 minute, 1 hour, etc.), it gives the same result as
 the analogous date_trunc call, but the difference is
 that date_bin can truncate to an arbitrary interval.

 

-The stride interval cannot contain units of month
+The stride interval cannot contain units of a month
 or larger.

   


Re: [HACKERS] Custom compression methods

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 12:24:38PM -0400, Robert Haas wrote:
> On Wed, Mar 24, 2021 at 11:42 AM Tom Lane  wrote:
> > On reflection, though, I wonder if we've made pg_dump do the right
> > thing anyway.  There is a strong case to be made for the idea that
> > when dumping from a pre-14 server, it should emit
> > SET default_toast_compression = 'pglz';
> > rather than omitting any mention of the variable, which is what
> > I made it do in aa25d1089.  If we changed that, I think all these
> > diffs would go away.  Am I right in thinking that what's being
> > compared here is new pg_dump's dump from old server versus new
> > pg_dump's dump from new server?
> >
> > The "strong case" goes like this: initdb a v14 cluster, change
> > default_toast_compression to lz4 in its postgresql.conf, then
> > try to pg_upgrade from an old server.  If the dump script doesn't
> > set default_toast_compression = 'pglz' then the upgrade will
> > do the wrong thing because all the tables will be recreated with
> > a different behavior than they had before.  IIUC, this wouldn't
> > result in broken data, but it still seems to me to be undesirable.
> > dump/restore ought to do its best to preserve the old DB state,
> > unless you explicitly tell it --no-toast-compression or the like.
> 
> This feels a bit like letting the tail wag the dog, because one might
> reasonably guess that the user's intention in such a case was to
> switch to using LZ4, and we've subverted that intention by deciding
> that we know better. I wouldn't blame someone for thinking that using
> --no-toast-compression with a pre-v14 server ought to have no effect,
> but with your proposal here, it would. Furthermore, IIUC, the user has
> no way of passing --no-toast-compression through to pg_upgrade, so
> they're just going to have to do the upgrade and then fix everything
> manually afterward to the state that they intended to have all along.
> Now, on the other hand, if they wanted to make practically any other
> kind of change while upgrading, they'd have to do something like that
> anyway, so I guess this is no worse.

I think it's not specific to pg_upgrade, but any pg_dump |pg_restore.

The analogy with tablespaces is restoring from a cluster where the tablespace
is named "vast" to one where it's named "huge".  I do this by running
PGOPTIONS=-cdefault_tablespace=huge pg_restore --no-tablespaces

So I think as long as --no-toast-compression does the corresponding thing, the
"restore with alternate compression" case is handled fine.

-- 
Justin




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 12:45 PM Tom Lane  wrote:
> I wouldn't be proposing this if the xversion failures were the only
> reason; making them go away is just a nice side-effect.  The core
> point is that the charter of pg_dump is to reproduce the source
> database's state, and as things stand we're failing to ensure we
> do that.

Well, that state is just a mental construct, right? In reality, there
is no such state stored anywhere in the old database. You're choosing
to attribute to it an implicit state that matches what would need to
be configured in the newer version to get the same behavior, which is
a reasonable thing to do, but it is an interpretive choice rather than
a bare fact.

I don't care very much if you want to change this, but to me it seems
slightly worse than the status quo. It's hard to imagine that someone
is going to create a new cluster, set the default to lz4, run
pg_upgrade, and then complain that the new columns ended up with lz4
as the default. It seems much more likely that they're going to
complain if the new columns *don't* end up with lz4 as the default.
And I also can't see any other scenario where imagining that the TOAST
compression property of the old database simply does not exist, rather
than being pglz implicitly, is worse.

But I could be wrong, and even if I'm right it's not a hill upon which
I wish to die.

--
Robert Haas
EDB: http://www.enterprisedb.com




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 05:15:46PM +, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
>  wrote:
> >
> > As for the changes proposed in the create_statistics, do we really want
> > to use univariate / multivariate there? Yes, the terms are correct, but
> > I'm not sure how many people looking at CREATE STATISTICS will
> > understand them.
> >
> 
> Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
> when used to describe statistics. You could use "single-column" and
> "multi-column", but then "column" isn't really right anymore, since it
> might be a column or an expression. I can't think of any other terms
> that fit.

We already use "multivariate", just not in create-statistics.sgml

doc/src/sgml/perform.sgml:multivariate statistics, which can capture
doc/src/sgml/perform.sgml:it's impractical to compute multivariate statistics automatically.
doc/src/sgml/planstats.sgml: 
doc/src/sgml/planstats.sgml:   multivariate
doc/src/sgml/planstats.sgml:multivariate statistics on the two columns:
doc/src/sgml/planstats.sgml:  
doc/src/sgml/planstats.sgml:But without multivariate statistics, the estimate for the number of
doc/src/sgml/planstats.sgml:This section introduces multivariate variant of MCV
doc/src/sgml/ref/create_statistics.sgml:  and .
doc/src/sgml/release-13.sgml:2020-01-13 [eae056c19] Apply multiple multivariate MCV lists when possible

So I think the answer is for create-statistics to expose that word in a
user-facing way in its reference to multivariate-statistics-examples.

-- 
Justin




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 16:48, Tomas Vondra
 wrote:
>
> As for the changes proposed in the create_statistics, do we really want
> to use univariate / multivariate there? Yes, the terms are correct, but
> I'm not sure how many people looking at CREATE STATISTICS will
> understand them.
>

Hmm, I think "univariate" and "multivariate" are pretty ubiquitous,
when used to describe statistics. You could use "single-column" and
"multi-column", but then "column" isn't really right anymore, since it
might be a column or an expression. I can't think of any other terms
that fit.

Regards,
Dean




Re: Change default of checkpoint_completion_target

2021-03-24 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 3/23/21, 12:19 PM, "Stephen Frost"  wrote:
> > * Bossart, Nathan (bossa...@amazon.com) wrote:
> > > LGTM.  I just have a few small wording suggestions.
> >
> > Agreed, those looked like good suggestions and so I've incorporated
> > them.
> >
> > Updated patch attached.
> 
> Looks good!

Great, pushed!  Thanks to everyone for your thoughts, comments,
suggestions, and improvements.

Stephen


signature.asc
Description: PGP signature


Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Stephen Frost
Greetings Jacob,

* Jacob Champion (pchamp...@vmware.com) wrote:
> On Wed, 2021-03-24 at 09:28 +0900, Michael Paquier wrote:
> > On Wed, Mar 24, 2021 at 12:05:35AM +, Jacob Champion wrote:
> > > I can work around it temporarily for the
> > > tests, but this will be a problem if any libpq clients load up multiple
> > > independent databases for use with separate connections. Anyone know if
> > > this is a supported use case for NSS?
> > 
> > Are you referring to the case of threading here?  This should be a
> > supported case, as threads created by an application through libpq
> > could perfectly use completely different connection strings.
> Right, but to clarify -- I was asking if *NSS* supports loading and
> using separate certificate databases as part of its API. It seems like
> the internals make it possible, but I don't see the public interfaces
> to actually use those internals.

Yes, this is done using SECMOD_OpenUserDB, see:

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/PKCS11_Functions#SECMOD_OpenUserDB

also there's info here:

https://groups.google.com/g/mozilla.dev.tech.crypto/c/Xz6Emfcue0E

We should document that, as mentioned in the link above, the NSS find
functions will find certs in all the opened databases.  As this would
all be under one application which is linked against libpq and passing
in different values for ssl_database for different connections, this
doesn't seem like it's really that much of an issue.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra



On 3/24/21 5:28 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
>  wrote:
>>
>> AFAIK the primary issue here is that the two places disagree. While
>> estimate_num_groups does this
>>
>> varnos = pull_varnos(root, (Node *) varshere);
>> if (bms_membership(varnos) == BMS_SINGLETON)
>> { ... }
>>
>> the add_unique_group_expr does this
>>
>> varnos = pull_varnos(root, (Node *) groupexpr);
>>
>> That is, one looks at the group expression, while the other look at vars
>> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
>> this can differ, causing the crash.
>>
>> So we need to change one of those places - my fix tweaked the second
>> place to also look at the vars, but maybe we should change the other
>> place? Or maybe it's not the right fix for PlaceHolderVars ...
>>
> 
> I think that it doesn't make any difference which place is changed.
> 
> This is a case of an expression with no stats. With your change,
> you'll get a single GroupExprInfo containing a list of
> VariableStatData's for each of it's Var's, whereas if you changed it
> the other way, you'd get a separate GroupExprInfo for each Var. But I
> think they'd both end up being treated the same by
> estimate_multivariate_ndistinct(), since there wouldn't be any stats
> matching the expression, only the individual Var's. Maybe changing the
> first place would be the more bulletproof fix though.
> 

Yeah, I think that's true. I'll do a bit more research / experiments.

As for the changes proposed in the create_statistics, do we really want
to use univariate / multivariate there? Yes, the terms are correct, but
I'm not sure how many people looking at CREATE STATISTICS will
understand them.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposal: Save user's original authenticated identity for logging

2021-03-24 Thread Jacob Champion
On Tue, 2021-03-23 at 14:21 +0900, Michael Paquier wrote:
> I am not really sure that we need to bother about the ordering of the
> entries here, as long as we check for all of them within the same
> fragment of the log file, so I would just go down to the simplest
> solution that I posted upthread that is enough to make sure that the
> verbosity is protected.  That's what we do elsewhere, like with
> command_checks_all() and such.
With low-coverage test suites, I think it's useful to allow as little
strange behavior as possible -- in this case, printing authorization
before authentication could signal a serious bug -- but I don't feel
too strongly about it.

v10 attached, which reverts to v8 test behavior, with minor updates to
the commit message and test comment.

--Jacob
From 0b8764ac61ef37809a79ded2596c5fcd2caf25bb Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 8 Feb 2021 10:53:20 -0800
Subject: [PATCH v10 1/2] ssl: store client's DN in port->peer_dn

Original patch by Andrew Dunstan:

https://www.postgresql.org/message-id/fd96ae76-a8e3-ef8e-a642-a592f5b76771%40dunslane.net

but I've taken out the clientname=DN functionality; all that will be
needed for the next patch is the DN string.
---
 src/backend/libpq/be-secure-openssl.c | 53 +++
 src/include/libpq/libpq-be.h  |  1 +
 2 files changed, 47 insertions(+), 7 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 5ce3f27855..18321703da 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -551,22 +551,25 @@ aloop:
 	/* Get client certificate, if available. */
 	port->peer = SSL_get_peer_certificate(port->ssl);
 
-	/* and extract the Common Name from it. */
+	/* and extract the Common Name / Distinguished Name from it. */
 	port->peer_cn = NULL;
+	port->peer_dn = NULL;
 	port->peer_cert_valid = false;
 	if (port->peer != NULL)
 	{
 		int			len;
+		X509_NAME  *x509name = X509_get_subject_name(port->peer);
+		char	   *peer_cn;
+		char	   *peer_dn;
+		BIO		   *bio = NULL;
+		BUF_MEM*bio_buf = NULL;
 
-		len = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		NID_commonName, NULL, 0);
+		len = X509_NAME_get_text_by_NID(x509name, NID_commonName, NULL, 0);
 		if (len != -1)
 		{
-			char	   *peer_cn;
-
 			peer_cn = MemoryContextAlloc(TopMemoryContext, len + 1);
-			r = X509_NAME_get_text_by_NID(X509_get_subject_name(port->peer),
-		  NID_commonName, peer_cn, len + 1);
+			r = X509_NAME_get_text_by_NID(x509name, NID_commonName, peer_cn,
+		  len + 1);
 			peer_cn[len] = '\0';
 			if (r != len)
 			{
@@ -590,6 +593,36 @@ aloop:
 
 			port->peer_cn = peer_cn;
 		}
+
+		bio = BIO_new(BIO_s_mem());
+		if (!bio)
+		{
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+		/* use commas instead of slashes */
+		X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_RFC2253);
+		BIO_get_mem_ptr(bio, &bio_buf);
+		peer_dn = MemoryContextAlloc(TopMemoryContext, bio_buf->length + 1);
+		memcpy(peer_dn, bio_buf->data, bio_buf->length);
+		peer_dn[bio_buf->length] = '\0';
+		if (bio_buf->length != strlen(peer_dn))
+		{
+			ereport(COMMERROR,
+	(errcode(ERRCODE_PROTOCOL_VIOLATION),
+	 errmsg("SSL certificate's distinguished name contains embedded null")));
+			BIO_free(bio);
+			pfree(peer_dn);
+			pfree(port->peer_cn);
+			port->peer_cn = NULL;
+			return -1;
+		}
+
+		BIO_free(bio);
+
+		port->peer_dn = peer_dn;
+
 		port->peer_cert_valid = true;
 	}
 
@@ -618,6 +651,12 @@ be_tls_close(Port *port)
 		pfree(port->peer_cn);
 		port->peer_cn = NULL;
 	}
+
+	if (port->peer_dn)
+	{
+		pfree(port->peer_dn);
+		port->peer_dn = NULL;
+	}
 }
 
 ssize_t
diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h
index 30fb4e613d..d970277894 100644
--- a/src/include/libpq/libpq-be.h
+++ b/src/include/libpq/libpq-be.h
@@ -190,6 +190,7 @@ typedef struct Port
 	 */
 	bool		ssl_in_use;
 	char	   *peer_cn;
+	char	   *peer_dn;
 	bool		peer_cert_valid;
 
 	/*
-- 
2.25.1

From ac4812f4c30456849462d77982f9ce8139ba51a4 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Wed, 3 Feb 2021 11:42:05 -0800
Subject: [PATCH v10 2/2] Log authenticated identity from all auth backends

The "authenticated identity" is the string used by an auth method to
identify a particular user. In many common cases this is the same as the
Postgres username, but for some third-party auth methods, the identifier
in use may be shortened or otherwise translated (e.g. through pg_ident
mappings) before the server stores it.

To help DBAs see who has actually interacted with the system, store the
original identity when authentication succeeds, and expose it via the
log_connections setting. The log entries look something like this
example (where a local user named "pchampion" is connecting to the
database as the "admin" user):

LOG:  connection received: host=[local]
LOG:  connection authenticated: 

Re: [HACKERS] Custom compression methods

2021-03-24 Thread Tom Lane
Robert Haas  writes:
> On Wed, Mar 24, 2021 at 11:42 AM Tom Lane  wrote:
>> On reflection, though, I wonder if we've made pg_dump do the right
>> thing anyway.  There is a strong case to be made for the idea that
>> when dumping from a pre-14 server, it should emit
>> SET default_toast_compression = 'pglz';
>> rather than omitting any mention of the variable, which is what
>> I made it do in aa25d1089.

> But also ... aren't we just doing this to work around a test case that
> isn't especially good in the first place? Counting the number of lines
> in the diff between A and B is an extremely crude proxy for "they're
> similar enough that we probably haven't broken anything."

I wouldn't be proposing this if the xversion failures were the only
reason; making them go away is just a nice side-effect.  The core
point is that the charter of pg_dump is to reproduce the source
database's state, and as things stand we're failing to ensure we
do that.

(But yeah, we really need a better way of making this check in
the xversion tests.  I don't like the arbitrary "n lines of diff
is probably OK" business one bit.)

regards, tom lane




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Mon, Mar 22, 2021 at 4:57 PM Robert Haas  wrote:
> > Fixed.
>
> Fixed some more.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Justin Pryzby
On Wed, Mar 24, 2021 at 01:24:46AM -0500, Justin Pryzby wrote:
> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
> > Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  

In HEAD:catalogs.sgml, pg_statistic_ext (the table) says "object":
|Name of the statistics object
|Owner of the statistics object
|An array of attribute numbers, indicating which table columns are covered by 
this statistics object;

But pg_stats_ext (the view) doesn't say "object", which sounds wrong:
|Name of extended statistics
|Owner of the extended statistics
|Names of the columns the extended statistics is defined on

Other pre-existing issues: should be singular "statistic":
doc/src/sgml/perform.sgml: Another type of statistics stored for each 
column are most-common value
doc/src/sgml/ref/psql-ref.sgml:The status of each kind of extended 
statistics is shown in a column

Pre-existing issues: doesn't say "object" but I think it should:
src/backend/commands/statscmds.c:
errmsg("statistics creation on system columns is not supported")));
src/backend/commands/statscmds.c:
errmsg("cannot have more than %d columns in statistics",
src/backend/commands/statscmds.c:* If we got here and the OID is not 
valid, it means the statistics does
src/backend/commands/statscmds.c: * Select a nonconflicting name for a new 
statistics.
src/backend/commands/statscmds.c: * Generate "name2" for a new statistics given 
the list of column names for it
src/backend/statistics/extended_stats.c:/* compute statistics 
target for this statistics */
src/backend/statistics/extended_stats.c: * attributes the statistics is defined 
on, and then the default statistics
src/backend/statistics/mcv.c: * The input is the OID of the statistics, and 
there are no rows returned if

should say "for a statistics object" or "for statistics objects"
src/backend/statistics/extended_stats.c: * target for a statistics objects 
(from the object target, attribute targets

Your patch adds these:

Should say "object":
+* Check if we actually have a matching statistics for the expression.
+   /* evaluate expressions (if the statistics has any) */
+* for the extended statistics. The second option seems more reasonable.
+* the statistics had all options enabled on the original version.
+* But if the statistics is defined on just a single column, it has to
+   /* has the statistics expressions? */
+   /* expression - see if it's in the statistics */
+* column(s) the statistics depends on.  Also require all
+* statistics is defined on more than one column/expression).
+* statistics is useless, but harmless).
+* If there are no simply-referenced columns, give the statistics an auto

+* Then the first statistics matches no expressions and 3 vars,

Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra


On 3/24/21 7:24 AM, Justin Pryzby wrote:
> Most importantly, it looks like this forgets to update catalog documentation
> for stxexprs and stxkind='e'
> 

Good catch.

> It seems like you're preferring to use pluralized "statistics" in a lot of
> places that sound wrong to me.  For example:
>> Currently the first statistics wins, which seems silly.
> I can write more separately, but I think this is resolved and clarified if you
> write "statistics object" and not just "statistics".  
> 

OK "statistics object" seems better and more consistent.

>> +   Name of schema containing table
> 
> I don't know about the nearby descriptions, but this one sounds too much like 
> a
> "schema-containing" table.  Say "Name of the schema which contains the table" 
> ?
> 

I think the current spelling is OK / consistent with the other catalogs.

>> +   Name of table
> 
> Say "name of table on which the extended statistics are defined"
> 

I've used "Name of table the statistics object is defined on".

>> +   Name of extended statistics
> 
> "Name of the extended statistic object"
> 
>> +   Owner of the extended statistics
> 
> ..object
> 

OK

>> +   Expression the extended statistics is defined on
> 
> I think it should say "the extended statistic", or "the extended statistics
> object".  Maybe "..on which the extended statistic is defined"
> 

OK

>> +   of random access to the disk.  (This expression is null if the 
>> expression
>> +   data type does not have a  operator.)
> 
> expression's data type
> 

OK

>> +   much-too-small row count estimate in the first two queries. Moreover, the
> 
> maybe say "dramatically underestimates the rowcount"
> 

I've changed this to "... results in a significant underestimate of row
count".

>> +   planner has no information about relationship between the expressions, 
>> so it
> 
> the relationship
> 

OK

>> +   assumes the two WHERE and GROUP BY
>> +   conditions are independent, and multiplies their selectivities together 
>> to
>> +   arrive at a much-too-high group count estimate in the aggregate query.
> 
> severe overestimate ?
> 

OK

>> +   This is further exacerbated by the lack of accurate statistics for the
>> +   expressions, forcing the planner to use default ndistinct estimate for 
>> the
> 
> use *a default
> 

OK

>> +   expression derived from ndistinct for the column. With such statistics, 
>> the
>> +   planner recognizes that the conditions are correlated and arrives at much
>> +   more accurate estimates.
> 
> are correlated comma
> 

OK

>> +if (type->lt_opr == InvalidOid)
> 
> These could be !OidIsValid
> 

Maybe, but it's like this already. I'll leave this alone and then
fix/backpatch separately.

>> + * expressions. It's either expensive or very easy to defeat for
>> + * determined used, and there's no risk if we allow such statistics (the
>> + * statistics is useless, but harmless).
> 
> I think it's meant to say "for a determined user" ?
> 

Right.

>> + * If there are no simply-referenced columns, give the statistics an 
>> auto
>> + * dependency on the whole table.  In most cases, this will be 
>> redundant,
>> + * but it might not be if the statistics expressions contain no Vars
>> + * (which might seem strange but possible).
>> + */
>> +if (!nattnums)
>> +{
>> +ObjectAddressSet(parentobject, RelationRelationId, relid);
>> +recordDependencyOn(&myself, &parentobject, DEPENDENCY_AUTO);
>> +}
> 
> Can this be unconditional ?
> 

What would be the benefit? This behavior is copied from index_create, so
I'd prefer keeping it the same for consistency reasons. Presumably it's
like that for some reason (a bit of cargo cult programming, I know).

>> + * Translate the array of indexs to regular attnums for the dependency 
>> (we
> 
> sp: indexes
> 

OK

>> + * Not found a matching expression, so 
>> we can simply skip
> 
> Found no matching expr
> 

OK

>> +/* if found a matching, */
> 
> matching ..
> 

Matching dependency.

>> +examine_attribute(Node *expr)
> 
> Maybe you should rename this to something distinct ?  So it's easy to add a
> breakpoint there, for example.
> 

What would be a better name? It's not difficult to add a breakpoint
using a line number, for example.

>> +stats->anl_context = CurrentMemoryContext;  /* XXX should be using
>> +
>>  * something else? */
> 
>> +boolnulls[Natts_pg_statistic];
> ...
>> + * Construct a new pg_statistic tuple
>> + */
>> +for (i = 0; i < Natts_pg_statistic; ++i)
>> +{
>> +nulls[i] = false;
>> +}
> 
> Shouldn't you just write nulls[Natts_pg_statistic] = {false};
> or at least: memset(nulls, 0, sizeof(nulls));
> 

Maybe, but it's a copy of what 

Re: postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)

2021-03-24 Thread Bernd Helmle
Am Mittwoch, dem 24.03.2021 um 13:23 +0100 schrieb Matthias van de
Meent:
> Yes, but it should be noted that the main reason that was mentioned
> as
> a reason to exclude partitions is to not cause table catalog bloat,
> and I argue that this argument is not as solid in the case of the
> explicitly named tables of the LIMIT TO clause. Except if SQL
> standard
> prescribes otherwise, I think allowing partitions in LIMIT TO clauses
> is an improvement overall.

Don't get me wrong, I find this useful, too. Especially because it's a
very minor change in the code. And I don't see negative aspects here
currently, either (which doesn't mean there aren't some).

> 
> I myself have had this need, in that I've had to import some
> partitions manually as a result of this limitation. IMPORT FOREIGN
> SCHEMA really is great when it works, but limitations like these are
> crippling for some more specific use cases (e.g. allowing
> long-duration read-only access to one partition in the partition tree
> while also allowing the partition layout of the parents to be
> modified).

Interesting use case. 
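
Something along these lines (server, schema and partition names are made up)
is what allowing partitions in LIMIT TO would enable:

IMPORT FOREIGN SCHEMA public
  LIMIT TO (measurement_y2021m03)  -- a partition on the remote side
  FROM SERVER remote_srv INTO local_schema;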


-- 
Thanks,
Bernd






Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 14:48, Tomas Vondra
 wrote:
>
> AFAIK the primary issue here is that the two places disagree. While
> estimate_num_groups does this
>
> varnos = pull_varnos(root, (Node *) varshere);
> if (bms_membership(varnos) == BMS_SINGLETON)
> { ... }
>
> the add_unique_group_expr does this
>
> varnos = pull_varnos(root, (Node *) groupexpr);
>
> That is, one looks at the group expression, while the other look at vars
> extracted from it by pull_var_clause(). Apparently for PlaceHolderVar
> this can differ, causing the crash.
>
> So we need to change one of those places - my fix tweaked the second
> place to also look at the vars, but maybe we should change the other
> place? Or maybe it's not the right fix for PlaceHolderVars ...
>

I think that it doesn't make any difference which place is changed.

This is a case of an expression with no stats. With your change,
you'll get a single GroupExprInfo containing a list of
VariableStatData's for each of its Var's, whereas if you changed it
the other way, you'd get a separate GroupExprInfo for each Var. But I
think they'd both end up being treated the same by
estimate_multivariate_ndistinct(), since there wouldn't be any stats
matching the expression, only the individual Var's. Maybe changing the
first place would be the more bulletproof fix though.

Regards,
Dean




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 11:42 AM Tom Lane  wrote:
> On reflection, though, I wonder if we've made pg_dump do the right
> thing anyway.  There is a strong case to be made for the idea that
> when dumping from a pre-14 server, it should emit
> SET default_toast_compression = 'pglz';
> rather than omitting any mention of the variable, which is what
> I made it do in aa25d1089.  If we changed that, I think all these
> diffs would go away.  Am I right in thinking that what's being
> compared here is new pg_dump's dump from old server versus new
> pg_dump's dump from new server?
>
> The "strong case" goes like this: initdb a v14 cluster, change
> default_toast_compression to lz4 in its postgresql.conf, then
> try to pg_upgrade from an old server.  If the dump script doesn't
> set default_toast_compression = 'pglz' then the upgrade will
> do the wrong thing because all the tables will be recreated with
> a different behavior than they had before.  IIUC, this wouldn't
> result in broken data, but it still seems to me to be undesirable.
> dump/restore ought to do its best to preserve the old DB state,
> unless you explicitly tell it --no-toast-compression or the like.

This feels a bit like letting the tail wag the dog, because one might
reasonably guess that the user's intention in such a case was to
switch to using LZ4, and we've subverted that intention by deciding
that we know better. I wouldn't blame someone for thinking that using
--no-toast-compression with a pre-v14 server ought to have no effect,
but with your proposal here, it would. Furthermore, IIUC, the user has
no way of passing --no-toast-compression through to pg_upgrade, so
they're just going to have to do the upgrade and then fix everything
manually afterward to the state that they intended to have all along.
Now, on the other hand, if they wanted to make practically any other
kind of change while upgrading, they'd have to do something like that
anyway, so I guess this is no worse.

But also ... aren't we just doing this to work around a test case that
isn't especially good in the first place? Counting the number of lines
in the diff between A and B is an extremely crude proxy for "they're
similar enough that we probably haven't broken anything."

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Dilip Kumar
On Wed, Mar 24, 2021 at 9:32 PM Robert Haas  wrote:
>
> On Wed, Mar 24, 2021 at 11:41 AM Dilip Kumar  wrote:
> > Okay, that sounds like a reasonable design idea.  But the problem is
> > that in index_form_tuple we only have the index tuple descriptor, not the
> > heap tuple descriptor. Maybe we will have to pass the heap tuple
> > descriptor as a parameter to index_form_tuple.  I will think more
> > about how we can do that.
>
> Another option might be to decide that the pg_attribute tuples for the
> index columns always have to match the corresponding table columns.
> So, if you alter with ALTER TABLE, it runs around and updates all of
> the indexes to match. For expression index columns, we could store
> InvalidCompressionMethod, causing index_form_tuple() to substitute the
> run-time default. That kinda sucks, because it's a significant
> impediment to ever reducing the lock level for ALTER TABLE .. ALTER
> COLUMN .. SET COMPRESSION, but I'm not sure we have the luxury of
> worrying about that problem right now.

Actually, we are already doing this, I mean ALTER TABLE .. ALTER
COLUMN .. SET COMPRESSION is already updating the compression method
of the index attribute.  So 0003 doesn't make sense, sorry for the
noise.  However, 0001 and 0002 are still valid, or do you think that
we don't want 0001 also?  If we don't need 0001 also then we need to
update the test output for 0002 slightly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/24/21 12:04 PM, Jan Wieck wrote:

In any case I changed the options so that they behave the same way the
existing -o and -O (for old/new postmaster options) work. I don't think
it would be wise to have option forwarding work differently between
options for postmaster and options for pg_dump/pg_restore.


Attaching the actual diff might help.

--
Jan Wieck
Principal Database Engineer
Amazon Web Services
diff --git a/src/bin/pg_dump/parallel.c b/src/bin/pg_dump/parallel.c
index c7351a4..4a611d0 100644
--- a/src/bin/pg_dump/parallel.c
+++ b/src/bin/pg_dump/parallel.c
@@ -864,6 +864,11 @@ RunWorker(ArchiveHandle *AH, ParallelSlot *slot)
 	WaitForCommands(AH, pipefd);
 
 	/*
+	 * Close an eventually open BLOB batch transaction.
+	 */
+	CommitBlobTransaction((Archive *)AH);
+
+	/*
 	 * Disconnect from database and clean up.
 	 */
 	set_cancel_slot_archive(slot, NULL);
diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 0296b9b..cd8a590 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -203,6 +203,8 @@ typedef struct Archive
 	int			numWorkers;		/* number of parallel processes */
 	char	   *sync_snapshot_id;	/* sync snapshot id for parallel operation */
 
+	int			blobBatchSize;	/* # of blobs to restore per transaction */
+
 	/* info needed for string escaping */
 	int			encoding;		/* libpq code for client_encoding */
 	bool		std_strings;	/* standard_conforming_strings */
@@ -269,6 +271,7 @@ extern void WriteData(Archive *AH, const void *data, size_t dLen);
 extern int	StartBlob(Archive *AH, Oid oid);
 extern int	EndBlob(Archive *AH, Oid oid);
 
+extern void	CommitBlobTransaction(Archive *AH);
 extern void CloseArchive(Archive *AH);
 
 extern void SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt);
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 1f82c64..8331e8a 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -68,6 +68,7 @@ typedef struct _parallelReadyList
 	bool		sorted;			/* are valid entries currently sorted? */
 } ParallelReadyList;
 
+static int		blobBatchCount = 0;
 
 static ArchiveHandle *_allocAH(const char *FileSpec, const ArchiveFormat fmt,
 			   const int compression, bool dosync, ArchiveMode mode,
@@ -265,6 +266,8 @@ CloseArchive(Archive *AHX)
 	int			res = 0;
 	ArchiveHandle *AH = (ArchiveHandle *) AHX;
 
+	CommitBlobTransaction(AHX);
+
 	AH->ClosePtr(AH);
 
 	/* Close the output */
@@ -279,6 +282,23 @@ CloseArchive(Archive *AHX)
 
 /* Public */
 void
+CommitBlobTransaction(Archive *AHX)
+{
+	ArchiveHandle *AH = (ArchiveHandle *) AHX;
+
+	if (blobBatchCount > 0)
+	{
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "-- End BLOB restore batch\n");
+		ahprintf(AH, "--\n");
+		ahprintf(AH, "COMMIT;\n\n");
+
+		blobBatchCount = 0;
+	}
+}
+
+/* Public */
+void
 SetArchiveOptions(Archive *AH, DumpOptions *dopt, RestoreOptions *ropt)
 {
 	/* Caller can omit dump options, in which case we synthesize them */
@@ -3531,6 +3551,57 @@ _printTocEntry(ArchiveHandle *AH, TocEntry *te, bool isData)
 {
 	RestoreOptions *ropt = AH->public.ropt;
 
+	/* We restore BLOBs in batches to reduce XID consumption */
+	if (strcmp(te->desc, "BLOB") == 0 && AH->public.blobBatchSize > 0)
+	{
+		if (blobBatchCount > 0)
+		{
+			/* We are inside a BLOB restore transaction */
+			if (blobBatchCount >= AH->public.blobBatchSize)
+			{
+/*
+ * We did reach the batch size with the previous BLOB.
+ * Commit and start a new batch.
+ */
+ahprintf(AH, "--\n");
+ahprintf(AH, "-- BLOB batch size reached\n");
+ahprintf(AH, "--\n");
+ahprintf(AH, "COMMIT;\n");
+ahprintf(AH, "BEGIN;\n\n");
+
+blobBatchCount = 1;
+			}
+			else
+			{
+/* This one still fits into the current batch */
+blobBatchCount++;
+			}
+		}
+		else
+		{
+			/* Not inside a transaction, start a new batch */
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- Start BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "BEGIN;\n\n");
+
+			blobBatchCount = 1;
+		}
+	}
+	else
+	{
+		/* Not a BLOB. If we have a BLOB batch open, close it. */
+		if (blobBatchCount > 0)
+		{
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "-- End BLOB restore batch\n");
+			ahprintf(AH, "--\n");
+			ahprintf(AH, "COMMIT;\n\n");
+
+			blobBatchCount = 0;
+		}
+	}
+
 	/* Select owner, schema, tablespace and default AM as necessary */
 	_becomeOwner(AH, te);
 	_selectOutputSchema(AH, te->namespace);
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f8bec3f..f153f08 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -165,12 +165,20 @@ static void guessConstraintInheritance(TableInfo *tblinfo, int numTables);
 static void dumpComment(Archive *fout, const char *type, const char *name,
 		const char *namespace, const char *owner,
 		CatalogId catalogId, int subid, DumpId dumpId);
+static bool dumpCommentQuery(Archive *fout, PQExpBuffer 

Re: default result formats setting

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 12:01 PM Tom Lane  wrote:
> I don't think I buy the premise that there are exactly two levels
> on the client side.

Thanks for sharing your thoughts on this. I agree it's a complex
issue, and the idea that there are possibly more than two logical
levels is, for me, maybe your most interesting observation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade failing for 200+ million Large Objects

2021-03-24 Thread Jan Wieck

On 3/23/21 4:55 PM, Tom Lane wrote:

Jan Wieck  writes:
Have we even reached a consensus yet on whether doing it the way my patch
is proposing is the right way to go? Like that emitting BLOB TOC
entries into SECTION_DATA when in binary upgrade mode is a good thing?
Or that bunching all the SQL statements for creating the blob, changing
the ACL and COMMENT and SECLABEL all in one multi-statement query is.


Now you're asking for actual review effort, which is a little hard
to come by towards the tail end of the last CF of a cycle.  I'm
interested in this topic, but I can't justify spending much time
on it right now.


Understood.

In any case I changed the options so that they behave the same way the
existing -o and -O (for old/new postmaster options) work. I don't think 
it would be wise to have option forwarding work differently between 
options for postmaster and options for pg_dump/pg_restore.



Regards, Jan

--
Jan Wieck
Principal Database Engineer
Amazon Web Services




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 11:41 AM Dilip Kumar  wrote:
> Okay, that sounds like a reasonable design idea.  But the problem is
> that in index_form_tuple we only have the index tuple descriptor, not the
> heap tuple descriptor. Maybe we will have to pass the heap tuple
> descriptor as a parameter to index_form_tuple.  I will think more
> about how we can do that.

Another option might be to decide that the pg_attribute tuples for the
index columns always have to match the corresponding table columns.
So, if you alter with ALTER TABLE, it runs around and updates all of
the indexes to match. For expression index columns, we could store
InvalidCompressionMethod, causing index_form_tuple() to substitute the
run-time default. That kinda sucks, because it's a significant
impediment to ever reducing the lock level for ALTER TABLE .. ALTER
COLUMN .. SET COMPRESSION, but I'm not sure we have the luxury of
worrying about that problem right now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-24, Julien Rouhaud wrote:

> From e08c9d5fc86ba722844d97000798de868890aba3 Mon Sep 17 00:00:00 2001
> From: Bruce Momjian 
> Date: Mon, 22 Mar 2021 17:43:23 -0400
> Subject: [PATCH v20 2/3] Expose queryid in pg_stat_activity and

>  src/backend/executor/execMain.c   |   9 ++
>  src/backend/executor/execParallel.c   |  14 ++-
>  src/backend/executor/nodeGather.c |   3 +-
>  src/backend/executor/nodeGatherMerge.c|   4 +-

Hmm...

I find it odd that there's executor code that acquires the current query
ID from pgstat, after having been put there by planner or ExecutorStart
itself.  Seems like a modularity violation.  I wonder if it would make
more sense to have the value maybe in struct EState (or perhaps there's
a better place -- but I don't think they have a way to reach the
QueryDesc anyhow), put there by ExecutorStart, so that places such as
execParallel, nodeGather etc don't have to fetch it from pgstat but from
EState.

-- 
Álvaro Herrera   Valdivia, Chile
"We're here to devour each other alive"(Hobbes)




Re: default result formats setting

2021-03-24 Thread Tom Lane
Robert Haas  writes:
> On Wed, Mar 24, 2021 at 10:58 AM Tom Lane  wrote:
>> A client that is sending -1 and assuming that it will get back
>> a particular format could get broken if the GUC doesn't have the
>> value it thinks, true.  But I'd argue that such code is unreasonably
>> non-robust.  Can't we solve this by recommending that clients using
>> this feature always double-check which format they actually got?
>> ISTM that the use-cases for the feature involve checking what data
>> type you got anyway, so that's not an unreasonable added requirement.

> I suppose that's a fair idea, but to me it still feels a bit like a
> round peg in the square hole. Suppose for example that there's a
> client application which wants to talk to a connection pooler which in
> turn wants to talk to the server. Let's also suppose that connection
> pooler isn't just a pass-through, but wants to redirect client
> connections to various servers or even intercept queries and result
> sets and make changes as the data passes by. It can do that by parsing
> SQL and solving the halting problem, whereas if this were a
> protocol-level option it would be completely doable. Now you could say
> "well, by that argument, DateStyle ought to be a protocol-level
> option, too," and that's a pretty fair criticism of what I'm
> saying here. On the other hand, I'm not too sure that wouldn't have
> been the right call. Using SQL to tailor the wire protocol format
> feels like some kind of layering inversion to me.

I can't say that I'm 100% comfortable with it either, but the alternative
seems quite unpleasant, precisely because the client side might have
multiple layers involved.  If we make it a wire-protocol thing then
a whole lot of client API thrashing is going to ensue to transmit the
desired setting up and down the stack.  As an example, libpq doesn't
really give a darn which data format is returned: it is the application
using libpq that would want to be able to set this option.  If libpq
has to be involved in transmitting the option to the backend, then we
need a new libpq API call to tell it to do that.  Rinse and repeat
for anything that wraps libpq.  And, in the end, it's not real clear
which client-side layer *should* have control of this.  In some cases
you might want the decision to be taken quite high up, because which
format is really more efficient will depend on the total usage picture
for a given application, which low-level code like libpq wouldn't know.
Having a library decide that "this buck stops with me" is likely to be
the wrong thing.

I do not understand the structure of the client stack for JDBC, but
I wonder whether there won't be similar issues there.

As you say, DateStyle and the like are precedents for things that
*could* break application stacks, and in another universe maybe we'd
have managed them differently.  In the end though, they've been like
that for a long time and we've not heard many complaints about them.
So I'm inclined to think that that precedent says this is OK too.

BTW, I thought briefly about whether we could try to lock things down
a bit by marking the GUC as PGC_BACKEND, which would effectively mean
that clients would have to send it in the startup packet.  However,
that would verge on making it unusable for non-built-in datatypes,
for which you need to look up the OID first.  So I don't think that'd
be an improvement.

> I think we should be
> working toward a state where it's more clear which things are "owned"
> at the wire protocol level and which things are "owned" at the SQL
> level, and this seems to be going in exactly the opposite direction,

I don't think I buy the premise that there are exactly two levels
on the client side.

regards, tom lane




Re: Tying an object's ownership to datdba

2021-03-24 Thread John Naylor
On Mon, Dec 28, 2020 at 12:32 AM Noah Misch  wrote:
> [v2]

Hi Noah,

In the refactoring patch, there is a lingering comment reference to
roles_has_privs_of(). Aside from that, it looks good to me. A possible
thing to consider is an assert that is_admin is not null where we expect
that.

The database owner role patch looks good as well.

> I ended up blocking DDL that creates role memberships involving the new role;
> see reasons in user.c comments.  Lifting those restrictions looked feasible,
> but it was inessential to the mission, and avoiding unintended consequences
> would have been tricky.  Views "information_schema.enabled_roles" and
> "information_schema.applicable_roles" list any implicit membership in
> pg_database_owner, but pg_catalog.pg_group and psql \dgS do not.  If this
> leads any reviewer to look closely at applicable_roles, note that its behavior
> doesn't match its documentation
> (https://postgr.es/m/flat/20060728170615.gy20...@kenobi.snowman.net).

Is this something that needs fixing separately?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Support for NSS as a libpq TLS backend

2021-03-24 Thread Jacob Champion
On Wed, 2021-03-24 at 09:28 +0900, Michael Paquier wrote:
> On Wed, Mar 24, 2021 at 12:05:35AM +, Jacob Champion wrote:
> > I can work around it temporarily for the
> > tests, but this will be a problem if any libpq clients load up multiple
> > independent databases for use with separate connections. Anyone know if
> > this is a supported use case for NSS?
> 
> Are you referring to the case of threading here?  This should be a
> supported case, as threads created by an application through libpq
> could perfectly use completely different connection strings.
Right, but to clarify -- I was asking if *NSS* supports loading and
using separate certificate databases as part of its API. It seems like
the internals make it possible, but I don't see the public interfaces
to actually use those internals.

--Jacob


Re: [HACKERS] Custom compression methods

2021-03-24 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/20/21 3:03 PM, Tom Lane wrote:
>> I fixed up some issues in 0008/0009 (mostly cosmetic, except that
>> you forgot a server version check in dumpToastCompression) and
>> pushed that, so we can see if it makes crake happy.

> It's still produced a significantly larger difference between the
> dumps. For now I've increased the fuzz factor a bit like this:
> -   if (   ($oversion ne $this_branch && $difflines < 2000)
> +   if (   ($oversion ne $this_branch && $difflines < 2700)
> I'll try to come up with something better. Maybe just ignore lines like
> SET default_toast_compression = 'pglz';
> when taking the diff.

I see that some other buildfarm animals besides your own critters
are still failing the xversion tests, presumably because they lack
this hack :-(.

On reflection, though, I wonder if we've made pg_dump do the right
thing anyway.  There is a strong case to be made for the idea that
when dumping from a pre-14 server, it should emit
SET default_toast_compression = 'pglz';
rather than omitting any mention of the variable, which is what
I made it do in aa25d1089.  If we changed that, I think all these
diffs would go away.  Am I right in thinking that what's being
compared here is new pg_dump's dump from old server versus new
pg_dump's dump from new server?

The "strong case" goes like this: initdb a v14 cluster, change
default_toast_compression to lz4 in its postgresql.conf, then
try to pg_upgrade from an old server.  If the dump script doesn't
set default_toast_compression = 'pglz' then the upgrade will
do the wrong thing because all the tables will be recreated with
a different behavior than they had before.  IIUC, this wouldn't
result in broken data, but it still seems to me to be undesirable.
dump/restore ought to do its best to preserve the old DB state,
unless you explicitly tell it --no-toast-compression or the like.
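
As a sketch (table name is made up), the dump script from a pre-14 server
would then start with the SET, and after restoring one can verify that the
columns kept their old behavior:

SET default_toast_compression = 'pglz';
CREATE TABLE t (x text);
-- after restore, in the new cluster:
SELECT attname, attcompression FROM pg_attribute
WHERE attrelid = 't'::regclass AND attnum > 0;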

regards, tom lane




Re: multi-install PostgresNode

2021-03-24 Thread Alvaro Herrera
On 2021-Mar-24, Andrew Dunstan wrote:

> 
> On 3/24/21 9:23 AM, Andrew Dunstan wrote:
> > On 3/24/21 8:29 AM, Alvaro Herrera wrote:

> > If we're going to do that we probably shouldn't special case any
> > particular settings, but simply take any extra arguments as extra env
> > settings. And if any setting has undef (e.g. PGAPPNAME) we should unset it.

> like this.

Hmm, I like that PGAPPNAME handling has resulted in an overall
simplification.  I'm not sure why you prefer to keep PGHOST and PGPORT
handled individually at each callsite however; why not do it like
_install, and add them to the environment always?  I doubt there's
anything that requires them *not* to be set; and if there is, it's easy
to make the claim that that's broken and should be fixed.

I'm just saying that cluttering _get_install_env() with those two
settings would result in less clutter overall.

-- 
Álvaro Herrera   Valdivia, Chile




Re: [HACKERS] Custom compression methods

2021-03-24 Thread Dilip Kumar
On Wed, Mar 24, 2021 at 8:41 PM Robert Haas  wrote:
>
> On Wed, Mar 24, 2021 at 7:45 AM Dilip Kumar  wrote:
> > I have anyway created a patch for this as well.  Including all three
> > patches so we don't lose track.
> >
> > 0001 ->shows compression method for the index attribute in index describe
> > 0002 -> fix the reported bug (test case included)
> > (optional) 0003-> Alter set compression for index column
>
> As I understand it, the design idea here up until now has been that
> the index's attcompression values are irrelevant and ignored and that
> any compression which happens for index attributes is based either on
> the table attribute's assigned attcompression value, or the default.
> If that's the idea, then all of these patches are wrong.

The current design is that whenever we create an index, the index's
attribute copies the attcompression from the table's attribute.  And,
while compressing the index tuple we will use the attcompression from
the index attribute.
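
A minimal way to see the behavior I'm describing (names are made up, and
this assumes a build with lz4 support):

CREATE TABLE t (x text COMPRESSION lz4);
CREATE INDEX t_x_idx ON t (x);
-- per the above, the index column's attcompression is copied from t.x:
SELECT c.relname, a.attname, a.attcompression
FROM pg_attribute a JOIN pg_class c ON c.oid = a.attrelid
WHERE c.relname IN ('t', 't_x_idx') AND a.attnum > 0;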

> Now, a possible alternative design would be that the index's
> attcompression controls compression for the index same as a table's
> does for the table. But in that case, it seems to me that these
> patches are insufficient, because then we'd also need to, for example,
> dump and restore the setting, which I don't think anything in these
> patches or the existing code will do.

Yeah, you are right.

> My vote, as of now, is for the first design, in which case you need to
> forget about trying to get pg_attribute to have the right contents -
> in fact, I think we should set all the values there to
> InvalidCompressionMethod to make sure we're not relying on them
> anywhere. And then you need to make sure that everything that tries to
> compress an index value uses the setting from the table column or the
> default, not the setting on the index column.

Okay, that sounds like a reasonable design idea.  But the problem is
that in index_form_tuple we only have the index tuple descriptor, not the
heap tuple descriptor. Maybe we will have to pass the heap tuple
descriptor as a parameter to index_form_tuple.  I will think more
about how we can do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: truncating timestamps on arbitrary intervals

2021-03-24 Thread Peter Eisentraut

On 18.01.21 21:54, John Naylor wrote:
On Mon, Nov 23, 2020 at 1:44 PM John Naylor 
mailto:john.nay...@enterprisedb.com>> wrote:

 >
 > On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut 
> wrote:

 > > - After reading the discussion a few times, I'm not so sure anymore
 > > whether making this a cousin of date_trunc is the right way to go.  As
 > > you mentioned, there are some behaviors specific to date_trunc that
 > > don't really make sense in date_trunc_interval, and maybe we'll have
 > > more of those.

For v10, I simplified the behavior by decoupling the behavior from 
date_trunc() and putting in some restrictions as discussed earlier. It's 
much simpler now. It could be argued that it goes too far in that 
direction, but it's easy to reason about and we can put back some 
features as we see fit.


Committed.

I noticed that some of the documentation disappeared between v9 and v10. 
 So I put that back and updated it appropriately.  I also added a few 
more test cases to cover some things that have been discussed during the 
course of this thread.


As a potential follow-up, should we perhaps add named arguments?  That 
might make the invocations easier to read, depending on taste.
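
For example (the parameter names here are hypothetical, since they would
have to be added to the function definition first):

SELECT date_bin('15 minutes', timestamp '2021-03-24 13:37:11',
                timestamp '2001-01-01');
-- versus
SELECT date_bin(stride => '15 minutes',
                source => timestamp '2021-03-24 13:37:11',
                origin => timestamp '2001-01-01');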





Re: default result formats setting

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 10:58 AM Tom Lane  wrote:
> Robert Haas  writes:
> > But ... if it's just a GUC, it can be set by code on the server side
> > that the client knows nothing about, breaking the client. That seems
> > pretty bad to me.
>
> It's impossible for the proposed patch to break *existing* clients,
> because they all send requested format 0 or 1, and that is exactly
> what they'll get back.

OK.

> A client that is sending -1 and assuming that it will get back
> a particular format could get broken if the GUC doesn't have the
> value it thinks, true.  But I'd argue that such code is unreasonably
> non-robust.  Can't we solve this by recommending that clients using
> this feature always double-check which format they actually got?
> ISTM that the use-cases for the feature involve checking what data
> type you got anyway, so that's not an unreasonable added requirement.

I suppose that's a fair idea, but to me it still feels a bit like a
round peg in the square hole. Suppose for example that there's a
client application which wants to talk to a connection pooler which in
turn wants to talk to the server. Let's also suppose that connection
pooler isn't just a pass-through, but wants to redirect client
connections to various servers or even intercept queries and result
sets and make changes as the data passes by. It can do that by parsing
SQL and solving the halting problem, whereas if this were a
protocol-level option it would be completely doable. Now you could say
"well, by that argument, DateStyle ought to be a protocol-level
option, too," and that's a pretty fair criticism of what I'm
saying here. On the other hand, I'm not too sure that wouldn't have
been the right call. Using SQL to tailor the wire protocol format
feels like some kind of layering inversion to me. I think we should be
working toward a state where it's more clear which things are "owned"
at the wire protocol level and which things are "owned" at the SQL
level, and this seems to be going in exactly the opposite direction,
and in fact probably taking things further in that direction than
we've ever gone before.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-24 Thread Julien Rouhaud
On Wed, Mar 24, 2021 at 08:13:40AM -0400, Bruce Momjian wrote:
> On Wed, Mar 24, 2021 at 04:51:40PM +0800, Julien Rouhaud wrote:
> > > but if you do that it'd be better to avoid
> > > introducing a function with one name and renaming it in your next
> > > commit.
> > 
> > Oops, I apparently messed up a fixup when working on it.  Bruce, should I take
> > care of that or do you want to?  I think you have some local modifications
> > already and I'd rather not miss some changes.
> 
> I have no local modifications.  Please modify the patch I posted and
> repost your version, thanks.

Ok!  I used the last version of the patch you sent and addressed the following
comments from earlier messages in attached v20:

- copyright year to 2021
- s/has has been compute/has been compute/
- use the name CleanQuerytext in the first commit

I didn't change the position of queryid in pg_stat_get_activity(), as the
"real" order is actually define in system_views.sql when creating
pg_stat_activity view.  Adding the new fields at the end of
pg_stat_get_activity() helps to keep the C code simpler and less bug prone, so
I think it's best to continue this way.

I also used the previous commit message if that helps.
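
For reference, with the patch applied the intended usage is roughly:

SET compute_query_id = on;
SELECT pid, query_id, query FROM pg_stat_activity;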
>From 5df95cb13c3505d654bd480c8978fe6f5eba00bb Mon Sep 17 00:00:00 2001
From: Bruce Momjian 
Date: Mon, 22 Mar 2021 17:43:22 -0400
Subject: [PATCH v20 1/3] Move pg_stat_statements query jumbling to core.

A new compute_query_id GUC is also added, to control whether a query identifier
should be computed by the core.  It's therefore now possible to disable core
queryid computation and use pg_stat_statements with a different algorithm to
compute the query identifier by using a third-party module.

To ensure that a single source of query identifier can be used and is well
defined, modules that calculate a query identifier should throw an error if
compute_query_id is enabled or if a query identifier was already calculated.
---
 .../pg_stat_statements/pg_stat_statements.c   | 805 +
 .../pg_stat_statements.conf   |   1 +
 doc/src/sgml/config.sgml  |  25 +
 doc/src/sgml/pgstatstatements.sgml|  20 +-
 src/backend/parser/analyze.c  |  14 +-
 src/backend/tcop/postgres.c   |   6 +-
 src/backend/utils/misc/Makefile   |   1 +
 src/backend/utils/misc/guc.c  |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/misc/queryjumble.c  | 834 ++
 src/include/parser/analyze.h  |   4 +-
 src/include/utils/guc.h   |   1 +
 src/include/utils/queryjumble.h   |  58 ++
 13 files changed, 995 insertions(+), 785 deletions(-)
 create mode 100644 src/backend/utils/misc/queryjumble.c
 create mode 100644 src/include/utils/queryjumble.h

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 62cccbfa44..bd8c96728c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -8,24 +8,9 @@
  * a shared hashtable.  (We track only as many distinct queries as will fit
  * in the designated amount of shared memory.)
  *
- * As of Postgres 9.2, this module normalizes query entries.  Normalization
- * is a process whereby similar queries, typically differing only in their
- * constants (though the exact rules are somewhat more subtle than that) are
- * recognized as equivalent, and are tracked as a single entry.  This is
- * particularly useful for non-prepared queries.
- *
- * Normalization is implemented by fingerprinting queries, selectively
- * serializing those fields of each query tree's nodes that are judged to be
- * essential to the query.  This is referred to as a query jumble.  This is
- * distinct from a regular serialization in that various extraneous
- * information is ignored as irrelevant or not essential to the query, such
- * as the collations of Vars and, most notably, the values of constants.
- *
- * This jumble is acquired at the end of parse analysis of each query, and
- * a 64-bit hash of it is stored into the query's Query.queryId field.
- * The server then copies this value around, making it available in plan
- * tree(s) generated from the query.  The executor can then use this value
- * to blame query costs on the proper queryId.
+ * Starting in Postgres 9.2, this module normalized query entries.  As of
+ * Postgres 14, the normalization is done by the core if compute_query_id is
+ * enabled, or optionally by third-party modules.
  *
  * To facilitate presenting entries to users, we create "representative" query
  * strings in which constants are replaced with parameter symbols ($n), to
@@ -114,8 +99,6 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 #define USAGE_DEALLOC_PERCENT	5	/* free this % of entries at once */
 #define IS_STICKY(c)	((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0)
 

Re: Get memory contexts of an arbitrary backend process

2021-03-24 Thread torikoshia

On 2021-03-23 17:24, Kyotaro Horiguchi wrote:

Thanks for reviewing and suggestions!


At Mon, 22 Mar 2021 15:09:58 +0900, torikoshia
 wrote in

>> If MemoryContextStatsPrint(), i.e. showing 100 children at most is
>> enough, this hard limit may be acceptable.
> Can't this number be passed via shared memory?

The attached patch uses static shared memory to pass the number.


"pg_print_backend_memory_contexts"

That name looks as if it returns the result as text when used on the
command line. We could have pg_get_backend_memory_context(bool
dump_to_log (or where to dump), int limit).  Or couldn't we name it
differently even in the case we add a separate function?


Redefined pg_get_backend_memory_contexts() as
pg_get_backend_memory_contexts(pid, int max_children).

When pid equals 0, pg_get_backend_memory_contexts() prints local memory
contexts as the original pg_get_backend_memory_contexts() does.
In this case, 'max_children' is ignored.

When 'pid' does not equal 0 and it is the PID of the client backend,
 memory contexts are logged through elog().
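
So with the patch, usage would look roughly like this (the pid below is just
an example):

-- own backend, same output as before; max_children is ignored
SELECT * FROM pg_get_backend_memory_contexts(0, 0);
-- another backend: contexts are written to the server log,
-- at most 100 children per parent context (0 would mean no limit)
SELECT pg_get_backend_memory_contexts(12345, 100);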



+/*
+ * MaxChildrenPerContext
+ *   Max number of children to print per one parent context.
+ */
+int  *MaxChildrenPerContext = NULL;

Perhaps it'd be better to have a struct even if it consists only of
one member.  (Aligned) C-int values are atomic so we can omit the
McxtPrintLock. (I don't think it's a problem even if it is modifed
while reading^^:)


Fixed them.


+ if(max_children <= 0)
+ {
+ ereport(WARNING,
+ (errmsg("%d is invalid value", max_children),
+  errhint("second parameter is the number of context and it must be set to a value greater than or equal to 1")));

It's annoying to choose a number large enough when I want to dump
children unlimitedly.  Couldn't we use 0 to specify "unlimited"?


Modified as you suggested.

+ (errmsg("%d is invalid value", max_children),
+  errhint("second parameter is the number of context and it must be set to a value greater than or equal to 1")));

For the main message, (I think) we usually spell the "%d is invalid
value" as "maximum number of children must be positive" or such.  For
the hint, we don't need a copy of the primary section of the
documentation here.


Modified it to "The maximum number of children must be greater than 0".



I think we should ERROR out for invalid parameters, at least for
max_children.  I'm not sure about pid since we might call it based on
pg_stat_activity..


Changed to ERROR out when the 'max_children' is less than 0.

Regarding pid, I left it untouched considering the consistency with other
signal sending functions such as pg_cancel_backend().

+ if(!SendProcSignal(pid, PROCSIG_PRINT_MEMORY_CONTEXT, 
InvalidBackendId))


We know the backendid of the process here.


Added it.



+ if (is_dst_stderr)
+ {
+ for (i = 0; i <= level; i++)
+ fprintf(stderr, "  ");

The fprintf path is used nowhere in the patch at all. It can be used
while attaching debugger but I'm not sure we need that code.  The
footprint of this patch is largely shrinked by removing it.


According to the past discussion[1], people wanted MemoryContextStats
as it was, so I think it's better that MemoryContextStats can be used
as before.



+ strcat(truncated_ident, delimiter);

strcpy is sufficient here.  And we don't need the delimiter to be a
variable.  (we can copy a string literal into truncate_ident, then
count the length of truncate_ident, instead of the delimiter
variable.)


True.

+ $current_logfiles = slurp_file($node->data_dir . 
'/current_logfiles');

...
+my $lfname = $current_logfiles;
+$lfname =~ s/^stderr //;
+chomp $lfname;

$node->logfile is the current log file name.

+ 'target PID is not PostgreSQL server process');

Maybe "check if PID check is working" or such?  And, we can do
something like the following to exercise in a more practical way.

 select pg_print_backend...(pid,) from pg_stat_activity where
backend_type = 'checkpointer';


It seems better.


As documented, the current implementation allows that, when multiple
pg_print_backend_memory_contexts() calls happen in succession or
simultaneously, max_children can end up being the one from another
pg_print_backend_memory_contexts() call.
I had tried to avoid this by adding some state information and using
before_shmem_exit() to clean up that state information on process
termination, as in the patch I presented earlier, but since kill()
returns success before the dumper calls the signal handler, it seemed
there were times when we couldn't clean up the state.
Since this happens only when multiple pg_print_backend_memory_contexts()
calls are made with different numbers of children, and the effect is just
an unintended number of children being printed, it might be

Re: [HACKERS] Custom compression methods

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 7:45 AM Dilip Kumar  wrote:
> I have anyway created a patch for this as well.  Including all three
> patches so we don't lose track.
>
> 0001 ->shows compression method for the index attribute in index describe
> 0002 -> fix the reported bug (test case included)
> (optional) 0003-> Alter set compression for index column

As I understand it, the design idea here up until now has been that
the index's attcompression values are irrelevant and ignored and that
any compression which happens for index attributes is based either on
the table attribute's assigned attcompression value, or the default.
If that's the idea, then all of these patches are wrong.

Now, a possible alternative design would be that the index's
attcompression controls compression for the index same as a table's
does for the table. But in that case, it seems to me that these
patches are insufficient, because then we'd also need to, for example,
dump and restore the setting, which I don't think anything in these
patches or the existing code will do.

My vote, as of now, is for the first design, in which case you need to
forget about trying to get pg_attribute to have the right contents -
in fact, I think we should set all the values there to
InvalidCompressionMethod to make sure we're not relying on them
anywhere. And then you need to make sure that everything that tries to
compress an index value uses the setting from the table column or the
default, not the setting on the index column.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: multi-install PostgresNode

2021-03-24 Thread Andrew Dunstan

On 3/24/21 9:23 AM, Andrew Dunstan wrote:
> On 3/24/21 8:29 AM, Alvaro Herrera wrote:
>> On 2021-Mar-24, Dagfinn Ilmari Mannsåker wrote:
>>
>>> I think it would be even neater having a method that returns the desired
>>> environment and then have the other methods do:
>>>
>>>   local %ENV = $self->_get_install_env();
>> Hmm, is it possible to integrate PGHOST and PGPORT handling into this
>> too?  Seems like it is, so the name of the routine should be something
>> more general (and also it should not have the quick "return unless
>> $inst").
>>
>
> If we're going to do that we probably shouldn't special case any
> particular settings, but simply take any extra arguments as extra env
> settings. And if any setting has undef (e.g. PGAPPNAME) we should unset it.
>
>


like this.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 97e05993be..fa2ccf027f 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -355,6 +355,7 @@ sub info
 	print $fh "Archive directory: " . $self->archive_dir . "\n";
 	print $fh "Connection string: " . $self->connstr . "\n";
 	print $fh "Log file: " . $self->logfile . "\n";
+	print $fh "Install Path: " , $self->{_install_path} . "\n" if $self->{_install_path};
 	close $fh or die;
 	return $_info;
 }
@@ -428,6 +429,8 @@ sub init
 	my $pgdata = $self->data_dir;
 	my $host   = $self->host;
 
+	local %ENV = $self->_get_install_env();
+
 	$params{allows_streaming} = 0 unless defined $params{allows_streaming};
 	$params{has_archiving}= 0 unless defined $params{has_archiving};
 
@@ -555,6 +558,8 @@ sub backup
 	my $backup_path = $self->backup_dir . '/' . $backup_name;
 	my $name= $self->name;
 
+	local %ENV = $self->_get_install_env();
+
 	print "# Taking pg_basebackup $backup_name from node \"$name\"\n";
 	TestLib::system_or_bail(
 		'pg_basebackup', '-D', $backup_path, '-h',
@@ -784,18 +789,15 @@ sub start
 
 	print("### Starting node \"$name\"\n");
 
-	{
-		# Temporarily unset PGAPPNAME so that the server doesn't
-		# inherit it.  Otherwise this could affect libpqwalreceiver
-		# connections in confusing ways.
-		local %ENV = %ENV;
-		delete $ENV{PGAPPNAME};
-
-		# Note: We set the cluster_name here, not in postgresql.conf (in
-		# sub init) so that it does not get copied to standbys.
-		$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
+	# Temporarily unset PGAPPNAME so that the server doesn't
+	# inherit it.  Otherwise this could affect libpqwalreceiver
+	# connections in confusing ways.
+	local %ENV = $self->_get_install_env(PGAPPNAME => undef);
+
+	# Note: We set the cluster_name here, not in postgresql.conf (in
+	# sub init) so that it does not get copied to standbys.
+	$ret = TestLib::system_log('pg_ctl', '-D', $self->data_dir, '-l',
 			$self->logfile, '-o', "--cluster-name=$name", 'start');
-	}
 
 	if ($ret != 0)
 	{
@@ -826,6 +828,9 @@ sub kill9
 	my ($self) = @_;
 	my $name = $self->name;
 	return unless defined $self->{_pid};
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Killing node \"$name\" using signal 9\n";
 	# kill(9, ...) fails under msys Perl 5.8.8, so fall back on pg_ctl.
 	kill(9, $self->{_pid})
@@ -852,6 +857,9 @@ sub stop
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $name   = $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	$mode = 'fast' unless defined $mode;
 	return unless defined $self->{_pid};
 	print "### Stopping node \"$name\" using mode $mode\n";
@@ -874,6 +882,9 @@ sub reload
 	my $port   = $self->port;
 	my $pgdata = $self->data_dir;
 	my $name   = $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Reloading node \"$name\"\n";
 	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, 'reload');
 	return;
@@ -895,15 +906,12 @@ sub restart
 	my $logfile = $self->logfile;
 	my $name= $self->name;
 
-	print "### Restarting node \"$name\"\n";
+	local %ENV = $self->_get_install_env(PGAPPNAME => undef);
 
-	{
-		local %ENV = %ENV;
-		delete $ENV{PGAPPNAME};
+	print "### Restarting node \"$name\"\n";
 
-		TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
 			'restart');
-	}
 
 	$self->_update_pid(1);
 	return;
@@ -924,6 +932,9 @@ sub promote
 	my $pgdata  = $self->data_dir;
 	my $logfile = $self->logfile;
 	my $name= $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Promoting node \"$name\"\n";
 	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
 		'promote');
@@ -945,6 +956,9 @@ sub logrotate
 	my $pgdata  = $self->data_dir;
 	my $logfile = $self->logfile;
 	my $name= $self->name;
+
+	local %ENV = $self->_get_install_env();
+
 	print "### Rotating log in node \"$name\"\n";
 	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
 		'logrotate');
@@ -1117,6 +1131,14 @@ By default, all nodes use the same 

Re: default result formats setting

2021-03-24 Thread Tom Lane
Robert Haas  writes:
> But ... if it's just a GUC, it can be set by code on the server side
> that the client knows nothing about, breaking the client. That seems
> pretty bad to me.

It's impossible for the proposed patch to break *existing* clients,
because they all send requested format 0 or 1, and that is exactly
what they'll get back.

A client that is sending -1 and assuming that it will get back
a particular format could get broken if the GUC doesn't have the
value it thinks, true.  But I'd argue that such code is unreasonably
non-robust.  Can't we solve this by recommending that clients using
this feature always double-check which format they actually got?
ISTM that the use-cases for the feature involve checking what data
type you got anyway, so that's not an unreasonable added requirement.
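
For illustration only (this assumes the proposed -1 "use the session
default" value is accepted by PQexecParams; everything else is the
existing libpq API, and the function name is invented), a client using
the feature might double-check like so:

    /*
     * Sketch: request the session-default result format and then check
     * what the server actually sent back instead of assuming.
     * Requires <stdio.h> and <libpq-fe.h>.
     */
    static void
    fetch_with_default_format(PGconn *conn)
    {
        PGresult   *res = PQexecParams(conn, "SELECT now()",
                                       0, NULL, NULL, NULL, NULL,
                                       -1);    /* resultFormat: proposed value */

        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            /* Never assume: column 0 may have come back as text or binary. */
            if (PQfformat(res, 0) == 1)
                printf("binary result, %d bytes\n", PQgetlength(res, 0, 0));
            else
                printf("text result: %s\n", PQgetvalue(res, 0, 0));
        }

        PQclear(res);
    }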

regards, tom lane




Re: pgsql: Move tablespace path re-creation from the makefiles to pg_regres

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 5:56 AM Christoph Berg  wrote:
> Maybe creating the tablespace directory from within the testsuite
> would suffice?
>
> CREATE TABLE foo (t text);
> COPY foo FROM PROGRAM 'mkdir @testtablespace@';
> CREATE TABLESPACE regress_tblspace LOCATION '@testtablespace@';

Would that work on Windows?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: default result formats setting

2021-03-24 Thread Robert Haas
On Thu, Nov 5, 2020 at 3:49 PM Peter Eisentraut
 wrote:
> We could also make it a protocol message, but it would essentially
> implement the same thing, just again separately.  And then you'd have no
> support to inspect the current setting, test out different settings
> interactively, etc.  That seems pretty wasteful and complicated for no
> real gain.

But ... if it's just a GUC, it can be set by code on the server side
that the client knows nothing about, breaking the client. That seems
pretty bad to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Tomas Vondra
On 3/24/21 2:36 PM, Dean Rasheed wrote:
> On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
>  wrote:
>>
>> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
>> seem to break the code's assumptions about varnos. This fixes it for me,
>> but I need to look at it more closely.
>>
> 
> I think that makes sense.
> 

AFAIK the primary issue here is that the two places disagree. While
estimate_num_groups does this

varnos = pull_varnos(root, (Node *) varshere);
if (bms_membership(varnos) == BMS_SINGLETON)
{ ... }

the add_unique_group_expr does this

varnos = pull_varnos(root, (Node *) groupexpr);

That is, one place looks at the whole group expression, while the other
looks at the Vars extracted from it by pull_var_clause(). Apparently for
a PlaceHolderVar these can differ, causing the crash.

So we need to change one of those places - my fix tweaked the second
place to also look at the vars, but maybe we should change the other
place? Or maybe it's not the right fix for PlaceHolderVars ...
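
For concreteness, a rough sketch of that tweak to the second place (the
flags mirror what estimate_num_groups already passes to
pull_var_clause(); variable names like relid are illustrative, not taken
from the patch):

    /*
     * Sketch: derive varnos from the Vars extracted by pull_var_clause(),
     * not from the raw group expression, so both call sites agree even
     * when the expression contains PlaceHolderVars.
     */
    List       *varshere;
    Relids      varnos;

    varshere = pull_var_clause((Node *) groupexpr,
                               PVC_RECURSE_AGGREGATES |
                               PVC_RECURSE_WINDOWFUNCS |
                               PVC_RECURSE_PLACEHOLDERS);
    varnos = pull_varnos(root, (Node *) varshere);

    if (bms_membership(varnos) == BMS_SINGLETON)
        relid = bms_singleton_member(varnos);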

> Reviewing the docs, I noticed a couple of omissions, and had a few
> other suggestions (attached).
> 

Thanks! I'll include that in the next version of the patch.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_amcheck contrib application

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 9:12 AM Robert Haas  wrote:
> On Wed, Mar 24, 2021 at 2:13 AM Mark Dilger
>  wrote:
> > The visibility rules fix is different in v11, relying on a visibility check 
> > which more closely follows the implementation of 
> > HeapTupleSatisfiesVacuumHorizon.
>
> Hmm. The header comment you wrote says "If a tuple might not be
> visible to any running transaction, then we must not check it." But, I
> don't find that statement very clear: does it mean "if there could be
> even one transaction to which this tuple is not visible, we must not
> check it"? Or does it mean "if the number of transactions that can see
> this tuple could potentially be zero, then we must not check it"? I
> don't think either of those is actually what we care about. I think
> what we should be saying is "if the tuple could have been inserted by
> a transaction that also added a column to the table, but which
> ultimately did not commit, then the table's current TupleDesc might
> differ from the one used to construct this tuple, so we must not check
> it."

Hit send too soon. And I was wrong, too. Wahoo. Thinking about the
buildfarm failure, I realized that there's a second danger here,
unrelated to the possibility of different TupleDescs, which we talked
about before: if the tuple is dead, we can't safely follow any TOAST
pointers, because the TOAST chunks might disappear at any time. So
technically we could split the return value up into something
three-way: if the inserter is known to have committed, we can check
the tuple itself, because the TupleDesc has to be reasonable. And, if
the tuple is known not to be dead already, and known not to be in a
state where it could become dead while we're doing stuff, we can
follow the TOAST pointer. I'm not sure whether it's worth trying to be
that fancy or not.
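
(As a sketch of the idea, with invented names, the result could be an
enum rather than a bool:)

    /*
     * Illustrative only: separate "safe to check against the current
     * TupleDesc" from "safe to follow TOAST pointers".
     */
    typedef enum TupleCheckLevel
    {
        TUPLE_SKIP,             /* inserter may have aborted mid-DDL */
        TUPLE_CHECK_NO_TOAST,   /* TupleDesc trustworthy, but the tuple could
                                 * become dead, so don't chase TOAST pointers */
        TUPLE_CHECK_FULL        /* tuple cannot become dead under us, so its
                                 * TOAST chunks cannot be pruned away */
    } TupleCheckLevel;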

If we were only concerned about the mismatched-TupleDesc problem, this
function could return true in a lot more cases. Once we get to the
comment that says "Okay, the inserter committed..." we could just
return true. Similarly, the HEAP_MOVED_IN and HEAP_MOVED_OFF cases
could just skip all the interior test and return true, because if the
tuple is being moved, the original inserter has to have committed.
Conversely, however, the !HeapTupleHeaderXminCommitted ->
TransactionIdIsCurrentTransactionId case probably ought to always
return false. One could argue otherwise: if we're the inserter, then
the only in-progress transaction that might have changed the TupleDesc
is us, so we could just consider this case to be a true return value
also, regardless of what's going on with xmax. In that case, we're not
asking "did the inserter definitely commit?" but "are the inserter's
possible DDL changes definitely visible to us?" which might be an OK
definition too.

However, the could-the-TOAST-data-disappear problem is another story.
I don't see how we can answer that question correctly with the logic
you've got here, because you have no XID threshold. Consider the case
where we reach this code:

+   if (!(tuphdr->t_infomask & HEAP_XMAX_COMMITTED))
+   {
+       if (TransactionIdIsInProgress(HeapTupleHeaderGetRawXmax(tuphdr)))
+           return true;    /* HEAPTUPLE_DELETE_IN_PROGRESS */
+       else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuphdr)))
+
+           /*
+            * Not in Progress, Not Committed, so either Aborted or crashed
+            */
+           return true;    /* HEAPTUPLE_LIVE */
+
+       /* At this point the xmax is known committed */
+   }

If we reach the case where the code comment says
HEAPTUPLE_DELETE_IN_PROGRESS, we know that the tuple isn't dead right
now, and so the TOAST tuples aren't dead either. But, by the time we
go try to look at the TOAST tuples, they might have become dead and
been pruned away, because the deleting transaction can commit at any
time, and after that pruning can happen at any time. Our only
guarantee that that won't happen is if the deleting XID is new enough
that it's invisible to some snapshot that our backend has registered.
That's approximately why HeapTupleSatisfiesVacuumHorizon needs to set
*dead_after in this case and one other, and I think you have the same
requirement.

I just noticed that this whole thing has another, related problem:
check_tuple_header_and_visibilty() and check_tuple_attribute() are
called from within check_tuple(), which is called while we hold a
buffer lock on the heap page. We should not be going and doing complex
operations that might take their own buffer locks - like TOAST index
checks - while we're holding an lwlock. That's going to have to be
changed so that the TOAST pointer checking happens after
UnlockReleaseBuffer(); in other words, we'll need to remember the
TOAST pointers to go look up and actually look them up after
UnlockReleaseBuffer(). But, when we do that, then the HEAPTUPLE_LIVE
case above has the 

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

2021-03-24 Thread Tom Lane
Amul Sul  writes:
> On Tue, Mar 23, 2021 at 8:59 PM Tom Lane  wrote:
>> On closer inspection, I believe the true culprit is c6b92041d,
>> which did this:
>> -   heap_sync(state->rs_new_rel);
>> +   smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>> heap_sync was careful about opening rd_smgr, the new code not so much.

> So we also need to make sure of the
> RelationOpenSmgr() call before smgrimmedsync() as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr?  That is, write the above and all
similar code like

smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
if (unlikely(rel->rd_smgr == NULL))
RelationOpenSmgr(rel);
return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.
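
To make the mechanical conversion concrete with an arbitrary example
(smgrnblocks chosen only for illustration), a call site like

    RelationOpenSmgr(rel);
    nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);

would become

    nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);

with the NULL check centralized in the accessor instead of repeated at
every call site.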

regards, tom lane




psql lacking clearerr()

2021-03-24 Thread Alvaro Herrera
psql seems to never call clearerr() on its output file.  So if it gets
an error while printing a result, it'll show

could not print result table: Success

after each and every result, even though the output file isn't in error
state anymore.

It seems that the simplest fix is just to do clearerr() at the start of
printTable(), as in the attached.

I haven't been able to find a good reproducer.  Sometimes doing C-s C-c
does it, but I'm not sure it is fully reproducible.
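
(A tiny standalone illustration of the symptom, unrelated to psql itself
and assuming a Linux-style /dev/full: once the stream error flag is set
it stays set until clearerr(), and with errno back at 0 the report reads
"Success".)

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int
    main(void)
    {
        FILE   *fout = fopen("/dev/full", "w"); /* every write fails: ENOSPC */

        if (!fout)
            return 1;

        fprintf(fout, "some result\n");
        fflush(fout);               /* fails and sets the stream error flag */

        errno = 0;                  /* later, unrelated work succeeds ... */

        if (ferror(fout))           /* ... but the sticky flag still fires */
            fprintf(stderr, "could not print result table: %s\n",
                    strerror(errno));

        clearerr(fout);             /* analogous to the proposed fix at the
                                     * start of printTable() */
        if (!ferror(fout))
            fprintf(stderr, "error indicator cleared\n");

        fclose(fout);
        return 0;
    }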

-- 
Álvaro Herrera   Valdivia, Chile
diff --git a/src/fe_utils/print.c b/src/fe_utils/print.c
index e8772a278c..0d0b5ef782 100644
--- a/src/fe_utils/print.c
+++ b/src/fe_utils/print.c
@@ -3322,6 +3322,9 @@ printTable(const printTableContent *cont,
 	if (cont->opt->format == PRINT_NOTHING)
 		return;
 
+	/* Clear any prior error indicator */
+	clearerr(fout);
+
 	/* print_aligned_*() handle the pager themselves */
 	if (!is_pager &&
 		cont->opt->format != PRINT_ALIGNED &&


Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-24 Thread Magnus Hagander
On Sun, Mar 21, 2021 at 11:34 PM Andres Freund  wrote:
>
> Hi,
>
> On 2021-03-21 12:14:35 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > 1) What kind of consistency do we want from the pg_stats_* views?
> >
> > That's a hard choice to make.  But let me set the record straight:
> > when we did the initial implementation, the stats snapshotting behavior
> > was considered a FEATURE, not an "efficiency hack required by the old
> > storage model".
>
> Oh - sorry for misstating that then. I did try to look for the origins of the
> approach, and all that I found was that it'd be too expensive to do multiple
> stats file reads.
>
>
> > If I understand what you are proposing, all stats views would become
> > completely volatile, without even within-query consistency.  That really
> > is not gonna work.  As an example, you could get not-even-self-consistent
> > results from a join to a stats view if the planner decides to implement
> > it as a nestloop with the view on the inside.
>
> I don't really think it's a problem that's worth incurring that much cost to
> prevent. We already have that behaviour for a number of the pg_stat_* views,
> e.g. pg_stat_xact_all_tables, pg_stat_replication.

Aren't those both pretty bad examples though?

pg_stat_xact_all_tables surely is within-query consistent, and would
be pretty useless if it was within-transaction consistent?

pg_stat_replication is a snapshot of what things are right now (like
pg_stat_activity), and not collected statistics.

Maybe there's inconsistency in that they should've had a different
name to separate them out, but fundamentally having xact-consistent
views there would be a bad thing, no?


> If the cost were low - or we can find a reasonable way to get to low costs - I
> think it'd be worth preserving for backward compatibility's sake alone.  From
> an application perspective, I actually rarely want that behaviour for stats
> views - I'm querying them to get the most recent information, not an older
> snapshot. And in the cases I do want snapshots, I'd want them for longer than 
> a
> transaction.

I agree in general, but I'd want them to be *query-consistent*, not
*transaction-consistent*. But the question is, as you say, whether I'm
willing to pay for that. I'm less certain of that.


> There's just a huge difference between being able to access a table's stats in
> O(1) time, or having a single stats access be O(database-objects).
>
> And that includes accesses to things like pg_stat_bgwriter, pg_stat_database
> (for IO over time stats etc) that often are monitored at a somewhat high
> frequency - they also pay the price of reading in all object stats.  On my
> example database with 1M tables it takes 0.4s to read pg_stat_database.

IMV, singling things out into "larger groups" would be one perfectly
acceptable compromise. That is, say that pg_stat_user_tables can be
inconsistent with pg_stat_bgwriter, but it cannot be inconsistent
with itself.

Basically anything that's "global" seems like it could be treated that
way, independent of each other.

For relations and such, having a way to get just a single relation's
stats, or a number of them that will be consistent with each other,
without getting all of them could also be a reasonable optimization.
Maybe an SRF that takes an array of OIDs as a parameter and returns
consistent data across those, without having to copy/mess with the
rest?


> We currently also fetch the full stats in places like autovacuum.c. Where we
> don't need repeated access to be consistent - we even explicitly force the
> stats to be re-read for every single table that's getting vacuumed.
>
> Even if we to just cache already accessed stats, places like do_autovacuum()
> would end up with a completely unnecessary cache of all tables, blowing up
> memory usage by a large amount on systems with lots of relations.

autovacuum is already dealing with things being pretty fuzzy though,
so it shouldn't matter much there?

But autovacuum might also deserve its own interface to access the
data directly and doesn't have to follow the same one as the stats
views in this new scheme, perhaps?


> > I also believe that the snapshotting behavior has advantages in terms
> > of being able to perform multiple successive queries and get consistent
> > results from them.  Only the most trivial sorts of analysis don't need
> > that.
>
> In most cases you'd not do that in a transaction tho, and you'd need to create
> temporary tables with a snapshot of the stats anyway.

I'd say in most cases this analysis happens in snapshots anyway, and
those are snapshots unrelated to what we do in pg_stat. It's either
snapshotted to tables, or to storage in a completely separate
database.


> > In short, what you are proposing sounds absolutely disastrous for
> > usability of the stats views, and I for one will not sign off on it
> > being acceptable.
>
> :(
>
> That's why I thought it'd be important to bring this up to a wider
> audience. This has been 

Re: PoC/WIP: Extended statistics on expressions

2021-03-24 Thread Dean Rasheed
On Wed, 24 Mar 2021 at 10:22, Tomas Vondra
 wrote:
>
> Thanks, it seems to be some thinko in handling in PlaceHolderVars, which
> seem to break the code's assumptions about varnos. This fixes it for me,
> but I need to look at it more closely.
>

I think that makes sense.

Reviewing the docs, I noticed a couple of omissions, and had a few
other suggestions (attached).

Regards,
Dean
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
new file mode 100644
index dadca67..382cbd7
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -7377,6 +7377,15 @@ SCRAM-SHA-256$iteration
m for most common values (MCV) list statistics
   
  
+
+ 
+  
+   stxexprs pg_node_tree
+  
+  
+   A list of any expressions covered by this statistics object.
+  
+ 
 

   
@@ -7474,6 +7483,16 @@ SCRAM-SHA-256$iteration
pg_mcv_list type
   
  
+
+ 
+  
+   stxdexpr pg_statistic[]
+  
+  
+   Per-expression statistics, serialized as an array of
+   pg_statistic type
+  
+ 
 

   
@@ -12843,7 +12862,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   
The view pg_stats_ext provides access to
-   the information stored in the pg_statistic_ext
and pg_statistic_ext_data
catalogs.  This view allows access only to rows of
@@ -12930,7 +12950,16 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
(references pg_attribute.attname)
   
   
-   Names of the columns the extended statistics is defined on
+   Names of the columns included in the extended statistics
+  
+ 
+
+ 
+  
+   exprs text[]
+  
+  
+   Expressions included in the extended statistics
   
  
 
@@ -13033,7 +13062,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
 
   
The view pg_stats_ext_exprs provides access to
-   the information stored in the pg_statistic_ext
and pg_statistic_ext_data
catalogs.  This view allows access only to rows of
@@ -13119,7 +13149,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_p
expr text
   
   
-   Expression the extended statistics is defined on
+   Expression included in the extended statistics
   
  
 
diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
new file mode 100644
index 5f3aefd..f561599
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -27,7 +27,7 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
 [ ( statistics_kind [, ... ] ) ]
-ON { column_name | ( expression ) } [, ...]
+ON { column_name | ( expression ) }, { column_name | ( expression ) } [, ...]
 FROM table_name
 
 
@@ -45,12 +45,15 @@ CREATE STATISTICS [ IF NOT EXISTS ] 
The CREATE STATISTICS command has two basic forms. The
-   simple variant allows building statistics for a single expression, does
-   not allow specifying any statistics kinds and provides benefits similar
-   to an expression index. The full variant allows defining statistics objects
-   on multiple columns and expressions, and selecting which statistics kinds will
-   be built. The per-expression statistics are built automatically when there
-   is at least one expression.
+   first form allows univariate statistics for a single expression to be
+   collected, providing benefits similar to an expression index without the
+   overhead of index maintenance.  This form does not allow the statistics
+   kind to be specified, since the various statistics kinds refer only to
+   multivariate statistics.  The second form of the command allows
+   multivariate statistics on multiple columns and/or expressions to be
+   collected, optionally specifying which statistics kinds to include.  This
+   form will also automatically cause univariate statistics to be collected on
+   any expressions included in the list.
   
 
   
@@ -93,16 +96,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] statistics_kind
 
  
-  A statistics kind to be computed in this statistics object.
+  A multivariate statistics kind to be computed in this statistics object.
   Currently supported kinds are
   ndistinct, which enables n-distinct statistics,
   dependencies, which enables functional
   dependency statistics, and mcv which enables
   most-common values lists.
   If this clause is omitted, all supported statistics kinds are
-  included in the statistics object. Expression statistics are built
-  automatically when the statistics definition includes complex
-  expressions and not just simple column references.
+  included in the statistics object. Univariate expression statistics are
+  built automatically if the statistics definition includes any complex
+  expressions rather than just simple column references.
   For more information, see 
   and .
  
@@ -114,8 +117,9 @@ CREATE STATISTICS [ IF NOT EXISTS ] 
 

Re: shared memory stats: high level design decisions: consistency, dropping

2021-03-24 Thread Magnus Hagander
On Tue, Mar 23, 2021 at 4:21 AM Greg Stark  wrote:
>
> On Sun, 21 Mar 2021 at 18:16, Stephen Frost  wrote:
> >
> > Greetings,
> >
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > I also believe that the snapshotting behavior has advantages in terms
> > > of being able to perform multiple successive queries and get consistent
> > > results from them.  Only the most trivial sorts of analysis don't need
> > > that.
> > >
> > > In short, what you are proposing sounds absolutely disastrous for
> > > usability of the stats views, and I for one will not sign off on it
> > > being acceptable.
> > >
> > > I do think we could relax the consistency guarantees a little bit,
> > > perhaps along the lines of only caching view rows that have already
> > > been read, rather than grabbing everything up front.  But we can't
> > > just toss the snapshot concept out the window.  It'd be like deciding
> > > that nobody needs MVCC, or even any sort of repeatable read.
> >
> > This isn't the same use-case as traditional tables or relational
> > concepts in general- there aren't any foreign keys for the fields that
> > would actually be changing across these accesses to the shared memory
> > stats- we're talking about gross stats numbers like the number of
> > inserts into a table, not an employee_id column.  In short, I don't
> > agree that this is a fair comparison.
>
> I use these stats quite a bit and do lots of slicing and dicing with
> them. I don't think it's as bad as Tom says but I also don't think we
> can be quite as loosey-goosey as I think Andres or Stephen might be
> proposing either (though I note they haven't said they don't want any
> consistency at all).
>
> The cases where the consistency really matter for me is when I'm doing
> math involving more than one statistic.
>
> Typically that's ratios. E.g. with pg_stat_*_tables I routinely divide
> seq_tup_read by seq_scan or idx_tup_* by idx_scans. I also often look
> at the ratio between n_tup_upd and n_tup_hot_upd.
>
> And no, it doesn't help that these are often large numbers after a
> long time because I'm actually working with the first derivative of
> these numbers using snapshots or a time series database. So if you
> have the seq_tup_read incremented but not seq_scan incremented you
> could get a wildly incorrect calculation of "tup read per seq scan"
> which actually matters.
>
> I don't think I've ever done math across stats for different objects.
> I mean, I've plotted them together and looked at which was higher but
> I don't think that's affected by some plots having peaks slightly out
> of sync with the other. I suppose you could look at the ratio of
> access patterns between two tables and know that they're only ever
> accessed by a single code path at the same time and therefore the
> ratios would be meaningful. But I don't think users would be surprised
> to find they're not consistent that way either.

Yeah, it's important to differentiate whether things can be inconsistent
within a single object, or just between objects. And I agree that in a
lot of cases, just having per-object consistent data is probably
enough.

Normally when you graph things, for example, your peaks will span more
than one sample point anyway, and in that case it doesn't much matter,
does it?

But if we said we try to offer per-object consistency only, then for
example the idx_scans value in the tables view may see changes to some
but not all indexes on that table. Would that be acceptable?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: multi-install PostgresNode

2021-03-24 Thread Andrew Dunstan


On 3/24/21 8:29 AM, Alvaro Herrera wrote:
> On 2021-Mar-24, Dagfinn Ilmari Mannsåker wrote:
>
>> I think it would be even neater having a method that returns the desired
>> environment and then have the other methods do:
>>
>>   local %ENV = $self->_get_install_env();
> Hmm, is it possible to integrate PGHOST and PGPORT handling into this
> too?  Seems like it is, so the name of the routine should be something
> more general (and also it should not have the quick "return unless
> $inst").
>


If we're going to do that we probably shouldn't special case any
particular settings, but simply take any extra arguments as extra env
settings. And if any setting has undef (e.g. PGAPPNAME) we should unset it.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_amcheck contrib application

2021-03-24 Thread Robert Haas
On Wed, Mar 24, 2021 at 2:13 AM Mark Dilger
 wrote:
> The visibility rules fix is different in v11, relying on a visibility check 
> which more closely follows the implementation of 
> HeapTupleSatisfiesVacuumHorizon.

Hmm. The header comment you wrote says "If a tuple might not be
visible to any running transaction, then we must not check it." But, I
don't find that statement very clear: does it mean "if there could be
even one transaction to which this tuple is not visible, we must not
check it"? Or does it mean "if the number of transactions that can see
this tuple could potentially be zero, then we must not check it"? I
don't think either of those is actually what we care about. I think
what we should be saying is "if the tuple could have been inserted by
a transaction that also added a column to the table, but which
ultimately did not commit, then the table's current TupleDesc might
differ from the one used to construct this tuple, so we must not check
it."

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Nicer error when connecting to standby with hot_standby=off

2021-03-24 Thread James Coleman
On Wed, Mar 24, 2021 at 5:55 AM Fujii Masao  wrote:
>
>
>
> On 2021/03/24 16:59, Alvaro Herrera wrote:
> > On 2021-Mar-24, Fujii Masao wrote:
> >
> >> On 2021/03/24 5:59, Tom Lane wrote:
> >>> Alvaro Herrera  writes:
>  FATAL:  the database system is starting up
>  DETAIL:  WAL is being applied to recover from a system crash.
>  or
>  DETAIL:  The system is applying WAL to recover from a system crash.
>  or
>  DETAIL:  The startup process is applying WAL to recover from a system 
>  crash.
> >>>
> >>> I don't think the postmaster has enough context to know if that's
> >>> actually true.  It just launches the startup process and waits for
> >>> results.  If somebody saw this during a normal (non-crash) startup,
> >>> they'd be justifiably alarmed.
> >>
> >> Yes, so logging "the database system is starting up" seems enough to me.
> >
> > No objection.
>
> Thanks! So I changed the message reported at PM_STARTUP to that one,
> based on the v8 patch that James posted upthread. I also ran pgindent for
> the patch. Attached is the updated version of the patch.
>
> Barring any objection, I will commit this.

That looks good to me. Thanks for working on this.

James Coleman




Re: default result formats setting

2021-03-24 Thread Emre Hasegeli
I think this is a good feature that would be useful to JDBC and more.

I don't know the surrounding code very well, but the patch looks good to me.

I agree with Tom Lane that the name of the variable is too verbose.
Maybe "auto_binary_types" is enough.  Do we gain much by prefixing
"result_format_"?  Wouldn't we use the same variable, if we support
binary inputs one day?

It is nice that the patch comes with the test module.  The name
"libpq_extended" sounds a bit vague to me.  Maybe it's a better idea
to call it "libpq_result_format" and test "format=1" in it as well.

My last nitpick about the names is the "test-result-format"
command.  All the rest of the test modules name their commands with
underscores.  It would be nicer if this one did the same.

There is one place that needs to be updated on the Makefile of the test:

> +subdir = src/test/modules/libpq_pipeline

s/pipeline/extended/

Then the test runs successfully.



