date:20201221

On Mon, Dec 21, 2020 at 04:35:14PM -0500, Bruce Momjian wrote:
> On Sun, Dec 20, 2020 at 05:59:09PM -0500, Stephen Frost wrote:
> OK, here it the updated patch.  The major change, as Stephen pointed
> out, is that the cluser_key_command (was cluster_passphrase_command) now
> returns a hex-string representing the 32-byte KEK, rather than having
> the server hash whatever the command used to return.  This avoids
> double-hashing in cases where you are _not_ using a passphrase, but are
> computing a random 32-byte value in the script.  This does mean the
> script has to run sha256 for passphrases, but it seems like a win. 
> Updated script are attached too.

Attached is the script patch.  It is also at:


https://github.com/postgres/postgres/compare/master...bmomjian:cfe-sh.diff

I think it still needs docs but those will have to be done after the
code doc patch is added.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

diff --git a/src/backend/utils/auth_key_cmds/Makefile b/src/backend/utils/auth_key_cmds/Makefile
new file mode 100644
index ...e0bee97
*** a/src/backend/utils/auth_key_cmds/Makefile
--- b/src/backend/utils/auth_key_cmds/Makefile
***
*** 0 
--- 1,41 
+ #-
+ #
+ # Makefile for src/backend/utils/auth_key_cmds
+ #
+ # src/backend/utils/auth_key_cmds/Makefile
+ #
+ #-
+ 
+ PGFILEDESC = "auth_key_cmds - authentication key commands"
+ PGAPPICON = win32
+ 
+ subdir = src/backend/utils/auth_key_cmds
+ top_builddir = ../../../..
+ include $(top_builddir)/src/Makefile.global
+ 
+ SCRIPTS = ckey_aws.sh.sample \
+ 	ckey_direct.sh.sample \
+ 	ckey_passphrase.sh.sample \
+ 	ckey_piv_nopin.sh.sample  \
+ 	ckey_piv_pin.sh.sample \
+ 	ssl_paasphrase.sh.sample
+ 
+ SCRIPTDIR=auth_key_cmds
+ 
+ install: all installdirs
+ 	$(INSTALL_DATA) $(SCRIPTS) '$(DESTDIR)$(datadir)/$(SCRIPTDIR)'
+ 
+ installdirs:
+ 	$(MKDIR_P) '$(DESTDIR)$(datadir)' '$(DESTDIR)$(datadir)/$(SCRIPTDIR)'
+ 
+ uninstall:
+ 	@set -e; \
+ 	set $(SCRIPT) ; \
+ 	while [ "$$#" -gt 0 ] ; \
+ 	do \
+ 		script=$$1; shift;  \
+ 		rm -f '$(DESTDIR)$(datadir)/$(SCRIPTDIR)/'$${lang}.stop ; \
+ 	done
+ 
+ clean distclean maintainer-clean: clean-lib
+ 	rm -f $(OBJS) $(SQLSCRIPT)
diff --git a/src/backend/utils/auth_key_cmds/ckey_aws.sh.sample b/src/backend/utils/auth_key_cmds/ckey_aws.sh.sample
new file mode 100755
index ...0341621
*** a/src/backend/utils/auth_key_cmds/ckey_aws.sh.sample
--- b/src/backend/utils/auth_key_cmds/ckey_aws.sh.sample
***
*** 0 
--- 1,50 
+ #!/bin/sh
+ 
+ # This uses the AWS Secrets Manager using the AWS CLI and OpenSSL.
+ 
+ [ "$#" -ne 1 ] && echo "cluster_key_command usage: $0 \"%d\"" 1>&2 && exit 1
+ # No need for %R or -R since we are not prompting
+ 
+ DIR="$1"
+ [ ! -e "$DIR" ] && echo "$DIR does not exist" 1>&2 && exit 1
+ [ ! -d "$DIR" ] && echo "$DIR is not a directory" 1>&2 && exit 1
+ 
+ # File containing the id of the AWS secret
+ AWS_ID_FILE="$DIR/aws-secret.id"
+ 
+ 
+ # --
+ 
+ 
+ # Create an AWS Secrets Manager secret?
+ if [ ! -e "$AWS_ID_FILE" ]
+ then	# The 'postgres' operating system user must have permission to
+ 	# access the AWS CLI
+ 
+ 	# The epoch-time/directory/hostname combination is unique
+ 	HASH=$(echo -n "$(date '+%s')$DIR$(hostname)" | sha1sum | cut -d' ' -f1)
+ 	AWS_SECRET_ID="Postgres-cluster-key-$HASH"
+ 
+ 	# Use stdin to avoid passing the secret on the command line
+ 	openssl rand -hex 32 |
+ 	aws secretsmanager create-secret \
+ 		--name "$AWS_SECRET_ID" \
+ 		--description 'Used for Postgres cluster file encryption' \
+ 		--secret-string 'file:///dev/stdin' \
+ 		--output text > /dev/null
+ 	if [ "$?" -ne 0 ]
+ 	then	echo 'cluster key generation failed' 1>&2
+ 		exit 1
+ 	fi
+ 
+ 	echo "$AWS_SECRET_ID" > "$AWS_ID_FILE"
+ fi
+ 
+ if ! aws secretsmanager get-secret-value \
+ 	--secret-id "$(cat "$AWS_ID_FILE")" \
+ 	--output text
+ then	echo 'cluster key retrieval failed' 1>&2
+ 	exit 1
+ fi | awk -F'\t' 'NR == 1 {print $4}'
+ 
+ exit 0
diff --git a/src/backend/utils/auth_key_cmds/ckey_direct.sh.sample b/src/backend/utils/auth_key_cmds/ckey_direct.sh.sample
new file mode 100755
index ...4a56cc3
*** a/src/backend/utils/auth_key_cmds/ckey_direct.sh.sample
--- b/src/backend/utils/auth_key_cmds/ckey_direct.sh.sample
***
*** 0 
--- 1,37 
+ #!/bin/sh
+ 
+ # This uses a key supplied by the user
+ # If OpenSSL is installed, you can generate a pseudo-random key by running:
+ #	openssl rand -hex 32
+ # To get a true random key, run:
+ #	wget -q -O - 'https://www.random.org/cgi-bin/randbyte?nbytes=32&format=h' | tr -d ' \n'; echo
+ 
+ [ "$#" -lt 1 ] && echo "cluster_key_command usage: $0 %R [\"%P\"]" 1>&2 &&

RE: [Patch] Optimize dropping of relation buffers using dlist

From: Kyotaro Horiguchi 
> Mmm. If that is true, doesn't the unoptimized path also need the
> rechecking?

Yes, the traditional processing does the recheck after acquiring the buffer 
header spinlock.


Regards
Takayuki Tsunakawa

Re: [Patch] Optimize dropping of relation buffers using dlist

At Tue, 22 Dec 2020 02:48:22 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
> 
> Ouch, right.  (The year-end business must be making me crazy...)
> 
> So, there are two choices here:
> 
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> lwlock, and eliminate the buffer tag comparison as follows:
> 
>   BufTableLookup();
>   LockBufHdr();
>   LWLockRelease();
>   InvalidateBuffer();
> 
> I think both are okay.  If I must choose either, I kind of prefer 1), because 
> LWLockRelease() could take longer time to wake up other processes waiting on 
> the lwlock, which is not very good to do while holding a spinlock.

I like, as said before, the current patch.

regareds.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 22, 2020 at 8:12 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila  
> wrote in
>
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> I think exactly like that.  If we acquire the bufHdr lock before
> releasing the partition lock, that steal doesn't happen but it doesn't
> seem good as a locking protocol.
>

Right, so let's keep the code as it is but I feel it is better to add
some comments explaining the rationale behind this code.

-- 
With Regards,
Amit Kapila.

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 22, 2020 at 8:18 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > Why would all client backends wait for AccessExclusive lock on this
> > relation? Say, a client needs a buffer for some other relation and
> > that might evict this buffer after we release the lock on the
> > partition. In StrategyGetBuffer, it is important to either have a pin
> > on the buffer or the buffer header itself must be locked to avoid
> > getting picked as victim buffer. Am I missing something?
>
> Ouch, right.  (The year-end business must be making me crazy...)
>
> So, there are two choices here:
>
> 1) The current patch.
> 2) Acquire the buffer header spinlock before releasing the buffer mapping 
> lwlock, and eliminate the buffer tag comparison as follows:
>
>   BufTableLookup();
>   LockBufHdr();
>   LWLockRelease();
>   InvalidateBuffer();
>
> I think both are okay.  If I must choose either, I kind of prefer 1), because 
> LWLockRelease() could take longer time to wake up other processes waiting on 
> the lwlock, which is not very good to do while holding a spinlock.
>
>

I also prefer (1). I will add some comments about the locking protocol
in the next version of the patch.

-- 
With Regards,
Amit Kapila.

RE: [Patch] Optimize dropping of relation buffers using dlist

From: Amit Kapila 
> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

Ouch, right.  (The year-end business must be making me crazy...)

So, there are two choices here:

1) The current patch.
2) Acquire the buffer header spinlock before releasing the buffer mapping 
lwlock, and eliminate the buffer tag comparison as follows:

  BufTableLookup();
  LockBufHdr();
  LWLockRelease();
  InvalidateBuffer();

I think both are okay.  If I must choose either, I kind of prefer 1), because 
LWLockRelease() could take longer time to wake up other processes waiting on 
the lwlock, which is not very good to do while holding a spinlock.


Regards
Takayuki Tsunakawa

Re: [Patch] Optimize dropping of relation buffers using dlist

At Tue, 22 Dec 2020 08:08:10 +0530, Amit Kapila  wrote 
in 
> On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Amit Kapila 
> > > This answers the second part of the question but what about the first
> > > part (We hold a buffer partition lock, and have done a lookup in th
> > > mapping table. Why are we then rechecking the
> > > relfilenode/fork/blocknum?)
> > >
> > > I think we don't need such a check, rather we can have an Assert
> > > corresponding to that if-condition in the patch. I understand it is
> > > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > > of the code.
> >
> > Hmm, you're right. I thought someone else could steal the found buffer and 
> > use it for another block because the buffer mapping lwlock is released 
> > without pinning the buffer or acquiring the buffer header spinlock.
> >
> 
> Okay, I see your point.
> 
> >  However, in this case (replay of TRUNCATE during recovery), nobody steals 
> > the buffer: bgwriter or checkpointer doesn't use a buffer for a new block, 
> > and the client backend waits for AccessExclusive lock.
> >
> >

I understood that you are thinking that the rechecking is useless.

> Why would all client backends wait for AccessExclusive lock on this
> relation? Say, a client needs a buffer for some other relation and
> that might evict this buffer after we release the lock on the
> partition. In StrategyGetBuffer, it is important to either have a pin
> on the buffer or the buffer header itself must be locked to avoid
> getting picked as victim buffer. Am I missing something?

I think exactly like that.  If we acquire the bufHdr lock before
releasing the partition lock, that steal doesn't happen but it doesn't
seem good as a locking protocol.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Patch] Optimize dropping of relation buffers using dlist

At Tue, 22 Dec 2020 01:42:55 +, "tsunakawa.ta...@fujitsu.com" 
 wrote in 
> From: Amit Kapila 
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in th
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> > 
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
> 
> Hmm, you're right.  I thought someone else could steal the found
> buffer and use it for another block because the buffer mapping
> lwlock is released without pinning the buffer or acquiring the
> buffer header spinlock.  However, in this case (replay of TRUNCATE
> during recovery), nobody steals the buffer: bgwriter or checkpointer
> doesn't use a buffer for a new block, and the client backend waits
> for AccessExclusive lock.

Mmm. If that is true, doesn't the unoptimized path also need the
rechecking?

The AEL doesn't work for a buffer block. No new block can be allocted
for the relation but still BufferAlloc can steal the block for other
relations since the AEL doesn't work for each buffer block.  Am I
still missing something?


> > I have started doing minor edits to the patch especially planning to
> > write a theory why is this optimization safe and here is what I can
> > come up with:
> 
> Thank you, that's fluent and easier to understand.

+1

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [Patch] Optimize dropping of relation buffers using dlist

On Tue, Dec 22, 2020 at 7:13 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > This answers the second part of the question but what about the first
> > part (We hold a buffer partition lock, and have done a lookup in th
> > mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum?)
> >
> > I think we don't need such a check, rather we can have an Assert
> > corresponding to that if-condition in the patch. I understand it is
> > safe to compare relfilenode/fork/blocknum but it might confuse readers
> > of the code.
>
> Hmm, you're right. I thought someone else could steal the found buffer and 
> use it for another block because the buffer mapping lwlock is released 
> without pinning the buffer or acquiring the buffer header spinlock.
>

Okay, I see your point.

>  However, in this case (replay of TRUNCATE during recovery), nobody steals 
> the buffer: bgwriter or checkpointer doesn't use a buffer for a new block, 
> and the client backend waits for AccessExclusive lock.
>
>

Why would all client backends wait for AccessExclusive lock on this
relation? Say, a client needs a buffer for some other relation and
that might evict this buffer after we release the lock on the
partition. In StrategyGetBuffer, it is important to either have a pin
on the buffer or the buffer header itself must be locked to avoid
getting picked as victim buffer. Am I missing something?

-- 
With Regards,
Amit Kapila.

Re: zstd compression for pg_dump

2020-12-21 Thread Justin Pryzby

On Mon, Dec 21, 2020 at 03:02:40PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > I found that our largest tables are 40% smaller and 20% faster to pipe
> > pg_dump -Fc -Z0 |zstd relative to native zlib
> 
> The patch might be a tad smaller if you hadn't included a core file in it.

About 89% smaller.

This also fixes the extension (.zst)
And fixes zlib default compression.
And a bunch of cleanup.

-- 
Justin
>From c506c8a93ae70726a5b4b33a1d0098caf5665f3a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 21 Dec 2020 00:32:32 -0600
Subject: [PATCH 1/7] fix pre-existing docs/comments

---
 doc/src/sgml/ref/pg_dump.sgml | 2 +-
 src/bin/pg_dump/pg_backup_directory.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..dcb25dc3cd 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -621,7 +621,7 @@ PostgreSQL documentation
   

 Specify the compression level to use.  Zero means no compression.
-For the custom archive format, this specifies compression of
+For the custom and directory archive formats, this specifies compression of
 individual table-data segments, and the default is to compress
 at a moderate level.
 For plain text output, setting a nonzero compression level causes
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 48fa7cb1a3..650b542fce 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -4,7 +4,7 @@
  *
  *	A directory format dump is a directory, which contains a "toc.dat" file
  *	for the TOC, and a separate file for each data entry, named ".dat".
- *	Large objects (BLOBs) are stored in separate files named "blob_.dat",
+ *	Large objects (BLOBs) are stored in separate files named "blob_.dat",
  *	and there's a plain-text TOC file for them called "blobs.toc". If
  *	compression is used, each data file is individually compressed and the
  *	".gz" suffix is added to the filenames. The TOC files are never
-- 
2.17.0

>From 49f400b5b2e559cc2ed6be82e04a4dc4ba5c63b3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 12 Dec 2020 00:11:37 -0600
Subject: [PATCH 2/7] Fix malformed comment..

broken since bf50caf10.
---
 src/bin/pg_dump/pg_backup_archiver.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 177360ed6e..6a5a22637b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -329,9 +329,12 @@ struct _archiveHandle
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
-	int			compression;	/* Compression requested on open Possible
- * values for compression: -1
- * Z_DEFAULT_COMPRESSION 0	COMPRESSION_NONE
+	int			compression;	/*-
+ * Compression requested on open
+ * Possible values for compression:
+ * -2	ZSTD_COMPRESSION
+ * -1	Z_DEFAULT_COMPRESSION
+ *  0	COMPRESSION_NONE
  * 1-9 levels for gzip compression */
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
-- 
2.17.0

>From 8a8939a3060c984af289b5fe62754d94f2675248 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 11 Dec 2020 15:01:25 -0600
Subject: [PATCH 3/7] pg_dump: zstd compression

document any change in search for .gz?
docs
---
 configure | 123 ++-
 configure.ac  |  22 ++
 src/bin/pg_dump/compress_io.c | 475 --
 src/bin/pg_dump/compress_io.h |   5 +-
 src/bin/pg_dump/pg_backup_archiver.h  |   5 +
 src/bin/pg_dump/pg_backup_directory.c |   6 +-
 src/bin/pg_dump/pg_dump.c |   3 +-
 src/include/pg_config.h.in|   3 +
 8 files changed, 597 insertions(+), 45 deletions(-)

diff --git a/configure b/configure
index 11a4284e5b..240e536e04 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ with_gnu_ld
 LD
 LDFLAGS_SL
 LDFLAGS_EX
+with_zstd
 with_zlib
 with_system_tzdata
 with_libxslt
@@ -798,6 +799,7 @@ infodir
 docdir
 oldincludedir
 includedir
+runstatedir
 localstatedir
 sharedstatedir
 sysconfdir
@@ -866,6 +868,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 '
@@ -935,6 +938,7 @@ datadir='${datarootdir}'
 sysconfdir='${prefix}/etc'
 sharedstatedir='${prefix}/com'
 localstatedir='${prefix}/var'
+runstatedir='${localstatedir}/run'
 includedir='${prefix}/include'
 oldincludedir='/usr/include'
 docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
@@ -1187,6 +1191,15 @@ do
   | -silent | --silent | --silen | --sile | --sil)
 silent=yes ;;
 
+  -runstatedir | --runstatedir | --runstatedi | --runstated \
+  | --run

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-12-21 Thread Masahiro Ikeda


Thanks for your comments.

On 2020-12-22 09:39, Andres Freund wrote:

Hi,

On 2020-12-21 13:16:50 -0800, Andres Freund wrote:

On 2020-12-02 13:52:43 +0900, Fujii Masao wrote:
> Pushed. Thanks!

Why are wal_records/fpi long, instead of uint64?
longwal_records;/* # of WAL records produced */
longwal_fpi;/* # of WAL full page images 
produced */
uint64  wal_bytes;  /* size of WAL records produced 
*/

long is only 4 byte e.g. on windows, and it is entirely possible to 
wrap

a 4 byte record counter. It's also somewhat weird that wal_bytes is
unsigned, but the others are signed?

This is made doubly weird because on the SQL level you chose to make
wal_records, wal_fpi bigint. And wal_bytes numeric?


I'm sorry I don't know the reason.

The following thread is related to the patch and the type of wal_bytes
is changed from long to uint64 because XLogRecPtr is uint64.
https://www.postgresql.org/message-id/flat/20200402144438.GF64485%40nol#1f0127c98df430104c63426fdc285c20

I assumed that the reason why the type of wal_records/fpi is long
is BufferUsage have the members (i.e, shared_blks_hit) of the same 
types.


So, I think it's better if to change the type of wal_records/fpi from 
long to uint64,

to change the types of BufferUsage's members too.



Some more things:
- There's both PgStat_MsgWal WalStats; and static PgStat_WalStats 
walStats;

  that seems *WAY* too confusing. And the former imo shouldn't be
  global.


Sorry for the confusing name.
I referenced the following variable name.

 static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];
 static PgStat_SLRUStats slruStats[SLRU_NUM_ELEMENTS];

How about to change from "PgStat_MsgWal WalStats"
to "PgStat_MsgWal MsgWalStats"?

Is it better to change the following name too?
 "PgStat_MsgBgWriter BgWriterStats;"
 "static PgStat_MsgSLRU SLRUStats[SLRU_NUM_ELEMENTS];"

Since PgStat_MsgWal is called in xlog.c and pgstat.c,
I thought it's should be global.


- AdvanceXLInsertBuffer() does WalStats.m_wal_buffers_full, but as far
  as I can tell there's nothing actually sending that?


IIUC, when pgstat_send_wal() is called by backends and so on,
it is sent to the statistic collector and it is exposed via pg_stat_wal 
view.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

RE: [Patch] Optimize dropping of relation buffers using dlist

From: Amit Kapila 
> This answers the second part of the question but what about the first
> part (We hold a buffer partition lock, and have done a lookup in th
> mapping table. Why are we then rechecking the
> relfilenode/fork/blocknum?)
> 
> I think we don't need such a check, rather we can have an Assert
> corresponding to that if-condition in the patch. I understand it is
> safe to compare relfilenode/fork/blocknum but it might confuse readers
> of the code.

Hmm, you're right.  I thought someone else could steal the found buffer and use 
it for another block because the buffer mapping lwlock is released without 
pinning the buffer or acquiring the buffer header spinlock.  However, in this 
case (replay of TRUNCATE during recovery), nobody steals the buffer: bgwriter 
or checkpointer doesn't use a buffer for a new block, and the client backend 
waits for AccessExclusive lock.


> I have started doing minor edits to the patch especially planning to
> write a theory why is this optimization safe and here is what I can
> come up with:

Thank you, that's fluent and easier to understand.


Regards
Takayuki Tsunakawa

Re: Better client reporting for "immediate stop" shutdowns

2020-12-21 Thread Bharath Rupireddy

On Tue, Dec 22, 2020 at 3:13 AM Tom Lane  wrote:
> Up to now, if you shut down the database with "pg_ctl stop -m immediate"
> then clients get a scary message claiming that something has crashed,
> because backends can't tell whether the SIGQUIT they got was sent for
> a crash-and-restart situation or because of an immediate-stop command.
>
> This isn't great from a fit-and-finish perspective, and it occurs to me
> that it's really easy to do better: the postmaster can stick a flag
> into shared memory explaining the reason for SIGQUIT.  While we don't
> like the postmaster touching shared memory, there doesn't seem to be
> any need for interlocking or anything like that, so there is no risk
> involved that's greater than those already taken by pmsignal.c.

+1 to improve the message.

> So, here's a very simple proposed patch.  Some issues for possible
> bikeshedding:
>
> * Up to now, pmsignal.c has only been for child-to-postmaster
> communication, so maybe there is some better place to put the
> support code.  I can't think of one though.

+1 to have it here as we already have the required shared memory
initialization code to add in new flags there -
PMSignalState->sigquit_reason.

If I'm correct, quickdie() doesn't access any shared memory because
one of the reason we can be in quickdie() is when the shared memory
itself is corrupted(the comment down below on why we don't call
roc_exit() or atexit() says), in such case, will GetQuitSignalReason()
have some problem in accessing the shared memory i.e. +return
PMSignalState->sigquit_reason;?

> * I chose to report the same message for immediate shutdown as we
> already use for SIGTERM (fast shutdown or pg_terminate_backend()).
> Should it be different, and if so what?

AFAIK, errmsg(terminating connection due to administrator command") is
emitted when there's no specific reason. But we know exactly why we
are terminating in this case, how about having an error message like
errmsg("terminating connection due to immediate shutdown request")));
? There are other places where errmsg("terminating connection due to
 reasons"); is used. We also log messages when an immediate
shutdown request is received errmsg("received immediate shutdown
request").

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Deadlock between backend and recovery may not be detected

2020-12-21 Thread Masahiko Sawada

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao  wrote:
>
>
>
> On 2020/12/17 2:15, Fujii Masao wrote:
> >
> >
> > On 2020/12/16 23:28, Drouvot, Bertrand wrote:
> >> Hi,
> >>
> >> On 12/16/20 2:36 PM, Victor Yegorov wrote:
> >>>
> >>> *CAUTION*: This email originated from outside of the organization. Do not 
> >>> click links or open attachments unless you can confirm the sender and 
> >>> know the content is safe.
> >>>
> >>>
> >>> ср, 16 дек. 2020 г. в 13:49, Fujii Masao  >>> >:
> >>>
> >>> After doing this procedure, you can see the startup process and 
> >>> backend
> >>> wait for the table lock each other, i.e., deadlock. But this deadlock 
> >>> remains
> >>> even after deadlock_timeout passes.
> >>>
> >>> This seems a bug to me.
> >>>
> >> +1
> >>
> >>>
> >>> > * Deadlocks involving the Startup process and an ordinary backend 
> >>> process
> >>> > * will be detected by the deadlock detector within the ordinary 
> >>> backend.
> >>>
> >>> The cause of this issue seems that ResolveRecoveryConflictWithLock() 
> >>> that
> >>> the startup process calls when recovery conflict on lock happens 
> >>> doesn't
> >>> take care of deadlock case at all. You can see this fact by reading 
> >>> the above
> >>> source code comment for ResolveRecoveryConflictWithLock().
> >>>
> >>> To fix this issue, I think that we should enable 
> >>> STANDBY_DEADLOCK_TIMEOUT
> >>> timer in ResolveRecoveryConflictWithLock() so that the startup 
> >>> process can
> >>> send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
> >>> Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
> >>> the backend should check whether the deadlock actually happens or not.
> >>> Attached is the POC patch implimenting this.
> >>>
> >> good catch!
> >>
> >> I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't 
> >> be set in ResolveRecoveryConflictWithLock() too (it is already set in 
> >> ResolveRecoveryConflictWithBufferPin()).
> >>
> >> So + 1 to consider this as a bug and for the way the patch proposes to fix 
> >> it.
> >
> > Thanks Victor and Bertrand for agreeing!
> > Attached is the updated version of the patch.
>
> Attached is v3 of the patch. Could you review this version?
>
> While the startup process is waiting for recovery conflict on buffer pin,
> it repeats sending the request for deadlock check to all the backends
> every deadlock_timeout. This may increase the workload in the startup
> process and backends, but since this is the original behavior, the patch
> doesn't change that. Also in practice this may not be so harmful because
> the period that the buffer is kept pinned is basically not so long.
>

@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
 */
ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+   /*
+* Send out a request for hot-standby backends to check themselves for
+* deadlocks.
+*
+* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+* to be signaled by UnpinBuffer() again and send a request for
+* deadlocks check if deadlock_timeout happens. This causes the
+* request to continue to be sent every deadlock_timeout until the
+* buffer is unpinned or ltime is reached. This would increase the
+* workload in the startup process and backends. In practice it may
+* not be so harmful because the period that the buffer is kept pinned
+* is basically no so long. But we should fix this?
+*/
+   SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+   got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends every backend who is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise

Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-12-21 Thread Tomas Vondra


Hi,

I've pushed (and backpatched) all the fixes posted to this thread. I 
believe that covers all the incremental sort fixes, so I've marked [1] 
as committed.


[1] https://commitfest.postgresql.org/31/2754/


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

RE: libpq debug log

From: k.jami...@fujitsu.com 
> I understand that protocol 2.0 is still supported, but it is only used for
> PostgreSQL versions 7.3 and earlier, which is not updated by fixes anymore
> since we only backpatch up to previous 5 versions. However I am not sure if
> it's a good idea, but how about if we just support this feature for protocol 
> 3.

+1
I thought psqlODBC (still) allows the user to choose the protocol version, it 
looks like the current psqlODBC disallows pre-3 protocol:

[libpq_connect()]
MYLOG(0, "libpq connection to the database established.\n");
pversion = PQprotocolVersion(pqconn);
if (pversion < 3)
{
MYLOG(0, "Protocol version %d is not supported\n", pversion);
goto cleanup;
}


> There are similar chunks of code in fe-exec.c, PQsendPrepare(),
> PQsendDescribe(),
> etc. that already do something similar, so I just thought in hindsight if we 
> can
> do the same.
>   if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
>   {
> 
> }
> else
> {
> /* Protocol 2.0 isn't supported */
>  printfPQExpBuffer(&conn->errorMessage,
>libpq_gettext("function requires at least protocol
> version 3.0\n"));
>  return 0;
> }


I haven't looked at the patch and don't know the code structure, but I want the 
code clutter to be minimized.  So, I think we should avoid putting if 
statements like above here and there.

Plus, I don't think it's not necessary to fail the processing when the protocol 
version is 2; we can just stop the trace output.  So, how about disabling the 
trace output silently if the protocol turns out to be < 3 upon connection?


Regards
Takayuki Tsunakawa

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-12-21 Thread Andres Freund

Hi,

On 2020-12-21 13:16:50 -0800, Andres Freund wrote:
> On 2020-12-02 13:52:43 +0900, Fujii Masao wrote:
> > Pushed. Thanks!
>
> Why are wal_records/fpi long, instead of uint64?
>   longwal_records;/* # of WAL records produced */
>   longwal_fpi;/* # of WAL full page images 
> produced */
>   uint64  wal_bytes;  /* size of WAL records produced 
> */
>
> long is only 4 byte e.g. on windows, and it is entirely possible to wrap
> a 4 byte record counter. It's also somewhat weird that wal_bytes is
> unsigned, but the others are signed?
>
> This is made doubly weird because on the SQL level you chose to make
> wal_records, wal_fpi bigint. And wal_bytes numeric?

Some more things:
- There's both PgStat_MsgWal WalStats; and static PgStat_WalStats walStats;
  that seems *WAY* too confusing. And the former imo shouldn't be
  global.
- AdvanceXLInsertBuffer() does WalStats.m_wal_buffers_full, but as far
  as I can tell there's nothing actually sending that?

- Andres

Re: Proposed patch for key managment

On Mon, Dec 21, 2020 at 03:00:32PM -0700, David G. Johnston wrote:
> I agree with Stephen; so maybe that part of the Wiki needs to be updated
> instead of having it sit there for use as a hammer when someone disagrees with
> a proffered patch.
> 
> Or maybe consider that "a patch" doesn't only mean "implement" - it is simply
> another language through which desirability and design can also be
> communicated.  The author gets to decide how wordy their prose is and choosing
> a wordy but concise language that is understandable by the vast majority of 
> the
> target audience seems like it should be considered acceptable.

If you want to debate that TODO section, you will need to start a new
thread.  I suggest Alastair's patch also appear in a new thread.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Proposed patch for key managment

2020-12-21 Thread David G. Johnston

On Mon, Dec 21, 2020 at 2:44 PM Bruce Momjian  wrote:

> On Sun, Dec 20, 2020 at 09:38:55PM -0500, Stephen Frost wrote:
> > > I'll have a go at adding another keyring and/or abstracting the key
> > > wrap and see how well a true peer to the passphrase approach fits in.
> >
> > Having patches to review and consider definitely helps, imv.
>
> I disagree.  Our order is:
>
> Desirability -> Design -> Implement -> Test -> Review -> Commit
> https://wiki.postgresql.org/wiki/Todo#Development_Process
>
> so, by posting a patch, you are decided to skip the first _two_
> requirements.  I think it might be time for me to create a new thread
> so it is not confused by whatever is posted here.
>
>
I agree with Stephen; so maybe that part of the Wiki needs to be updated
instead of having it sit there for use as a hammer when someone disagrees
with a proffered patch.

Or maybe consider that "a patch" doesn't only mean "implement" - it is
simply another language through which desirability and design can also be
communicated.  The author gets to decide how wordy their prose is and
choosing a wordy but concise language that is understandable by the vast
majority of the target audience seems like it should be considered
acceptable.

David J.

Re: Proposed patch for key managment

On Sun, Dec 20, 2020 at 05:59:09PM -0500, Stephen Frost wrote:
> I do generally like the idea of having the single command able to be
> used to either provide the KEK (where PG manages the keyring
> internally), or called multiple times to retrieve the DEKs (in which
> case PG wouldn't be managing the keyring).
> 
> However, after chatting with Bruce about it for a bit this weekend, I'm
> not sure that we need to tackle the second case today.  I don't think
> there's any doubt that there will be users who will want PG to manage
> the keyring and to be able to work with just a passphrase, as Bruce has
> worked to make happen here and which we have a patch for.  I'm hoping
> Bruce will post a new patch soon based on the work that he and I
> discussed today (mostly just clarifications around keys vs. passphrases
> and having the PG side accept a key which the command that's run will
> produce).  With that, I'm feeling pretty comfortable that we can move
> forward and at least get this initial work committed.

I agree with very little of this sub-discussion, but I am still reading
it to see if I can get useful ideas from it.  Looking at what we used to
do with 'passphrase', we did the prompting in the script, and did the
hash, HMAC validation, data key generation and its wrap in the server. 
With the 'cluster key' patch I just posted, the hashing of the
passphrase is removed from the server and happens only in the script,
because in many cases the hashing is not needed, and double-hashing is
less secure, so that seems like a win.

To go any further, you are starting to look at possible data key
generation in the script, either at boot time, and then wrapped with a
passphrase, or just retrieved on every boot.  That would create a larger
burden for any script, meaning a passphrase usage would have to not only
hash, which it does now, but also verify its MAC, then decrypt keys
stored in the file system, then echo those to its output, to be read by
the server.  I think this is the big point --- I have five scripts, and
only one needs to hash its input (passphrase).  If you go any farther in
pushing code into the scripts, these scripts become much harder to
write, and much harder to do error checks.

I think the common case is passphrase, so I want to optimize for that. 
Pushing anymore features into the scripts is going to make that common
case less reliable, which I am opposed to.

Also, as far as Desirability, we only have _one_ person saying people
will want more options.  I need to hear from a lot more people before
this gets added to Postgres.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Proposed patch for key managment

On Sun, Dec 20, 2020 at 09:38:55PM -0500, Stephen Frost wrote:
> > I'll have a go at adding another keyring and/or abstracting the key
> > wrap and see how well a true peer to the passphrase approach fits in.
> 
> Having patches to review and consider definitely helps, imv.

I disagree.  Our order is:

Desirability -> Design -> Implement -> Test -> Review -> Commit
https://wiki.postgresql.org/wiki/Todo#Development_Process

so, by posting a patch, you are decided to skip the first _two_
requirements.  I think it might be time for me to create a new thread
so it is not confused by whatever is posted here.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Better client reporting for "immediate stop" shutdowns

Up to now, if you shut down the database with "pg_ctl stop -m immediate"
then clients get a scary message claiming that something has crashed,
because backends can't tell whether the SIGQUIT they got was sent for
a crash-and-restart situation or because of an immediate-stop command.

This isn't great from a fit-and-finish perspective, and it occurs to me
that it's really easy to do better: the postmaster can stick a flag
into shared memory explaining the reason for SIGQUIT.  While we don't
like the postmaster touching shared memory, there doesn't seem to be
any need for interlocking or anything like that, so there is no risk
involved that's greater than those already taken by pmsignal.c.

So, here's a very simple proposed patch.  Some issues for possible
bikeshedding:

* Up to now, pmsignal.c has only been for child-to-postmaster
communication, so maybe there is some better place to put the
support code.  I can't think of one though.

* I chose to report the same message for immediate shutdown as we
already use for SIGTERM (fast shutdown or pg_terminate_backend()).
Should it be different, and if so what?

regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5d09822c81..3dd0a022f7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2887,6 +2887,8 @@ pmdie(SIGNAL_ARGS)
 			sd_notify(0, "STOPPING=1");
 #endif
 
+			/* tell children to shut down ASAP */
+			SetQuitSignalReason(PMQUIT_FOR_STOP);
 			TerminateChildren(SIGQUIT);
 			pmState = PM_WAIT_BACKENDS;
 
@@ -3464,6 +3466,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		LogChildExit(LOG, procname, pid, exitstatus);
 		ereport(LOG,
 (errmsg("terminating any other active server processes")));
+		SetQuitSignalReason(PMQUIT_FOR_CRASH);
 	}
 
 	/* Process background workers. */
diff --git a/src/backend/storage/ipc/pmsignal.c b/src/backend/storage/ipc/pmsignal.c
index 94c65877c1..8ef3f6da4a 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -1,7 +1,7 @@
 /*-
  *
  * pmsignal.c
- *	  routines for signaling the postmaster from its child processes
+ *	  routines for signaling between the postmaster and its child processes
  *
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
@@ -55,6 +55,10 @@
  * but carries the extra information that the child is a WAL sender.
  * WAL senders too start in ACTIVE state, but switch to WALSENDER once they
  * start streaming the WAL (and they never go back to ACTIVE after that).
+ *
+ * We also have a shared-memory field that is used for communication in
+ * the opposite direction, from postmaster to children: it tells why the
+ * postmaster has broadcasted SIGQUIT signals, if indeed it has done so.
  */
 
 #define PM_CHILD_UNUSED		0	/* these values must fit in sig_atomic_t */
@@ -65,8 +69,10 @@
 /* "typedef struct PMSignalData PMSignalData" appears in pmsignal.h */
 struct PMSignalData
 {
-	/* per-reason flags */
+	/* per-reason flags for signaling the postmaster */
 	sig_atomic_t PMSignalFlags[NUM_PMSIGNALS];
+	/* global flags for signals from postmaster to children */
+	QuitSignalReason sigquit_reason;	/* why SIGQUIT was sent */
 	/* per-child-process flags */
 	int			num_child_flags;	/* # of entries in PMChildFlags[] */
 	int			next_child_flag;	/* next slot to try to assign */
@@ -134,6 +140,7 @@ PMSignalShmemInit(void)
 
 	if (!found)
 	{
+		/* initialize all flags to zeroes */
 		MemSet(unvolatize(PMSignalData *, PMSignalState), 0, PMSignalShmemSize());
 		PMSignalState->num_child_flags = MaxLivePostmasterChildren();
 	}
@@ -171,6 +178,34 @@ CheckPostmasterSignal(PMSignalReason reason)
 	return false;
 }
 
+/*
+ * SetQuitSignalReason - broadcast the reason for a system shutdown.
+ * Should be called by postmaster before sending SIGQUIT to children.
+ *
+ * Note: in a crash-and-restart scenario, the "reason" field gets cleared
+ * as a part of rebuilding shared memory; the postmaster need not do it
+ * explicitly.
+ */
+void
+SetQuitSignalReason(QuitSignalReason reason)
+{
+	PMSignalState->sigquit_reason = reason;
+}
+
+/*
+ * GetQuitSignalReason - obtain the reason for a system shutdown.
+ * Called by child processes when they receive SIGQUIT.
+ * If the postmaster hasn't actually sent SIGQUIT, will return PMQUIT_NOT_SENT.
+ */
+QuitSignalReason
+GetQuitSignalReason(void)
+{
+	/* This is called in signal handlers, so be extra paranoid. */
+	if (!IsUnderPostmaster || PMSignalState == NULL)
+		return PMQUIT_NOT_SENT;
+	return PMSignalState->sigquit_reason;
+}
+
 
 /*
  * AssignPostmasterChildSlot - select an unused slot for a new postmaster
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 3679799e50..c3c0a0c01d 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -67,6 +67,7 @@
 #include

Re: Proposed patch for key managment

On Sun, Dec 20, 2020 at 05:59:09PM -0500, Stephen Frost wrote:
> However, after chatting with Bruce about it for a bit this weekend, I'm
> not sure that we need to tackle the second case today.  I don't think
> there's any doubt that there will be users who will want PG to manage
> the keyring and to be able to work with just a passphrase, as Bruce has
> worked to make happen here and which we have a patch for.  I'm hoping
> Bruce will post a new patch soon based on the work that he and I
> discussed today (mostly just clarifications around keys vs. passphrases
> and having the PG side accept a key which the command that's run will
> produce).  With that, I'm feeling pretty comfortable that we can move
> forward and at least get this initial work committed.

OK, here it the updated patch.  The major change, as Stephen pointed
out, is that the cluser_key_command (was cluster_passphrase_command) now
returns a hex-string representing the 32-byte KEK, rather than having
the server hash whatever the command used to return.  This avoids
double-hashing in cases where you are _not_ using a passphrase, but are
computing a random 32-byte value in the script.  This does mean the
script has to run sha256 for passphrases, but it seems like a win. 
Updated script are attached too.

The URLs are still accurate:

https://github.com/postgres/postgres/compare/master...bmomjian:key.diff

https://github.com/bmomjian/postgres/compare/key...bmomjian:key-alter.diff

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee



key.diff.gz
Description: application/gzip


key-alter.diff.gz
Description: application/gzip


ckey_aws.sh
Description: Bourne shell script


ckey_direct.sh
Description: Bourne shell script


ckey_passphrase.sh
Description: Bourne shell script


ckey_piv_nopin.sh
Description: Bourne shell script


ckey_piv_pin.sh
Description: Bourne shell script

Re: pg_preadv() and pg_pwritev()

2020-12-21 Thread Thomas Munro

On Mon, Dec 21, 2020 at 8:25 PM Jakub Wartak  wrote:
> > > I'm drawing a blank on trivial candidate uses for preadv(), without
> > > infrastructure from later patches.
> >
> > Can't immediately think of something either.
>
> This might be not that trivial , but maybe acquire_sample_rows() from 
> analyze.c ?
>
> Please note however there's patch 
> https://www.postgresql.org/message-id/20201109180644.GJ16415%40tamriel.snowman.net
>  ( https://commitfest.postgresql.org/30/2799/ ) for adding fadvise, but maybe 
> those two could be even combined so you would be doing e.g. 16x fadvise() and 
> then grab 8 pages in one preadv() call ?  I'm find unlikely however that 
> preadv give any additional performance benefit there after having fadvise, 
> but clearly a potential place to test.

Oh, interesting, that looks like another test case to study with the
AIO patch set, but I don't think it's worth trying to do a
simpler/half-baked version in the meantime.  (Since that ANALYZE patch
uses PrefetchBuffer() it should automatically benefit: the
posix_fadvise() calls will be replaced with consolidated preadv()
calls in a worker process or native AIO equivalent so that system
calls are mostly gone from the initiating process, and by the time you
try to access the buffer it'll hopefully see that it's finished
without any further system calls.  Refinements are possible though,
like making use of recent_buffer to avoid double-lookup, and
tuning/optimisation for how often IOs should be consolidated and
submitted.)

Re: Incorrect allocation handling for cryptohash functions with OpenSSL

2020-12-21 Thread Robert Haas

On Fri, Dec 18, 2020 at 6:04 AM Heikki Linnakangas  wrote:
> BTW, it's a bit weird that the pg_cryptohash_init/update/final()
> functions return success, if the ctx argument is NULL. It would seem
> more sensible for them to return an error. That way, if a caller forgets
> to check for NULL result from pg_cryptohash_create(), but correctly
> checks the result of those other functions, it would catch the error. In
> fact, if we documented that pg_cryptohash_create() can return NULL, and
> pg_cryptohash_final() always returns error on NULL argument, then it
> would be sufficient for the callers to only check the return value of
> pg_cryptohash_final(). So the usage pattern would be:
>
> ctx = pg_cryptohash_create(PG_MD5);
> pg_cryptohash_inti(ctx);
> pg_update(ctx, data, size);
> pg_update(ctx, moredata, size);
> if (pg_cryptohash_final(ctx, &hash) < 0)
>  elog(ERROR, "md5 calculation failed");
> pg_cryptohash_free(ctx);

TBH, I think there's no point in return an error here at all, because
it's totally non-specific. You have no idea what failed, just that
something failed. Blech. If we want to check that ctx is non-NULL, we
should do that with an Assert(). Complicating the code with error
checks that have to be added in multiple places that are far removed
from where the actual problem was detected stinks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-12-21 Thread Andres Freund

Hi,

On 2020-12-02 13:52:43 +0900, Fujii Masao wrote:
> Pushed. Thanks!

Why are wal_records/fpi long, instead of uint64?
longwal_records;/* # of WAL records produced */
longwal_fpi;/* # of WAL full page images 
produced */
uint64  wal_bytes;  /* size of WAL records produced 
*/

long is only 4 byte e.g. on windows, and it is entirely possible to wrap
a 4 byte record counter. It's also somewhat weird that wal_bytes is
unsigned, but the others are signed?

This is made doubly weird because on the SQL level you chose to make
wal_records, wal_fpi bigint. And wal_bytes numeric?

Greetings,

Andres Freund

Re: zstd compression for pg_dump

Justin Pryzby  writes:
> I found that our largest tables are 40% smaller and 20% faster to pipe
> pg_dump -Fc -Z0 |zstd relative to native zlib

The patch might be a tad smaller if you hadn't included a core file in it.

regards, tom lane

Re: BUG #16079: Question Regarding the BUG #16064

Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Dec 21, 2020 at 8:06 PM Stephen Frost  wrote:
> > * Magnus Hagander (mag...@hagander.net) wrote:
> > > On Mon, Dec 21, 2020 at 7:44 PM Tom Lane  wrote:
> > > > BTW, do we have a client-side setting to insist that passwords not be
> > > > sent in MD5 hashing either?  A person who is paranoid about this would
> > > > likely want to disable that code path as well.
> > >
> > > I don't think we do, and we possibly should. You can require channel
> > > binding which will require scram which solves the problem, but it does
> > > so only for scram.
> > >
> > > IIRC we've discussed having a parameter that says "allowed
> > > authentication methods" on the client as well, but I don't believe it
> > > has been built. But it wouldn't be bad to be able to for example force
> > > the client to only attempt gssapi auth, regardless of what the server
> > > asks for, and just fail if it's not there.
> >
> > The client is able to require a GSS encrypted connection, and a savy
> > user will realize that they should 'kinit' (or equivilant) locally and
> > never provide their password explicitly to the psql (or equivilant)
> > command, but that's certainly less than ideal.
> 
> Sure, but even if you do, then if you connect to a server that has gss
> support but is configured for password auth, it will perform password
> auth.

Right, and that's bad.  Think we agree on that.  I was just saying that
someone who understanding how GSS works wouldn't actually provide their
password at that point.  Trusting to that is definitely not sufficient
though.

> > Having a way to explicitly tell libpq what auth methods are acceptable
> > was discussed previously and does generally seem like a good idea, as
> > otherwise there's a lot of risk of what are essentially downgrade
> > attacks.
> 
> That was my point exactly..

Yes, it was my intention to agree with you on this. :)

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: BUG #16079: Question Regarding the BUG #16064

2020-12-21 Thread Magnus Hagander

On Mon, Dec 21, 2020 at 8:06 PM Stephen Frost  wrote:
>
> Greetings,
>
> * Magnus Hagander (mag...@hagander.net) wrote:
> > On Mon, Dec 21, 2020 at 7:44 PM Tom Lane  wrote:
> > > BTW, do we have a client-side setting to insist that passwords not be
> > > sent in MD5 hashing either?  A person who is paranoid about this would
> > > likely want to disable that code path as well.
> >
> > I don't think we do, and we possibly should. You can require channel
> > binding which will require scram which solves the problem, but it does
> > so only for scram.
> >
> > IIRC we've discussed having a parameter that says "allowed
> > authentication methods" on the client as well, but I don't believe it
> > has been built. But it wouldn't be bad to be able to for example force
> > the client to only attempt gssapi auth, regardless of what the server
> > asks for, and just fail if it's not there.
>
> The client is able to require a GSS encrypted connection, and a savy
> user will realize that they should 'kinit' (or equivilant) locally and
> never provide their password explicitly to the psql (or equivilant)
> command, but that's certainly less than ideal.

Sure, but even if you do, then if you connect to a server that has gss
support but is configured for password auth, it will perform password
auth.


> Having a way to explicitly tell libpq what auth methods are acceptable
> was discussed previously and does generally seem like a good idea, as
> otherwise there's a lot of risk of what are essentially downgrade
> attacks.

That was my point exactly..

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: BUG #16079: Question Regarding the BUG #16064

Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Mon, Dec 21, 2020 at 7:44 PM Tom Lane  wrote:
> > BTW, do we have a client-side setting to insist that passwords not be
> > sent in MD5 hashing either?  A person who is paranoid about this would
> > likely want to disable that code path as well.
> 
> I don't think we do, and we possibly should. You can require channel
> binding which will require scram which solves the problem, but it does
> so only for scram.
> 
> IIRC we've discussed having a parameter that says "allowed
> authentication methods" on the client as well, but I don't believe it
> has been built. But it wouldn't be bad to be able to for example force
> the client to only attempt gssapi auth, regardless of what the server
> asks for, and just fail if it's not there.

The client is able to require a GSS encrypted connection, and a savy
user will realize that they should 'kinit' (or equivilant) locally and
never provide their password explicitly to the psql (or equivilant)
command, but that's certainly less than ideal.

Having a way to explicitly tell libpq what auth methods are acceptable
was discussed previously and does generally seem like a good idea, as
otherwise there's a lot of risk of what are essentially downgrade
attacks.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: Invalidate acl.c caches for pg_authid.rolinherit changes

2020-12-21 Thread Bossart, Nathan

On 12/21/20, 1:51 AM, "Noah Misch"  wrote:
> Backends reflect "GRANT role_name" changes rather quickly, due to a syscache
> invalidation callback.  Let's register an additional callback to reflect
> "ALTER ROLE ... [NO]INHERIT" with equal speed.  I propose to back-patch this.
> While pg_authid changes may be more frequent than pg_auth_members changes, I
> expect neither is frequent enough to worry about the resulting acl.c cache
> miss rate.

+1 to back-patching.

> pg_authid changes don't affect cached_membership_roles, so I could have
> invalidated cached_privs_roles only.  That felt like needless complexity.  I
> expect cached_privs_role gets the bulk of traffic, since SELECT, INSERT,
> UPDATE and DELETE use it.  cached_membership_roles pertains to DDL and such.

The patch looks reasonable to me.

Nathan

Re: BUG #16079: Question Regarding the BUG #16064

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Jeff Janes  writes:
> >>> I would suggest going further.  I would make the change on the client 
> >>> side,
> >>> and have libpq refuse to send unhashed passwords without having an
> >>> environment variable set which allows it.
> 
> >> As noted, that would break LDAP and RADIUS auth methods; likely also PAM.
> 
> > Which would be an altogether good thing as all of those end up exposing
> > sensitive information should the server be compromised and a user uses
> > one of them to log in.
> 
> Hm.  I'm less concerned about that scenario than about somebody snooping
> the on-the-wire traffic.  If we're going to invent a connection setting
> for this, I'd say that in addition to "ok to send cleartext password"
> and "never ok to send cleartext password", there should be a setting for
> "send cleartext password only if connection is encrypted".  Possibly
> that should even be the default.

I'd still strongly advocate for "never ok to send cleartext password" to
be the default, otherwise we put this out and then everyone ends up
having to include "set this on all your clients to never allow it" in
their hardening guidelines.  That's really not ideal.

That said, having such an option would certainly be better than not
having any reasonable way on the client side to make sure that the
user's password isn't being sent to the server.

> (I guess Unix-socket connections would be an exception, since we never
> encrypt those.)

For the middle-ground "I don't care if the server sees my password, but
don't want someone on the network seeing it" it would seem unix sockets
would be alright.

> BTW, do we have a client-side setting to insist that passwords not be
> sent in MD5 hashing either?  A person who is paranoid about this would
> likely want to disable that code path as well.

No, but it would surely be good if we did... or we could just rip out
the md5 support entirely.

(Yes, I appreciate that the position I'm taking here isn't likely to be
popular and I'm not going to completely say no to compromises, but every
kind of compromise like these invites users to end up doing the insecure
thing; the more difficult we make it to do the insecure thing the better
overall for security.)

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: BUG #16079: Question Regarding the BUG #16064

2020-12-21 Thread Magnus Hagander

On Mon, Dec 21, 2020 at 7:44 PM Tom Lane  wrote:
>
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Jeff Janes  writes:
> >>> I would suggest going further.  I would make the change on the client 
> >>> side,
> >>> and have libpq refuse to send unhashed passwords without having an
> >>> environment variable set which allows it.
>
> >> As noted, that would break LDAP and RADIUS auth methods; likely also PAM.
>
> > Which would be an altogether good thing as all of those end up exposing
> > sensitive information should the server be compromised and a user uses
> > one of them to log in.
>
> Hm.  I'm less concerned about that scenario than about somebody snooping
> the on-the-wire traffic.  If we're going to invent a connection setting
> for this, I'd say that in addition to "ok to send cleartext password"
> and "never ok to send cleartext password", there should be a setting for
> "send cleartext password only if connection is encrypted".  Possibly
> that should even be the default.
>
> (I guess Unix-socket connections would be an exception, since we never
> encrypt those.)

"send cleartext password only if connection is secure", and define
secure as being tls encrypted, gss encrypted, or unix socket.


> BTW, do we have a client-side setting to insist that passwords not be
> sent in MD5 hashing either?  A person who is paranoid about this would
> likely want to disable that code path as well.

I don't think we do, and we possibly should. You can require channel
binding which will require scram which solves the problem, but it does
so only for scram.

IIRC we've discussed having a parameter that says "allowed
authentication methods" on the client as well, but I don't believe it
has been built. But it wouldn't be bad to be able to for example force
the client to only attempt gssapi auth, regardless of what the server
asks for, and just fail if it's not there.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/

Re: BUG #16079: Question Regarding the BUG #16064

Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Jeff Janes  writes:
>>> I would suggest going further.  I would make the change on the client side,
>>> and have libpq refuse to send unhashed passwords without having an
>>> environment variable set which allows it.

>> As noted, that would break LDAP and RADIUS auth methods; likely also PAM.

> Which would be an altogether good thing as all of those end up exposing
> sensitive information should the server be compromised and a user uses
> one of them to log in.

Hm.  I'm less concerned about that scenario than about somebody snooping
the on-the-wire traffic.  If we're going to invent a connection setting
for this, I'd say that in addition to "ok to send cleartext password"
and "never ok to send cleartext password", there should be a setting for
"send cleartext password only if connection is encrypted".  Possibly
that should even be the default.

(I guess Unix-socket connections would be an exception, since we never
encrypt those.)

BTW, do we have a client-side setting to insist that passwords not be
sent in MD5 hashing either?  A person who is paranoid about this would
likely want to disable that code path as well.

regards, tom lane

Re: BUG #16079: Question Regarding the BUG #16064

Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Jeff Janes  writes:
> > On Sun, Dec 20, 2020 at 7:58 PM Stephen Frost  wrote:
> >> * Magnus Hagander (mag...@hagander.net) wrote:
> >>> Maybe we should do the same for LDAP (and RADIUS)? This seems like a
> >>> better place to put it than to log it at every time it's received?
> 
> >> A dollar short and a year late, but ... +1.
> 
> > I would suggest going further.  I would make the change on the client side,
> > and have libpq refuse to send unhashed passwords without having an
> > environment variable set which allows it.
> 
> As noted, that would break LDAP and RADIUS auth methods; likely also PAM.

Which would be an altogether good thing as all of those end up exposing
sensitive information should the server be compromised and a user uses
one of them to log in.

The point would be to make it clear to the user, while having an escape
hatch if necessary, that they're sending their password (or pin in the
RADIUS case) to the server.

> > What is the value of logging on the server side?
> 
> I do agree with this point, but mostly on the grounds of "nobody reads
> the server log".

I agree that doing this server side really isn't all that helpful.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: BUG #16079: Question Regarding the BUG #16064

Jeff Janes  writes:
> On Sun, Dec 20, 2020 at 7:58 PM Stephen Frost  wrote:
>> * Magnus Hagander (mag...@hagander.net) wrote:
>>> Maybe we should do the same for LDAP (and RADIUS)? This seems like a
>>> better place to put it than to log it at every time it's received?

>> A dollar short and a year late, but ... +1.

> I would suggest going further.  I would make the change on the client side,
> and have libpq refuse to send unhashed passwords without having an
> environment variable set which allows it.

As noted, that would break LDAP and RADIUS auth methods; likely also PAM.

> What is the value of logging on the server side?

I do agree with this point, but mostly on the grounds of "nobody reads
the server log".

regards, tom lane

Re: bad dependency in pg_dump output related to support function breaks binary upgrade

2020-12-21 Thread Pavel Stehule

po 21. 12. 2020 v 17:23 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > some Orafce's user reported problems with pg_upgrade. I checked this
> issue
> > and it looks like pg_dump problem:
> > ...
> > the supporting function should be dumped first before function where
> > supporting function is used.
>
> I tried to reproduce this and could not.  It should work, since
> ProcedureCreate definitely makes a dependency on the support function.
> Can you make a self-contained test case?
>

After some deeper investigation I found an old bug in Orafce :-/. I am
sorry for the noise.

This old bug is related to introduction aliases types of varchar -
nvarchar2 and varchar2. In this age the "in" function can use a
protransform column, but there was not a possibility how to set this column
externally, and Orafce used dirty update. The value was correct, but the
new dependency was not used. Originally it was not a problem, because the
transform function was built in. But there was a new issue related to
Postgres 12 when these functions were renamed. I fixed this issue by
introducing my own wrapping function - but without dependency I broke the
binary upgrade.

On Postgres 12 and higher I can use ALTER FUNCTION SUPPORT and all works
well. On older platforms I have to hack pg_depend, but it is working too.

Again I am sorry for false alarm

Regards

Pavel

>
> regards, tom lane
>

Re: Weird special case in jsonb_concat()

"Joel Jacobson"  writes:
> On Sat, Dec 19, 2020, at 21:35, Tom Lane wrote:
>> Here is a proposed patch for that.

> In addition, to the user wondering how to append a json array-value "as is",
> I think it would be useful to provide an example on how to do this
> in the documentation.

Done in v13 and HEAD; the older table format doesn't really have room
for more examples.

> +1 back-patch, I think it's a bug.

I'm not quite sure it's a bug, but it does seem like fairly unhelpful
behavior to throw an error instead of doing something useful, so
back-patched.

regards, tom lane

Re: how to find log

2020-12-21 Thread Dmitry Markman

Thanks Tom, Andrew
I’ll try out logging_collector facility

thanks again

dm


> On Dec 20, 2020, at 12:04 PM, Andrew Dunstan  wrote:
> 
> 
> On 12/20/20 11:31 AM, Tom Lane wrote:
>> Dmitry Markman  writes:
>>> suppose I started the server with the following command
>>> pg_ctl -D . . . start -l 
>>> is there a way to get  later by sending some query to the 
>>> server or
>> No, the server has no way to know where its stdout/stderr were
>> pointed to.  You might want to enable the syslogger output method
>> (see logging_collector) to have something a bit more featureful.
>> 
>> https://www.postgresql.org/docs/current/runtime-config-logging.html
>> 
>>  
> 
> 
> 
> Alternatively, asking the OS in many cases will work, e.g. on my linux
> machine:
> 
> 
> ls -l /proc/{postmasterpid}/fd/1
> 
> 
> cheers
> 
> 
> andrew
> 
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>

Re: WIP: System Versioned Temporal Table

2020-12-21 Thread Surafel Temesgen

Hi Ryan,

On Fri, Dec 18, 2020 at 10:28 PM Ryan Lambert 
wrote:

> On Thu, Nov 19, 2020 at 11:04 AM Surafel Temesgen 
> wrote:
>
> The docs have two instances of "EndtTime" that should be "EndTime".
>

Since my first language is not english i'm glad you find only this error
on doc. I will send rebased pach soon

regards
Surafel

Re: [PATCH] Logical decoding of TRUNCATE

2020-12-21 Thread Andres Freund

Hi,

On 2020-12-20 15:54:31 -0800, Peter Geoghegan wrote:
> On Sun, Dec 20, 2020 at 3:13 PM Andres Freund  wrote:
> > Hm. Do I understand correctly that this problem is hit solely because
> > the parallel mode code relies on there already have been a transaction
> > snapshot set, thus avoiding the error? And that the code normally only
> > works because GetTransactionSnapshot() will already have been called
> > somewhere, before EnterParallelMode()?
> 
> It seems unlikely that InitializeParallelDSM() was ever intended to be
> run in a background worker.

IDK, the environment in a bgworker shouldn't be that different from the
normal query environment in a normal connection. And it's far from
insane to want to be able to run a paralell query in a bgworker (and I
*think* I have seen that work before). This case here seems more like
an accidental dependency than anything to me, once that could perhaps
even hint at problems in normal backends too.

Greetings,

Andres Freund

Re: BUG #16079: Question Regarding the BUG #16064

2020-12-21 Thread Jeff Janes

On Sun, Dec 20, 2020 at 7:58 PM Stephen Frost  wrote:

>
> * Magnus Hagander (mag...@hagander.net) wrote:
>
>
Changed from bugs to hackers.

> > For the old plaintext "password" method, we log a warning when we parse
> the
> > configuration file.
>

Like Stephen, I don't see such a warning getting logged.

> >
> > Maybe we should do the same for LDAP (and RADIUS)? This seems like a
> better
> > place to put it than to log it at every time it's received?
>
> A dollar short and a year late, but ... +1.

I would suggest going further.  I would make the change on the client side,
and have libpq refuse to send unhashed passwords without having an
environment variable set which allows it.  (Also, change the client
behavior so it defaults to verify-full when PGSSLMODE is not set.)

What is the value of logging on the server side?  I can change the setting
from password to md5 or from ldap to gss, when I notice the log message.
But once compromised or during a MITM attack, the bad guy will just set it
back to the unsafe form and the client will silently go along with it.

Cheers,

Jeff

Re: bad dependency in pg_dump output related to support function breaks binary upgrade

Pavel Stehule  writes:
> some Orafce's user reported problems with pg_upgrade. I checked this issue
> and it looks like pg_dump problem:
> ...
> the supporting function should be dumped first before function where
> supporting function is used.

I tried to reproduce this and could not.  It should work, since
ProcedureCreate definitely makes a dependency on the support function.
Can you make a self-contained test case?

regards, tom lane

Re: Dependency isn't created between extension and schema

James Coleman  writes:
> On Mon, Dec 21, 2020 at 2:59 AM Michael Paquier  wrote:
>> On Mon, Dec 21, 2020 at 04:02:29PM +0900, Masahiko Sawada wrote:
>>> Is it a bug? Since the created schema obviously depends on the
>>> extension when we created the schema specified in the schema option, I
>>> think we might want to create the dependency so that DROP EXTENSION
>>> drops the schema as well.

>> FWIW, I recall that the "soft" behavior that exists now is wanted, as
>> it is more flexible for DROP EXTENSION: what you are suggesting here
>> has the disadvantage to make DROP EXTENSION fail if any non-extension
>> object has been created on this schema, so this could be disruptive
>> when it comes to some upgrade scenarios.

I think it absolutely is intentional.  For example, if several extensions
all list "schema1" in their control files, and you install them all, you
would not want dropping the first-created one to force dropping the rest.
I do not really see any problem here that's worth creating such hazards
to fix.

(At least in current usage, I think that control files probably always
list common schemas not per-extension schemas, so that this scenario
would be the norm not the exception.)

> Alternatively the behavior could be updated to match the docs, since
> that seems like reasonable intent.

That documentation is talking about the SCHEMA option in CREATE EXTENSION,
which is an entirely different matter from the control-file option.
A control-file entry is not going to know anything about the specific
installation it's being installed in, while the user issuing CREATE
EXTENSION presumably has local knowledge; so I don't see any strong
argument that the two cases must be treated alike.

regards, tom lane

Inconsistent/confusing handling of tablespaces for partitioned tables

2020-12-21 Thread Konstantin Knizhnik


Hi hackers,

The following sequence of statements:

CREATE SCHEMA testschema;
CREATE TABLE testschema.part (a int) PARTITION BY LIST (a);
SET default_tablespace TO pg_global;
ALTER TABLE testschema.part SET TABLESPACE pg_default;
CREATE TABLE testschema.part_78 PARTITION OF testschema.part FOR VALUES 
IN (7, 8) PARTITION BY LIST (a);


Produce error
ERROR:  only shared relations can be placed in pg_global tablespace
when been executed in database with default tablespace, but produce no 
error in database with assigned tablespace.


create tablespace my_tblspc location '/tmp/tblspc';
create databse test;
alter database test set tablespace my_tblspc;


There is the following code in tablecmds.c:

    else if (stmt->partbound)
    {
        /*
         * For partitions, when no other tablespace is specified, we 
default

         * the tablespace to the parent partitioned table's.
         */
        Assert(list_length(inheritOids) == 1);
        tablespaceId = get_rel_tablespace(linitial_oid(inheritOids));
    }

In first case get_rel_tablespace returns 0 (because parent table has no 
explicit tablespace)

and in the second: pg_default


Also I am confused that the following statement is rejected:

SET default_tablespace TO pg_default;
CREATE TABLE testschema.part (a int) PARTITION BY LIST (a);
ERROR:  cannot specify default tablespace for partitioned relations

but still it is possible to set tablespace of parent table to pg_default 
using alter tablespace command:


RESET default_tablespace;
CREATE TABLE testschema.part (a int) PARTITION BY LIST (a);
ALTER TABLE testschema.part SET TABLESPACE pg_default;

But ... it has no effect: testschema.part is till assumed to belong to 
default tablespace.

Because of the following code in tablecmds.c:


    /*
     * No work if no change in tablespace.
     */
    oldTableSpace = rel->rd_rel->reltablespace;
    if (newTableSpace == oldTableSpace ||
        (newTableSpace == MyDatabaseTableSpace && oldTableSpace == 0))
    {
        InvokeObjectPostAlterHook(RelationRelationId,
                                  RelationGetRelid(rel), 0);

        relation_close(rel, NoLock);
        return;
    }


I found the thread discussing the similar problem:
https://www.postgresql.org/message-id/flat/BY5PR18MB3170E372542F34694E630B12F10C0%40BY5PR18MB3170.namprd18.prod.outlook.com

and looks like the decision was to change nothing and leave everything 
as it is.


From my point of view the source of the problem is that pg_default 
(oid=1663) is treated as database default tablespace.
pg_default stands for concrete tablespace and it is not clear why it is 
treated in different way comparing with any other tablepsace.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: a misbehavior of partition row movement (?)

2020-12-21 Thread Arne Roland

Hi Amit,


thanks for the quick reply! Sadly I have been busy and the second part of your 
patch does no longer apply in src/include/nodes/execnodes.h:497.


I'm sorry, I should have been more careful rereading my posts. The case I meant 
is the one below. I checked the thread again. I can barely believe, I didn't 
post such an example along back then. Sorry for the confusion!


create table a (id serial, primary key (id)) partition by range (id);
create table b (id serial,  primary key (id)) partition by range (id);
create table a1 partition of a for values from (1) to (2);
create table a2 partition of a for values from (2) to (3);
create table b1 partition of b for values from (1) to (2);
create table b2 partition of b for values from (2) to (3);
insert into a (id) values (1);
insert into b (id) values (1);

create or replace function del_trig_fkt()
 returns trigger
 language plpgsql
as $$
  begin
raise notice 'Deleted!';
return old;
  end;
$$;
create trigger a_del_trig after delete on a for each row execute function 
del_trig_fkt();
create or replace function public.upd_trig_fkt()
 returns trigger
 language plpgsql
as $function$
begin
  raise notice 'Updated!';
  return new;
end;
$function$;
create trigger a_upd_trig after update on a for each row execute function 
upd_trig_fkt();

update a set id=2;

To me the issue seems to have litte to do with the fact that the trigger is 
executed on the leaf node in itself, but rather we lack the infrastructure to 
track whether the tuple is removed or only rerouted.

Regards
Arne



From: Amit Langote 
Sent: Tuesday, December 15, 2020 4:45:19 AM
To: Arne Roland
Cc: Tomas Vondra; David G. Johnston; PostgreSQL-development
Subject: Re: a misbehavior of partition row movement (?)

On Tue, Dec 15, 2020 at 12:43 PM Amit Langote  wrote:
> Quoting your original example:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;
> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly errors out instead of silently performing the ON DELETE CASCADE
> update a set id=2;
> ERROR:  update or delete on table "a" violates foreign key constraint
> "a_fk" on table "b"
> DETAIL:  Key (id)=(1) is still referenced from table "b".
>
> select * from b;
>  id
> 
>   1
> (1 row)
>
> Changing the example to specify ON UPDATE CASCADE:
>
> drop table a, b;
> create table a (id serial, primary key (id)) partition by range (id);
> create table b (id serial,  primary key (id)) partition by range (id);
> alter table b add constraint a_fk foreign key (id) references a (id)
> on delete cascade;

Oops, I copy-pasted the wrong block of text from my terminal.  I meant:

alter table b add constraint a_fk foreign key (id) references a (id)
on delete cascade on update cascade;

> create table a1 partition of a for values from (1) to (2);
> create table a2 partition of a for values from (2) to (3);
> create table b1 partition of b for values from (1) to (2);
> create table b2 partition of b for values from (2) to (3);
> insert into a (id) values (1);
> insert into b (id) values (1);
>
> -- correctly applies ON UPDATE CASCADE action
> update a set id=2;
> UPDATE 1
>
> select * from b;
>  id
> 
>   2
> (1 row)

--
Amit Langote
EDB: http://www.enterprisedb.com

Re: allow to \dtS+ pg_toast.*

2020-12-21 Thread Laurenz Albe

On Fri, 2020-12-18 at 11:33 -0600, Justin Pryzby wrote:
> > > > > This makes toast tables a bit less special and easier to inspect.
> > 
> > I wonder why the modification in "listPartitionedTables" is necessary.
> > Surely there cannot be any partitioned toast tables, can there?
> 
> The comment should be removed for consistency.
> And I changed the code for consistency with listTables (from which I assume
> listPartitionedTables was derived - I was involved in the last stages of that
> patch).  It doesn't need to exclude pg_catalog or information_schema, either,
> but it's kept the same for consistency.  That part could also be removed.

I don't think that consistency with "listTables" is particularly useful here,
but I think this is somewhat academic.

I'll leave that for the committer to decide.

> > > > Another thing that is missing is tab completion for
> > > > regression=# \dtS pg_toast.pg_
> > > > This should work just like for \d and \dS.
> 
> I agree that it's nice to complete the schema name, but I'm still not 
> convinced
> this part should be included.
> 
> The way to include pg_toast.pg_toast is if toast relations are included, which
> is exactly what Tom pointed out is usually unhelpful.  If you include toast
> relations, tab completion might give "pg_toast.pg_toast_14..." when you wanted
> to paste "145678" - you'd need to remove the common suffix that it found.

Again a judgement call.  I am happy with the way the latest patch does it.

> I considered whether "toast table" should be capitalized (as it is for "\d")
> but I think it should stay lowercase.

Then you should also change the way \d does it (upper case).
I think we should be consistent.

I'd use TOAST for both to create no unnecessary change in \d output.

Anyway, I think that this is ready for committer and will mark it as such.

Yours,
Laurenz Albe

Re: Dependency isn't created between extension and schema

2020-12-21 Thread James Coleman

On Mon, Dec 21, 2020 at 2:59 AM Michael Paquier  wrote:
>
> On Mon, Dec 21, 2020 at 04:02:29PM +0900, Masahiko Sawada wrote:
> > Is it a bug? Since the created schema obviously depends on the
> > extension when we created the schema specified in the schema option, I
> > think we might want to create the dependency so that DROP EXTENSION
> > drops the schema as well. I’ve attached the draft patch so that CREATE
> > EXTENSION creates the dependency if it newly creates the schema.
>
> FWIW, I recall that the "soft" behavior that exists now is wanted, as
> it is more flexible for DROP EXTENSION: what you are suggesting here
> has the disadvantage to make DROP EXTENSION fail if any non-extension
> object has been created on this schema, so this could be disruptive
> when it comes to some upgrade scenarios.

That's potentially an issue even for a schema created explicitly by
the extension's install script, since anyone can create an object
within that schema at any time.

It seems that the only consistent behavior choice would be to mark the
dependency when Postgres is creating the extension automatically but
not when the schema already exists.

>   schema_name
>
> 
>  The name of the schema in which to install the extension's
>  objects, given that the extension allows its contents to be
>  relocated.  The named schema must already exist.
> While on it..  The docs are incorrect here.  As you say,
> CreateExtensionInternal() may internally create a schema.

Alternatively the behavior could be updated to match the docs, since
that seems like reasonable intent.

James

Re: [Patch] Optimize dropping of relation buffers using dlist

On Thu, Nov 19, 2020 at 12:37 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Andres Freund 
>
> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber *forkNum,
> > int nforks,
> > +   BlockNumber
> > *nForkBlocks, BlockNumber *firstDelBlock)
> > ...
> > + /* Check that it is in the buffer pool. If not, do 
> > nothing.
> > */
> > + LWLockAcquire(bufPartitionLock, LW_SHARED);
> > + buf_id = BufTableLookup(&bufTag, bufHash);
> > ...
> > + bufHdr = GetBufferDescriptor(buf_id);
> > +
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
> > + bufHdr->tag.forkNum == forkNum[i] &&
> > + bufHdr->tag.blockNum >= firstDelBlock[i])
> > + InvalidateBuffer(bufHdr);   /* releases
> > spinlock */
> > + else
> > + UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition lock, 
> > and
> > have done a lookup in the mapping table. Why are we then rechecking the
> > relfilenode/fork/blocknum? And why are we doing so holding the buffer header
> > lock, which is essentially a spinlock, so should only ever be held for very 
> > short
> > portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc, but 
> > there
> > the situation is different: We haven't done a buffer mapping lookup, and we
> > don't hold a partition lock!
>
> That's because the buffer partition lock is released immediately after the 
> hash table has been looked up.  As an aside, InvalidateBuffer() requires the 
> caller to hold the buffer header spinlock and doesn't hold the buffer 
> partition lock.
>

This answers the second part of the question but what about the first
part (We hold a buffer partition lock, and have done a lookup in the
mapping table. Why are we then rechecking the
relfilenode/fork/blocknum?)

I think we don't need such a check, rather we can have an Assert
corresponding to that if-condition in the patch. I understand it is
safe to compare relfilenode/fork/blocknum but it might confuse readers
of the code.

I have started doing minor edits to the patch especially planning to
write a theory why is this optimization safe and here is what I can
come up with: "To remove all the pages of the specified relation forks
from the buffer pool, we need to scan the entire buffer pool but we
can optimize it by finding the buffers from BufMapping table provided
we know the exact size of each fork of the relation. The exact size is
required to ensure that we don't leave any buffer for the relation
being dropped as otherwise the background writer or checkpointer can
lead to a PANIC error while flushing buffers corresponding to files
that don't exist.

To know the exact size, we rely on the size cached for each fork by us
during recovery which limits the optimization to recovery and on
standbys but we can easily extend it once we have shared cache for
relation size.

In recovery, we cache the value returned by the first lseek(SEEK_END)
and the future writes keeps the cached value up-to-date. See
smgrextend. It is possible that the value of the first lseek is
smaller than the actual number of existing blocks in the file due to
buggy Linux kernels that might not have accounted for the recent
write. But that should be fine because there must not be any buffers
after that file size.

XXX We would make the extra lseek call for the unoptimized paths but
that is okay because we do it just for the first fork and we anyway
have to scan the entire buffer pool the cost of which is so high that
the extra lseek call won't make any visible difference. However, we
can use InRecovery flag to avoid the additional cost but that doesn't
seem worth it."

Thoughts?

-- 
With Regards,
Amit Kapila.

Re: Single transaction in the tablesync worker?

On Mon, Dec 21, 2020 at 3:17 PM Peter Smith  wrote:
>
> On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila  wrote:
>
> > Few other comments:
> > ==
>
> Thanks for your feedback.
>
> > 1.
> > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > because I don't think
> > + * such workers are even iterated by this loop, and nobody else is
> > removing them.
> > + */
> > + if (slotname)
> > + {
> >
> > The above FIXME is not clear to me. Actually, the crashed workers
> > should restart, finish their work, and drop the slots. So not sure
> > what exactly this FIXME refers to?
>
> Yes, normally if the tablesync can complete it should behave like that.
> But I think there are other scenarios where it may be unable to
> clean-up after itself. For example:
>
> i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> handled by tablesync can give a PK violation which also will crash
> again and again for each re-launched/replacement tablesync worker.
> This can be reproduced in the debugger. If the DropSubscription
> doesn't clean-up the tablesync's slot then nobody will.
>
> ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> ensure that the launcher doesn't restart new worker during dropping
> the subscription".
>

Yeah, I have also read that comment but do you know how it is
preventing relaunch? How does the subscription lock help?

> So executing DROP SUBSCRIPTION will prevent a newly
> crashed tablesync from re-launching, so it won’t be able to take care
> of its own slot. If the DropSubscription doesn't clean-up that
> tablesync's slot then nobody will.
>


> >
> > 2.
> > DropSubscription()
> > {
> > ..
> > ReplicationSlotDropAtPubNode(
> > + NULL,
> > + conninfo, /* use conninfo to make a new connection. */
> > + subname,
> > + syncslotname);
> > ..
> > }
> >
> > With the above call, it will form a connection with the publisher and
> > drop the required slots. I think we need to save the connection info
> > so that we don't need to connect/disconnect for each slot to be
> > dropped. Later in this function, we again connect and drop the apply
> > worker slot. I think we should connect just once drop the apply and
> > table sync slots if any.
>
> OK. IIUC this is a suggestion for more efficient connection usage,
> rather than actual bug right?
>

Yes, it is for effective connection usage.

> I have added this suggestion to my TODO
> list.
>
> >
> > 3.
> > ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> > *conninfo, char *subname, char *slotname)
> > {
> > ..
> > + PG_TRY();
> > ..
> > + PG_CATCH();
> > + {
> > + /* NOP. Just gobble any ERROR. */
> > + }
> > + PG_END_TRY();
> >
> > Why are we suppressing the error instead of handling it the error in
> > the same way as we do while dropping the apply worker slot in
> > DropSubscription?
>
> This function is common - it is also called from the tablesync
> finish_sync_worker. But in the finish_sync_worker case I wanted to
> avoid throwing an ERROR which would cause the tablesync to crash and
> relaunch (and crash/relaunch/repeat...) when all it was trying to do
> in the first place was just cleanup and exit the process. Perhaps the
> error suppression should be conditional depending where this function
> is called from?
>

Yeah, that could be one way and if you follow my previous suggestion
this function might change a bit more.

-- 
With Regards,
Amit Kapila.

Incorrect assertion in ExecBuildAggTrans ?

2020-12-21 Thread Andrew Gierth

ExecBuildAggTrans has this sequence:

if (pertrans->deserialfn.fn_strict)
scratch.opcode = EEOP_AGG_STRICT_DESERIALIZE;
else
scratch.opcode = EEOP_AGG_DESERIALIZE;

scratch.d.agg_deserialize.fcinfo_data = ds_fcinfo;
scratch.d.agg_deserialize.jumpnull = -1;/* adjust later */
scratch.resvalue = &trans_fcinfo->args[argno + 1].value;
scratch.resnull = &trans_fcinfo->args[argno + 1].isnull;

ExprEvalPushStep(state, &scratch);
adjust_bailout = lappend_int(adjust_bailout,
 state->steps_len - 1);

but later on, where adjust_bailout is processed, we see this (note that
EEOP_AGG_DESERIALIZE is not checked for):

if (as->opcode == EEOP_JUMP_IF_NOT_TRUE)
{
Assert(as->d.jump.jumpdone == -1);
as->d.jump.jumpdone = state->steps_len;
}
else if (as->opcode == EEOP_AGG_STRICT_INPUT_CHECK_ARGS ||
 as->opcode == EEOP_AGG_STRICT_INPUT_CHECK_NULLS)
{
Assert(as->d.agg_strict_input_check.jumpnull == -1);
as->d.agg_strict_input_check.jumpnull = state->steps_len;
}
else if (as->opcode == EEOP_AGG_STRICT_DESERIALIZE)
{
Assert(as->d.agg_deserialize.jumpnull == -1);
as->d.agg_deserialize.jumpnull = state->steps_len;
}
else
Assert(false);

Seems clear to me that the assertion is wrong, and that even though a
non-strict DESERIALIZE opcode might not need jumpnull filled in, the
code added it to adjust_bailout anyway, so we crash out here on an
asserts build. This may have been overlooked because all the builtin
deserialize functions appear to be strict, but postgis has at least one
non-strict one and can hit this.

This could be fixed either by fixing the assert, or by not adding
non-strict deserialize opcodes to adjust_bailout; anyone have any
preferences?

-- 
Andrew (irc:RhodiumToad)

Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-21 Thread Maxim Orlov


On 2020-12-18 21:54, Pavel Borisov wrote:

I've realized one strange effect in current grammar parsing: if I do

CREATE TABLE foo (a int) PARTITION BY LIST (a) CONFIGURATION (a 1);
ERROR:  unrecognized auto partition bound specification "a"

I consulted the patch code and realized that in fact, the patch
considers it the (invalid) HASH bounds (doesn't find a word 'modulus')
unless it is specified to be (still invalid) LIST. This is due to the
fact that the grammar parser is not context-aware and in the patch, we
tried to avoid the new parser keyword MODULUS. The effect is that
inside a CONFIGURATION parentheses in case of HASH bounds we don't
have a single keyword for the parser to determine it is really a HASH
case.

It doesn't make the patch work wrongly, besides it checks the validity
of all types of bounds in the HASH case even when the partitioning is
not HASH. I find this slightly bogus. This is because the parser can
not determine the type of partitioning inside the configuration clause
and this makes adding new syntax (e.g. adding RANGE partitioning
configuration inside CONFIGURATION parentheses) complicated.

So I have one more syntax proposal: to have separate keywords inside
CONFIGURATION parentheses for each partitioning type.
E.g:
CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES
IN (1,2),(3,4) DEFAULT PARTITION foo_def);
CREATE TABLE foo(a int) PARTITION BY HASH(a) CONFIGURATION (FOR VALUES
WITH MODULUS 3);
CREATE TABLE foo(a int) PARTITION BY RAGE(a) CONFIGURATION (FOR VALUES
FROM 1 TO 1000 INTERVAL 10 DEFAULT PARTITION foo_def);

This proposal is in accordance with the current syntax of declarative
partitioning: CREATE TABLE foo_1 PARTITION OF foo FOR VALUES ...

Some more facultative proposals incremental to the abovementioned:
1. Omit CONFIGURATION with/without parentheses. This makes syntax
closer to (non-automatic) declarative partitioning syntax but the
clause seems less legible (in my opinion).
2. Omit just FOR VALUES. This makes the clause short, but adds a
difference to (non-automatic) declarative partitioning syntax.

 I'm planning also to add RANGE partitioning syntax to this in the
future and I will be happy if all three types of the syntax could come
along easily.

 I very much appreciate your views on this especially regarding that
changes can be still made easily because the patch is not committed
yet.

--

Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com [1]


Links:
--
[1] http://www.postgrespro.com


In my view, next expressions are the golden ground here. On one hand, 
not far from the original
non-automatic declarative partitioning syntax syntax, on the other hand, 
omit CONFIGURATION
key-word (which is redundant here in terms of gram parsing) makes this 
expressions less apprehensible for the human.


CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES 
IN (1,2),(3,4) DEFAULT PARTITION foo_def);
CREATE TABLE foo(a int) PARTITION BY HASH(a) CONFIGURATION (FOR VALUES 
WITH MODULUS 3);
CREATE TABLE foo(a int) PARTITION BY RAGE(a) CONFIGURATION (FOR VALUES 
FROM 1 TO 1000 INTERVAL 10 DEFAULT PARTITION foo_def);


In addition to that, adding RANGE PARTITION would be much simpler since 
we would have specific "branches" in gram instead of using 
context-sensitive grammar and dealing with it in c-code.

---
Best regards,
Maxim Orlov.

Re: On login trigger: take three

2020-12-21 Thread Konstantin Knizhnik




On 20.12.2020 10:04, Pavel Stehule wrote:



čt 17. 12. 2020 v 19:30 odesílatel Pavel Stehule 
mailto:pavel.steh...@gmail.com>> napsal:




čt 17. 12. 2020 v 14:04 odesílatel Konstantin Knizhnik
mailto:k.knizh...@postgrespro.ru>> napsal:



On 17.12.2020 9:31, Pavel Stehule wrote:



st 16. 12. 2020 v 20:38 odesílatel Pavel Stehule
mailto:pavel.steh...@gmail.com>>
napsal:

Attached please find new versoin of the patch based on
on_connect_event_trigger_WITH_SUGGESTED_UPDATES.patch

So there is still only
"disable_client_connection_trigger" GUC? because we
need possibility to disable client connect triggers
and there is no such need for other event types.

As you suggested  have added "dathaslogontriggers"
flag which is set when client connection trigger is
created.
This flag is never cleaned (to avoid visibility
issues mentioned in my previous mail).


This is much better - I don't see any slowdown when logon
trigger is not defined

I did some benchmarks and looks so starting language
handler is relatively expensive - it is about 25% and
starting handler like event trigger has about 35%. But
this problem can be solved later and elsewhere.

I prefer the inverse form of disable_connection_trigger.
Almost all GUC are in positive form - so
enable_connection_triggger is better

I don't think so current handling dathaslogontriggers is
good for production usage. The protection against race
condition can be solved by lock on pg_event_trigger


I thought about it, and probably the counter of connect
triggers will be better there. The implementation will be
simpler and more robust.





I prefer to implement different approach: unset
dathaslogontriggers  flag in event trigger itself when no
triggers are returned by EventTriggerCommonSetup.
I am using double checking to prevent race condition.
The main reason for this approach is that dropping of triggers
is not done in event_trigger.c: it is done by generic code for
all resources.
It seems to be there is no right place where decrementing of
trigger counters can be inserted.
Also I prefer to keep all logon-trigger specific code in one file.

Also, as you suggested, I renamed disable_connection_trigger
to enable_connection_trigger.


New version of the patch is attached.


yes, it can work


for complete review the new field in pg_doc should be documented



Done.
Also I fixed some examples in documentation which involves pg_database.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 5698413..010bff5 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2946,6 +2946,17 @@ SCRAM-SHA-256$:&l
 
  
   
+   dathaslogontriggers bool
+  
+  
+Indicates that there are client connection triggers defined for this database.
+This flag is used to avoid extra lookup of pg_event_trigger table on each backend startup.
+This flag is used internally by Postgres and should not be manually changed by DBA or application.
+  
+ 
+
+ 
+  
datconnlimit int4
   
   
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f810789..45d30b3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -996,6 +996,23 @@ include_dir 'conf.d'
   
  
 
+ 
+  enable_client_connection_trigger (boolean)
+  
+   enable_client_connection_trigger configuration parameter
+  
+  
+  
+   
+Enables firing the client_connection
+trigger when a client connects. This parameter is switched on by default.
+Errors in trigger code can prevent user to login to the system.
+I this case disabling this parameter in connection string can solve the problem:
+psql "dbname=postgres options='-c enable_client_connection_trigger=false'". 
+   
+  
+ 
+
  
  
 
diff --git a/doc/src/sgml/ecpg.sgml b/doc/src/sgml/ecpg.sgml
index 14dcbdb..965e497 100644
--- a/doc/src/sgml/ecpg.sgml
+++ b/doc/src/sgml/ecpg.sgml
@@ -4673,6 +4673,7 @@ datdba = 10 (type: 1)
 encoding = 0 (type: 5)
 datistemplate = t (type: 1)
 datallowconn = t (type: 1)
+dathaslogontriggers = f (type: 1)
 datconnlimit = -1 (type: 5)
 datlastsysoid = 11510 (type: 1)
 datfrozenxid = 379 (type: 1)
diff --git a/doc/src/sgml/event-trigger.sgml b/doc/src/sgml/event-trigger.sgml
index 60366a9..ae40a8e 100644
--- a/doc/src/sgml/event-

Invalidate acl.c caches for pg_authid.rolinherit changes

2020-12-21 Thread Noah Misch

Backends reflect "GRANT role_name" changes rather quickly, due to a syscache
invalidation callback.  Let's register an additional callback to reflect
"ALTER ROLE ... [NO]INHERIT" with equal speed.  I propose to back-patch this.
While pg_authid changes may be more frequent than pg_auth_members changes, I
expect neither is frequent enough to worry about the resulting acl.c cache
miss rate.

pg_authid changes don't affect cached_membership_roles, so I could have
invalidated cached_privs_roles only.  That felt like needless complexity.  I
expect cached_privs_role gets the bulk of traffic, since SELECT, INSERT,
UPDATE and DELETE use it.  cached_membership_roles pertains to DDL and such.
Author: Noah Misch 
Commit: Noah Misch 

Invalidate acl.c caches when pg_authid changes.

This makes existing sessions reflect "ALTER ROLE ... [NO]INHERIT" as
quickly as they have been reflecting "GRANT role_name".  Back-patch to
9.5 (all supported versions).

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/acl.c
index f97489f..fe6c444 100644
--- a/src/backend/utils/adt/acl.c
+++ b/src/backend/utils/adt/acl.c
@@ -52,7 +52,6 @@ typedef struct
  * role.  In most of these tests the "given role" is the same, namely the
  * active current user.  So we can optimize it by keeping a cached list of
  * all the roles the "given role" is a member of, directly or indirectly.
- * The cache is flushed whenever we detect a change in pg_auth_members.
  *
  * There are actually two caches, one computed under "has_privs" rules
  * (do not recurse where rolinherit isn't true) and one computed under
@@ -4675,12 +4674,16 @@ initialize_acl(void)
if (!IsBootstrapProcessingMode())
{
/*
-* In normal mode, set a callback on any syscache invalidation 
of
-* pg_auth_members rows
+* In normal mode, set a callback on any syscache invalidation 
of rows
+* of pg_auth_members (for each AUTHMEM search in this file) or
+* pg_authid (for has_rolinherit())
 */
CacheRegisterSyscacheCallback(AUTHMEMROLEMEM,
  
RoleMembershipCacheCallback,
  
(Datum) 0);
+   CacheRegisterSyscacheCallback(AUTHOID,
+ 
RoleMembershipCacheCallback,
+ 
(Datum) 0);
}
 }
 
diff --git a/src/test/regress/expected/privileges.out 
b/src/test/regress/expected/privileges.out
index 0a2dd37..7754c20 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -350,6 +350,13 @@ SET SESSION AUTHORIZATION regress_priv_user1;
 SELECT * FROM atest3; -- fail
 ERROR:  permission denied for table atest3
 DELETE FROM atest3; -- ok
+BEGIN;
+RESET SESSION AUTHORIZATION;
+ALTER ROLE regress_priv_user1 NOINHERIT;
+SET SESSION AUTHORIZATION regress_priv_user1;
+DELETE FROM atest3;
+ERROR:  permission denied for table atest3
+ROLLBACK;
 -- views
 SET SESSION AUTHORIZATION regress_priv_user3;
 CREATE VIEW atestv1 AS SELECT * FROM atest1; -- ok
diff --git a/src/test/regress/sql/privileges.sql 
b/src/test/regress/sql/privileges.sql
index e0c1a29..4911ad4 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -220,6 +220,12 @@ SET SESSION AUTHORIZATION regress_priv_user1;
 SELECT * FROM atest3; -- fail
 DELETE FROM atest3; -- ok
 
+BEGIN;
+RESET SESSION AUTHORIZATION;
+ALTER ROLE regress_priv_user1 NOINHERIT;
+SET SESSION AUTHORIZATION regress_priv_user1;
+DELETE FROM atest3;
+ROLLBACK;
 
 -- views

Re: Single transaction in the tablesync worker?

2020-12-21 Thread Peter Smith

On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila  wrote:

> Few other comments:
> ==

Thanks for your feedback.

> 1.
> * FIXME 3 - Crashed tablesync workers may also have remaining slots
> because I don't think
> + * such workers are even iterated by this loop, and nobody else is
> removing them.
> + */
> + if (slotname)
> + {
>
> The above FIXME is not clear to me. Actually, the crashed workers
> should restart, finish their work, and drop the slots. So not sure
> what exactly this FIXME refers to?

Yes, normally if the tablesync can complete it should behave like that.
But I think there are other scenarios where it may be unable to
clean-up after itself. For example:

i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
handled by tablesync can give a PK violation which also will crash
again and again for each re-launched/replacement tablesync worker.
This can be reproduced in the debugger. If the DropSubscription
doesn't clean-up the tablesync's slot then nobody will.

ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
ensure that the launcher doesn't restart new worker during dropping
the subscription". So executing DROP SUBSCRIPTION will prevent a newly
crashed tablesync from re-launching, so it won’t be able to take care
of its own slot. If the DropSubscription doesn't clean-up that
tablesync's slot then nobody will.

>
> 2.
> DropSubscription()
> {
> ..
> ReplicationSlotDropAtPubNode(
> + NULL,
> + conninfo, /* use conninfo to make a new connection. */
> + subname,
> + syncslotname);
> ..
> }
>
> With the above call, it will form a connection with the publisher and
> drop the required slots. I think we need to save the connection info
> so that we don't need to connect/disconnect for each slot to be
> dropped. Later in this function, we again connect and drop the apply
> worker slot. I think we should connect just once drop the apply and
> table sync slots if any.

OK. IIUC this is a suggestion for more efficient connection usage,
rather than actual bug right? I have added this suggestion to my TODO
list.

>
> 3.
> ReplicationSlotDropAtPubNode(WalReceiverConn *wrconn_given, char
> *conninfo, char *subname, char *slotname)
> {
> ..
> + PG_TRY();
> ..
> + PG_CATCH();
> + {
> + /* NOP. Just gobble any ERROR. */
> + }
> + PG_END_TRY();
>
> Why are we suppressing the error instead of handling it the error in
> the same way as we do while dropping the apply worker slot in
> DropSubscription?

This function is common - it is also called from the tablesync
finish_sync_worker. But in the finish_sync_worker case I wanted to
avoid throwing an ERROR which would cause the tablesync to crash and
relaunch (and crash/relaunch/repeat...) when all it was trying to do
in the first place was just cleanup and exit the process. Perhaps the
error suppression should be conditional depending where this function
is called from?

>
> 4.
> @@ -139,6 +141,28 @@ finish_sync_worker(void)
>   get_rel_name(MyLogicalRepWorker->relid;
>   CommitTransactionCommand();
>
> + /*
> + * Cleanup the tablesync slot.
> + */
> + {
> + extern void ReplicationSlotDropAtPubNode(
> + WalReceiverConn *wrconn_given, char *conninfo, char *subname, char 
> *slotname);
>
> This is not how we export functions at other places?

Fixed in latest v5 patch -
https://www.postgresql.org/message-id/CAHut%2BPvmDJ_EO11_up%3D_cRbOjhdWCMG-n7kF-mdRhjtCHcjHRA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: [PATCH] Logical decoding of TRUNCATE

2020-12-21 Thread Noah Misch

On Sun, Dec 20, 2020 at 03:54:31PM -0800, Peter Geoghegan wrote:
> On Sun, Dec 20, 2020 at 3:13 PM Andres Freund  wrote:
> > Hm. Do I understand correctly that this problem is hit solely because
> > the parallel mode code relies on there already have been a transaction
> > snapshot set, thus avoiding the error? And that the code normally only
> > works because GetTransactionSnapshot() will already have been called
> > somewhere, before EnterParallelMode()?

I think so.

> It seems unlikely that InitializeParallelDSM() was ever intended to be
> run in a background worker.

That wouldn't surprise me.  Nonetheless, when worker_spi runs parallel
queries, they work fine.  The logical replication worker experiences novel
scenarios, because it calls ExecuteTruncateGuts() directly, not as part of an
actual TRUNCATE query.  That bypasses some of the usual once-per-query setup.

On Mon, Dec 21, 2020 at 12:29:37PM +0530, Amit Kapila wrote:
> I think the TRUNCATE operation should not use parallelism either via
> apply worker or without it because there is nothing to scan in heap.

That's fair.

> Additionally, we can have an Assert or elog in InitializeParallelDSM
> to ensure that it is never invoked by parallel worker.

I don't know whether InitializeParallelDSM() operates correctly from inside a
parallel worker.  That is orthogonal to the bug here.

Re: Single transaction in the tablesync worker?

2020-12-21 Thread Peter Smith

On Sat, Dec 19, 2020 at 5:38 PM Amit Kapila  wrote:
>
> On Fri, Dec 18, 2020 at 6:41 PM Peter Smith  wrote:
> >
> > TODO / Known Issues:
> >
> > * the current implementation of tablesync drop slot (e.g. from
> > DropSubscription or finish_sync_worker) regenerates the tablesync slot
> > name so it knows what slot to drop.
> >
>
> If you always drop the slot at finish_sync_worker, then in which case
> do you need to drop it during DropSubscription? Is it when the table
> sync workers are crashed?

Yes. It is not the normal case. But if the tablesync never yet got to
SYNCDONE state (maybe crashed) then finish_sync_worker may not be
called.
So I think a rogue tablesync slot might still exist during DropSubscription.

>
> > The current code might be ok for
> > normal use cases, but if there is an ALTER SUBSCRIPTION ... SET
> > (slot_name = newname) it would fail to be able to find the tablesync
> > slot.
> >
>
> Sure, but the same will be true for the apply worker slot as well. I
> agree the problem would be more for table sync workers but I think we
> can solve it, see below.
>
> > * I think if there are crashed tablesync workers then they are not
> > known to DropSubscription. So this might be a problem to cleanup slots
> > and/or origin tracking belonging to those unknown workers.
> >
>
> Yeah, I think we can do two things to avoid this and the previous
> problem. (a) We can generate the slot_name for the table sync worker
> based on only subscription_id and rel_id. (b) Immediately after
> creating the slot, advance the replication origin with the position
> (origin_startpos) we get from walrcv_create_slot, this will help us to
> start from the right location.
>
> Do you see anything which will still not be addressed after doing the above?

(a) V5 Patch is updated as suggested.
(b) V5 Patch is updated as suggested. Now calling replorigin_advance.
No problems seen so far. All TAP tests pass, but more testing needed
for the origin stuff

>
> I understand why you are trying to create this patch atop logical
> decoding of 2PC patch but I think it is better to create this as an
> independent patch and then use it to test 2PC problem.

OK. The latest patch still applies to v30 just for my convenience
today, but I will head towards converting this to an independent patch
ASAP.

> Also, please
> explain what kind of testing you did to ensure that it works properly
> after the table sync worker restarts after the crash.

So far tested like this - I caused the tablesync to crash after
COPYDONE (but before SYNCDONE) by sending a row to cause a PK
violation while holding the tablesync at the CATCHUP state in the
debugger. The tablesync then handles the insert, encounters the PK
violation error, and re-launches. Then I can remove the extra row so
the PK violation does not happen, so the (re-launched) tablesync can
complete and finish normally. The Apply worker then takes over.

I have attached some captured/annotated logging of my test scenario
which I ran using the V4 patch (the log has a lot of extra temporary
output to help see what is going on)

---
Kind Regards,
Peter Smith.
Fujitsu Australia.
TEST CASE:

§  Insert some row to subscriber side only
§  Start the SUBSCRIPTION – to start tablesync worker
§  Let tablesync do the copy (so mode will be COPYDONE)
§  In debugger hold the tablesync worker at the CATCHUP point
§  Then issue INSERT (from publisher) to cause PK violation with same row that 
was previous inserted as subscriber side
§  Tablesync worker should ERROR when handling this duplicate insert
§  Then a *new* tablesync worker may be launched for same table
§  And this newtablesync worker should detect the state was at COPYDONE, and so 
this time it should bypass the initial copy phase. 
§  In debugger hold the (re-launched) tablesync worker at the CATCHUP point
§  Delete the subscriber row which was causeing the PK viloation
§  Let the tablesync proceed. Now it should be ableto handle replaciation and 
finish OK.
§  Then Apply worker takes over...

LOG FRAGMENTS BELOW
===

##
## Start with empty tables
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "TRUNCATE test_tab;"
TRUNCATE TABLE
[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "TRUNCATE test_tab;"
TRUNCATE TABLE

##
## Poulate 1 row data on publisher.
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "INSERT INTO test_tab VALUES(1, 
'foo');"
INSERT 0 1

##
## Add a row to subscription side only (this is to cause a subsequent PK 
violation)
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "INSERT INTO test_tab 
VALUES(2, 'bar');"
INSERT 0 1

##
## Start SUBSCRIPTION
## Attach to tablesync
## Get to CATCHUP mode
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub WITH (two_phase = on);"
2020-12-18 22:54:04.280 AEDT [18765] LOG:  logical decoding found consistent 
point at 0/1653608
2020-12-18 2

Re: Single transaction in the tablesync worker?

2020-12-21 Thread Peter Smith

Hi Amit.

PSA my v5 WIP patch for the Solution1.

This patch still applies onto the v30 patch set [1] from other 2PC thread:
[1]
https://www.postgresql.org/message-id/CAFPTHDYA8yE6tEmQ2USYS68kNt%2BkM%3DSwKgj%3Djy4AvFD5e9-UTQ%40mail.gmail.com

(I understand you would like this to be delivered as a separate patch
independent of v30. I will convert it ASAP)

Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* tablesync worked now allowing multiple tx instead of single tx

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

TODO / Known Issues:

* I think if there are crashed tablesync workers they may not be known
to DropSubscription current code. This might be a problem to cleanup
slots and/or origin tracking belonging to those unknown workers.

* Help / comments / cleanup

* There is temporary "!!>>" excessive logging of mine scattered around
which I added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia

v5-0001-2PC-change-tablesync-slot-to-use-same-two_phase-m.patch
Description: Binary data

v5-0002-WIP-patch-for-the-Solution1.patch
Description: Binary data

bad dependency in pg_dump output related to support function breaks binary upgrade

2020-12-21 Thread Pavel Stehule

Hi

some Orafce's user reported problems with pg_upgrade. I checked this issue
and it looks like pg_dump problem:


pg_restore: creating FUNCTION "public.nvarchar2("public"."nvarchar2",
integer, boolean)"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 612; 1255 33206 FUNCTION
nvarchar2("public"."nvarchar2", integer, boolean) pavel
pg_restore: error: could not execute query: ERROR:  function
public.nvarchar2_transform(internal) does not exist
Command was: CREATE FUNCTION "public"."nvarchar2"("public"."nvarchar2",
integer, boolean) RETURNS "public"."nvarchar2"
LANGUAGE "c" IMMUTABLE STRICT SUPPORT "public"."nvarchar2_transform"
AS '$libdir/orafce', 'nvarchar2';




--
-- TOC entry 612 (class 1255 OID 33206)
-- Name: nvarchar2("public"."nvarchar2", integer, boolean); Type: FUNCTION;
Schema: public; Owner: pavel
--

CREATE FUNCTION "public"."nvarchar2"("public"."nvarchar2", integer,
boolean) RETURNS "public"."nvarchar2"
LANGUAGE "c" IMMUTABLE STRICT SUPPORT "public"."nvarchar2_transform"
AS '$libdir/orafce', 'nvarchar2';

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "orafce" ADD FUNCTION
"public"."nvarchar2"("public"."nvarchar2", integer, boolean);


ALTER FUNCTION "public"."nvarchar2"("public"."nvarchar2", integer, boolean)
OWNER TO "pavel";

--
-- TOC entry 607 (class 1255 OID 33201)
-- Name: nvarchar2_transform("internal"); Type: FUNCTION; Schema: public;
Owner: pavel
--

CREATE FUNCTION "public"."nvarchar2_transform"("internal") RETURNS
"internal"
LANGUAGE "c" IMMUTABLE STRICT
AS '$libdir/orafce', 'orafce_varchar_transform';

-- For binary upgrade, handle extension membership the hard way
ALTER EXTENSION "orafce" ADD FUNCTION
"public"."nvarchar2_transform"("internal");


ALTER FUNCTION "public"."nvarchar2_transform"("internal") OWNER TO "pavel";

the supporting function should be dumped first before function where
supporting function is used.

Regards

Pavel

Re: Strange behavior with polygon and NaN