Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-12 Thread Michael Paquier
On Tue, Jan 12, 2021 at 12:10:24PM -0300, Alvaro Herrera wrote:
> Apparently b2a5545bd63f changed; before that commit, that code
> (including the quoted comment) was all in RemoveOldXlogFiles, and
> endlogSegNo was calculated only once.  But ISTM that even with that
> formulation it had the problem you point out.  The real problem is the
> loop hidden inside InstallXLogFileSegment().

I am not sure I get your point here.  The purposes of the two
incrementations in InstallXLogFileSegment() and RemoveXlogFile() are
different, but the former incrementation is what makes the recycling done
by the latter faster.
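
For reference, the loop hidden inside InstallXLogFileSegment() probes for a
free slot with stat() and advances the caller-supplied segment number past
slots that already exist.  A simplified sketch of that probe loop (not the
exact code):

    /* find a free slot to install the recycled segment into */
    while (stat(path, &stat_buf) == 0)
    {
        if ((*segno) >= max_segno)
            return false;       /* no free slot within the allowed range */
        (*segno)++;             /* the incrementation discussed above */
        XLogFilePath(path, ThisTimeLineID, *segno, wal_segment_size);
    }

Because *segno keeps its incremented value, a caller that preserves it across
recycled segments pays O(n) stat() calls in total, while recomputing the
starting segment for every file ends up O(n^2).
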
--
Michael


signature.asc
Description: PGP signature


RE: ResourceOwner refactoring

2021-01-12 Thread kuroda.hay...@fujitsu.com
Dear Heikki,

Thank you for rebasing it, I confirmed it can be applied.
I will check the source.

For now I have one very elementary comment.
ResourceOwnerEnlarge(), ResourceOwnerRemember(), and ResourceOwnerForget()
are exported routines.
They should be placed below L418.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Track replica origin progress for Rollback Prepared

2021-01-12 Thread Ajin Cherian
On Wed, Jan 13, 2021 at 12:11 AM Amit Kapila  wrote:

> Thanks for doing these tests. I think you can put an elog in the below
> code change as well to show that the recovery code path is also hit:
>
> +xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
> + XLogRecPtr lsn, RepOriginId origin_id)
>  {
> ...
> + if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
> + {
> + /* recover apply progress */
> + replorigin_advance(origin_id, parsed->origin_lsn, lsn,
> +false /* backward */, false /* WAL */);
> + }

I added the below log:
@@ -2299,6 +2309,14 @@ RecordTransactionAbortPrepared(TransactionId xid,

MyXactFlags | XACT_FLAGS_ACQUIREDACCE
xid, gid);

+   if (replorigin)
+   {
+   /* Move LSNs forward for this replication origin */
+   replorigin_session_advance(replorigin_session_origin_lsn,
+      XactLastRecEnd);
+   elog(LOG,"advance replorigin for abort prepared");
+   }
+
/* Always flush, since we're about to remove the 2PC state file */
XLogFlush(recptr);
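
The "recover apply progress" line in the recovery log further down presumably
comes from a similar elog added in the redo path quoted above, something along
these lines (the exact message text is whatever was used in this test build):

    if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
    {
        /* recover apply progress */
        replorigin_advance(origin_id, parsed->origin_lsn, lsn,
                           false /* backward */, false /* WAL */);
        elog(LOG, "recover apply progress in redo abort");
    }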

Then on server1:

postgres=# begin;
BEGIN
postgres=*# insert into tab values (5);
INSERT 0 1
postgres=*# prepare transaction 'test';
PREPARE TRANSACTION
postgres=# rollback prepared 'test';
ROLLBACK PREPARED

And I immediately stopped server 2 to prevent checkpoint:
pg_ctl  stop -m immediate
and restarted server 2:

I got the following logs:

2021-01-13 02:14:40.297 EST [4851] LOG:  listening on IPv6 address
"::1", port 54321
2021-01-13 02:14:40.297 EST [4851] LOG:  listening on IPv4 address
"127.0.0.1", port 54321
2021-01-13 02:14:40.323 EST [4851] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.54321"
2021-01-13 02:14:40.333 EST [4852] LOG:  database system was
interrupted; last known up at 2021-01-13 02:04:08 EST
2021-01-13 02:14:40.842 EST [4852] LOG:  recovered replication state
of node 1 to 0/1606C60
2021-01-13 02:14:40.843 EST [4852] LOG:  database system was not
properly shut down; automatic recovery in progress
2021-01-13 02:14:40.850 EST [4852] LOG:  redo starts at 0/160BFE8
2021-01-13 02:14:40.850 EST [4852] LOG:  recover apply progress in red
abort  < the log added for this
2021-01-13 02:14:40.850 EST [4852] CONTEXT:  WAL redo at 0/160C0E8 for
Transaction/ABORT_PREPARED: 533: 2021-01-13 02:14:20.911933-05
2021-01-13 02:14:40.853 EST [4852] LOG:  invalid record length at
0/160C160: wanted 24, got 0
2021-01-13 02:14:40.853 EST [4852] LOG:  redo done at 0/160C128 system
usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2021-01-13 02:14:40.868 EST [4851] LOG:  database system is ready to
accept connections
2021-01-13 02:14:40.897 EST [4887] LOG:  logical replication apply
worker for subscription "tap_sub" has started

I see that the same logs appear in the test cases that have been added as part of
https://www.postgresql.org/message-id/caa4ek1l3p4z+9wtk77mbdpkagr4gs2y3r1je7zevlqvf9ga...@mail.gmail.com

regards,
Ajin Cherian
Fujitsu Australia




Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-12 Thread Michael Paquier
On Tue, Jan 12, 2021 at 11:30:13PM +1300, Thomas Munro wrote:
> I haven't heard any user complaints, and I'd personally be happy with
> a fix on master only.

I have been looking again at that, and the rebased version that Andres
has provided would take care of that.  Any thoughts?
--
Michael
From 0a8fb2599283b9703466b79f0fd490b3a565ff62 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 13 Jan 2021 16:23:57 +0900
Subject: [PATCH] Fix O(recycled_segments^2) stat() calls while recycling WAL
 files.

Author: Michael Paquier 
Reviewed-By: Andres Freund 
Discussion: https://postgr.es/m/CAB7nPqTB3VcKSSrW2Qj59tYYR2H4+n=5pzbdwou+x9iqvnm...@mail.gmail.com
Backpatch:
---
 src/backend/access/transam/xlog.c | 60 +++
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..1bc6c9683d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -930,7 +930,7 @@ static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
-static void RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr);
+static void RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo, XLogSegNo recycleSegNo);
 static void UpdateLastRemovedPtr(char *filename);
 static void ValidateXLOGDirectoryStructure(void);
 static void CleanupBackupHistory(void);
@@ -4055,6 +4055,12 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	DIR		   *xldir;
 	struct dirent *xlde;
 	char		lastoff[MAXFNAMELEN];
+	XLogSegNo   endlogSegNo;
+	XLogSegNo   recycleSegNo;
+
+	/* Initialize info about where to try to recycle to */
+	XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
+	recycleSegNo = XLOGfileslop(lastredoptr);
 
 	/*
 	 * Construct a filename of the last segment to be kept. The timeline ID
@@ -4093,7 +4099,7 @@ RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 /* Update the last removed location in shared memory first */
 UpdateLastRemovedPtr(xlde->d_name);
 
-RemoveXlogFile(xlde->d_name, lastredoptr, endptr);
+RemoveXlogFile(xlde->d_name, &endlogSegNo, recycleSegNo);
 			}
 		}
 	}
@@ -4123,13 +4129,21 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 	struct dirent *xlde;
 	char		switchseg[MAXFNAMELEN];
 	XLogSegNo	endLogSegNo;
+	XLogSegNo	switchLogSegNo;
+	XLogSegNo	recycleSegNo;
 
-	XLByteToPrevSeg(switchpoint, endLogSegNo, wal_segment_size);
+	/*
+	 * Initialize info about where to begin the work. This will recycle,
+	 * somewhat arbitrarily, 10 future segments.
+	 */
+	XLByteToPrevSeg(switchpoint, switchLogSegNo, wal_segment_size);
+	XLByteToSeg(switchpoint, endLogSegNo, wal_segment_size);
+	recycleSegNo = endLogSegNo + 10;
 
 	/*
 	 * Construct a filename of the last segment to be kept.
 	 */
-	XLogFileName(switchseg, newTLI, endLogSegNo, wal_segment_size);
+	XLogFileName(switchseg, newTLI, switchLogSegNo, wal_segment_size);
 
 	elog(DEBUG2, "attempting to remove WAL segments newer than log file %s",
 		 switchseg);
@@ -4157,7 +4171,7 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 			 * - but seems safer to let them be archived and removed later.
 			 */
 			if (!XLogArchiveIsReady(xlde->d_name))
-RemoveXlogFile(xlde->d_name, InvalidXLogRecPtr, switchpoint);
+RemoveXlogFile(xlde->d_name, &endLogSegNo, recycleSegNo);
 		}
 	}
 
@@ -4167,36 +4181,22 @@ RemoveNonParentXlogFiles(XLogRecPtr switchpoint, TimeLineID newTLI)
 /*
  * Recycle or remove a log file that's no longer needed.
  *
- * endptr is current (or recent) end of xlog, and lastredoptr is the
- * redo pointer of the last checkpoint. These are used to determine
- * whether we want to recycle rather than delete no-longer-wanted log files.
- * If lastredoptr is not known, pass invalid, and the function will recycle,
- * somewhat arbitrarily, 10 future segments.
+ * segname is the name of the segment to recycle or remove.  endlogSegNo
+ * is the segment number of the current (or recent) end of WAL.  recycleSegNo
+ * is the segment number to recycle up to.
+ *
+ * endlogSegNo gets incremented if the segment is recycled so that it is not
+ * checked again by future callers of this function.
  */
 static void
-RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
+RemoveXlogFile(const char *segname, XLogSegNo *endlogSegNo,
+			   XLogSegNo recycleSegNo)
 {
 	char		path[MAXPGPATH];
 #ifdef WIN32
 	char		newpath[MAXPGPATH];
 #endif
 	struct stat statbuf;
-	XLogSegNo	endlogSegNo;
-	XLogSegNo	recycleSegNo;
-
-	if (wal_recycle)
-	{
-		/*
-		 * Initialize info about where to try to recycle to.
-		 */
-		XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
-		if (lastredoptr == InvalidXLogRecPtr)
-			recycleSegNo = endlogSegNo + 10;

Re: ResourceOwner refactoring

2021-01-12 Thread Michael Paquier
On Wed, Jan 13, 2021 at 09:18:57AM +0200, Heikki Linnakangas wrote:
> --- a/src/common/cryptohash_openssl.c
> +++ b/src/common/cryptohash_openssl.c
> +static ResourceOwnerFuncs cryptohash_funcs =
> +{
> + /* relcache references */
> + .name = "LLVM JIT context",
> + .phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .ReleaseResource = ResOwnerReleaseCryptoHash,
> + .PrintLeakWarning = ResOwnerPrintCryptoHashLeakWarning,
> +};
> +#endif

Looks like a copy-paste error here.
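
Presumably the intention was something along these lines (the exact name and
comment wording being the author's call):

	static ResourceOwnerFuncs cryptohash_funcs =
	{
		/* cryptohash contexts */
		.name = "cryptohash context",
		.phase = RESOURCE_RELEASE_BEFORE_LOCKS,
		.ReleaseResource = ResOwnerReleaseCryptoHash,
		.PrintLeakWarning = ResOwnerPrintCryptoHashLeakWarning,
	};
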
--
Michael


signature.asc
Description: PGP signature


Re: adding partitioned tables to publications

2021-01-12 Thread Amit Langote
On Tue, Jan 12, 2021 at 5:09 PM Amit Kapila  wrote:
>
> On Mon, Jan 11, 2021 at 5:44 PM Mark Zhao <875941...@qq.com> wrote:
> >
> > Thanks for your reply. The patch is exactly what I want.
> > My English name is Mark Zhao, which should be the current email name.
> >
>
> Pushed the fix.

Thanks Amit and Mark.  I hadn't realized at the time that the relation
descriptor leak was occurring, but good to see it tested and fixed.

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: ResourceOwner refactoring

2021-01-12 Thread Heikki Linnakangas

On 13/01/2021 03:55, kuroda.hay...@fujitsu.com wrote:

Dear Heikki,

I'm also interested in this patch, but it cannot be applied to the current 
HEAD...

$ git apply ~/v2-0001-Make-resowners-more-easily-extensible.patch
error: patch failed: src/common/cryptohash_openssl.c:57
error: src/common/cryptohash_openssl.c: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:1
error: src/include/utils/resowner_private.h: patch does not apply


Here's a rebased version. Thanks!

- Heikki
>From 31b1b4661823cf38b2d4c5931f96c477b6441271 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 16 Dec 2020 17:26:03 +0200
Subject: [PATCH v3 1/1] Make resowners more easily extensible.

Use a single array and hash, instead of one for each object kind.
---
 src/backend/access/common/tupdesc.c   |   42 +-
 src/backend/jit/jit.c |2 -
 src/backend/jit/llvm/llvmjit.c|   40 +-
 src/backend/storage/buffer/bufmgr.c   |   43 +-
 src/backend/storage/buffer/localbuf.c |2 +-
 src/backend/storage/file/fd.c |   44 +-
 src/backend/storage/ipc/dsm.c |   44 +-
 src/backend/storage/lmgr/lock.c   |2 +-
 src/backend/utils/cache/catcache.c|   98 ++-
 src/backend/utils/cache/plancache.c   |   50 +-
 src/backend/utils/cache/relcache.c|   39 +-
 src/backend/utils/cache/syscache.c|2 +-
 src/backend/utils/resowner/README |   19 +-
 src/backend/utils/resowner/resowner.c | 1139 +++--
 src/backend/utils/time/snapmgr.c  |   38 +-
 src/common/cryptohash_openssl.c   |   45 +-
 src/include/storage/buf_internals.h   |   15 +
 src/include/utils/catcache.h  |3 -
 src/include/utils/plancache.h |2 +
 src/include/utils/resowner.h  |   41 +-
 src/include/utils/resowner_private.h  |  105 ---
 21 files changed, 795 insertions(+), 1020 deletions(-)
 delete mode 100644 src/include/utils/resowner_private.h

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 902f59440cd..7fa44d5f7ee 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -29,9 +29,21 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/datum.h"
-#include "utils/resowner_private.h"
+#include "utils/resowner.h"
 #include "utils/syscache.h"
 
+/* ResourceOwner callbacks to hold tupledesc references  */
+static void ResOwnerReleaseTupleDesc(Datum res);
+static void ResOwnerPrintTupleDescLeakWarning(Datum res);
+
+static ResourceOwnerFuncs tupdesc_resowner_funcs =
+{
+	/* relcache references */
+	.name = "tupdesc reference",
+	.phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.ReleaseResource = ResOwnerReleaseTupleDesc,
+	.PrintLeakWarning = ResOwnerPrintTupleDescLeakWarning
+};
 
 /*
  * CreateTemplateTupleDesc
@@ -376,9 +388,10 @@ IncrTupleDescRefCount(TupleDesc tupdesc)
 {
 	Assert(tupdesc->tdrefcount >= 0);
 
-	ResourceOwnerEnlargeTupleDescs(CurrentResourceOwner);
+	ResourceOwnerEnlarge(CurrentResourceOwner);
 	tupdesc->tdrefcount++;
-	ResourceOwnerRememberTupleDesc(CurrentResourceOwner, tupdesc);
+	ResourceOwnerRemember(CurrentResourceOwner, PointerGetDatum(tupdesc),
+		  &tupdesc_resowner_funcs);
 }
 
 /*
@@ -394,7 +407,8 @@ DecrTupleDescRefCount(TupleDesc tupdesc)
 {
 	Assert(tupdesc->tdrefcount > 0);
 
-	ResourceOwnerForgetTupleDesc(CurrentResourceOwner, tupdesc);
+	ResourceOwnerForget(CurrentResourceOwner, PointerGetDatum(tupdesc),
+		&tupdesc_resowner_funcs);
 	if (--tupdesc->tdrefcount == 0)
 		FreeTupleDesc(tupdesc);
 }
@@ -925,3 +939,23 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
 
 	return desc;
 }
+
+
+/*
+ * ResourceOwner callbacks
+ */
+static void
+ResOwnerReleaseTupleDesc(Datum res)
+{
+	DecrTupleDescRefCount((TupleDesc) DatumGetPointer(res));
+}
+
+static void
+ResOwnerPrintTupleDescLeakWarning(Datum res)
+{
+	TupleDesc tupdesc = (TupleDesc) DatumGetPointer(res);
+
+	elog(WARNING,
+		 "TupleDesc reference leak: TupleDesc %p (%u,%d) still referenced",
+		 tupdesc, tupdesc->tdtypeid, tupdesc->tdtypmod);
+}
diff --git a/src/backend/jit/jit.c b/src/backend/jit/jit.c
index 2da300e000d..1aa04d173b4 100644
--- a/src/backend/jit/jit.c
+++ b/src/backend/jit/jit.c
@@ -26,7 +26,6 @@
 #include "jit/jit.h"
 #include "miscadmin.h"
 #include "utils/fmgrprotos.h"
-#include "utils/resowner_private.h"
 
 /* GUCs */
 bool		jit_enabled = true;
@@ -140,7 +139,6 @@ jit_release_context(JitContext *context)
 	if (provider_successfully_loaded)
 		provider.release_context(context);
 
-	ResourceOwnerForgetJIT(context->resowner, PointerGetDatum(context));
 	pfree(context);
 }
 
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index b0789a5fb80..64c7fd92bc7 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -40,7 +40,7 @@
 #include "portability/instr_time.h"
 #include "storage/ipc.h"
 #include "utils/memutils.h"
-#include "utils/resowner_private.h"
+#include 

Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread japin


On Wed, 13 Jan 2021 at 13:26, Amit Kapila wrote:
> On Tue, Jan 12, 2021 at 4:59 PM Bharath Rupireddy
>  wrote:
>>
>> On Tue, Jan 12, 2021 at 12:06 PM Amit Kapila  wrote:
>> > > Here's my analysis:
>> > > 1) in the publisher, alter publication drop table successfully
>> > > removes(PublicationDropTables) the table from the catalogue
>> > > pg_publication_rel
>> > > 2) in the subscriber, alter subscription refresh publication
>> > > successfully removes the table from the catalogue pg_subscription_rel
>> > > (AlterSubscription_refresh->RemoveSubscriptionRel)
>> > > so far so good
>> > >
>> >
>> > Here, it should register the worker to stop on commit, and then on
>> > commit it should call AtEOXact_ApplyLauncher to stop the apply worker.
>> > Once the apply worker is stopped, the corresponding WALSender will
>> > also be stopped. Something here is not happening as per expected
>> > behavior.
>>
>> On the subscriber, an entry for worker stop is created in 
>> AlterSubscription_refresh --> logicalrep_worker_stop_at_commit. At the end 
>> of txn, in AtEOXact_ApplyLauncher, we try to stop that worker, but it cannot 
>> be stopped because logicalrep_worker_find returns null 
>> (AtEOXact_ApplyLauncher --> logicalrep_worker_stop --> 
>> logicalrep_worker_find). The worker entry for that subscriber is having 
>> relid as 0 [1], due to which the following if condition will not be hit. The 
>> apply worker on the subscriber related to the subscription on which refresh 
>> publication was run is not closed. It looks like relid 0 is valid because it 
>> will be applicable only during the table sync phase, the comment in the 
>> LogicalRepWorker structure says that.
>>
>> And also, I think, expecting the apply worker to be closed this way doesn't 
>> make sense because the apply worker is a per-subscription base, and the 
>> subscription can have other tables too.
>>
>
> Okay, that makes sense. As responded to Li Japin, let's focus on
> figuring out why we are sending the changes from the publisher node in
> some cases and not in other cases.

After some analysis, I find that the dropped tables are always replicated to
the subscriber.
The difference is that if we drop the table from the publication and refresh
the publication (on the subscriber), the LogicalRepRelMapEntry seen in
should_apply_changes_for_rel() has its state set to SUBREL_STATE_UNKNOWN.

(gdb) p *rel
$2 = {remoterel = {remoteid = 16410, nspname = 0x5564fb0177c0 "public",
relname = 0x5564fb0177a0 "t1", natts = 1, attnames = 0x5564fb0177e0, 
atttyps = 0x5564fb017780,
replident = 100 'd', relkind = 0 '\000', attkeys = 0x0}, localrelvalid = 
true,
  localreloid = 16412, localrel = 0x7f78705da1b8, attrmap = 0x5564fb017800, 
updatable = false,
  *state = 0 '\000'*, statelsn = 0}

If we insert data between dropping the table from the publication and
refreshing the publication, the LogicalRepRelMapEntry state is always
SUBREL_STATE_READY.

(gdb) p *rel
$2 = {remoterel = {remoteid = 16410, nspname = 0x5564fb0177c0 "public",
relname = 0x5564fb0177a0 "t1", natts = 1, attnames = 0x5564fb0177e0, 
atttyps = 0x5564fb017780,
replident = 100 'd', relkind = 0 '\000', attkeys = 0x0}, localrelvalid = 
true,
  localreloid = 16412, localrel = 0x7f78705d9d38, attrmap = 0x5564fb017800, 
updatable = false,
  *state = 114 'r'*, statelsn = 23545672}

I will dig into why the state of LogicalRepRelMapEntry doesn't change in the second case.
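
For reference, should_apply_changes_for_rel() decides based on exactly that
state field; for the apply worker the test is roughly (simplified, from memory):

    /* apply changes only for tables that are ready (or just synced) */
    return (rel->state == SUBREL_STATE_READY ||
            (rel->state == SUBREL_STATE_SYNCDONE &&
             rel->statelsn <= remote_final_lsn));

So an entry whose state was reset to SUBREL_STATE_UNKNOWN (0, as in the first
dump) is skipped, while SUBREL_STATE_READY ('r', as in the second dump) lets
the change through.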

Any suggestion is welcome!

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Wrong usage of RelationNeedsWAL

2021-01-12 Thread Kyotaro Horiguchi
Hello.

Commit c6b92041d3 changed the definition of RelationNeedsWAL().

-#define RelationNeedsWAL(relation) \
-   ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation) \
+   ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
+    (XLogIsNeeded() || \
+     (relation->rd_createSubid == InvalidSubTransactionId && \
+      relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))

On the other hand I found this usage.

plancat.c:128 get_relation_info()
>   /* Temporary and unlogged relations are inaccessible during recovery. */
>   if (!RelationNeedsWAL(relation) && RecoveryInProgress())
>   ereport(ERROR,
>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>errmsg("cannot access temporary or unlogged relations during recovery")));

It works as expected only accidentally; the meaning is off.
The WAL-skipping optimization is irrelevant to the condition for this error.

I found five misuses in the tree. Please find the attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 2912e1db38ff034e27e4010eff5b2e5afcce3f85 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 13 Jan 2021 15:52:03 +0900
Subject: [PATCH] Fix misuses of RelationNeedsWAL

The definition of the macro RelationNeedsWAL has been changed by
c6b92041d3 to include conditions related to the WAL-skip optimization,
but some uses of the macro are not relevant to the optimization. Those
misuses are harmless for now as they are only executed while wal_level
>= replica or while WAL-skipping is inactive. However, this should be
corrected to prevent future hazards.
---
 src/backend/catalog/pg_publication.c |  2 +-
 src/backend/optimizer/util/plancat.c |  2 +-
 src/include/utils/rel.h  | 15 +++
 src/include/utils/snapmgr.h  |  2 +-
 4 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/src/backend/catalog/pg_publication.c b/src/backend/catalog/pg_publication.c
index 5f8e1c64e1..f3060a4cf1 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
  errdetail("System tables cannot be added to publications.")));
 
 	/* UNLOGGED and TEMP relations cannot be part of publication. */
-	if (!RelationNeedsWAL(targetrel))
+	if (!RelationIsWalLogged(targetrel))
 		ereport(ERROR,
 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da322b453e..0500efcdb9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	relation = table_open(relationObjectId, NoLock);
 
 	/* Temporary and unlogged relations are inaccessible during recovery. */
-	if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+	if (!RelationIsWalLogged(relation) && RecoveryInProgress())
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot access temporary or unlogged relations during recovery")));
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 10b63982c0..810806a542 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -552,16 +552,23 @@ typedef struct ViewOptions
 		(relation)->rd_smgr->smgr_targblock = (targblock); \
 	} while (0)
 
+/*
+ * RelationIsWalLogged
+ *		True if relation is WAL-logged.
+ */
+#define RelationIsWalLogged(relation)	\
+	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+
 /*
  * RelationNeedsWAL
- *		True if relation needs WAL.
+ *		True if relation needs WAL at the time.
  *
  * Returns false if wal_level = minimal and this relation is created or
  * truncated in the current transaction.  See "Skipping WAL for New
  * RelFileNode" in src/backend/access/transam/README.
  */
 #define RelationNeedsWAL(relation)		\
-	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&	\
+	(RelationIsWalLogged(relation) &&	\
 	 (XLogIsNeeded() ||	\
 	  (relation->rd_createSubid == InvalidSubTransactionId &&			\
 	   relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
@@ -619,7 +626,7 @@ typedef struct ViewOptions
  */
 #define RelationIsAccessibleInLogicalDecoding(relation) \
 	(XLogLogicalInfoActive() && \
-	 RelationNeedsWAL(relation) && \
+	 RelationIsWalLogged(relation) && \
 	 (IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation)))
 
 /*
@@ -635,7 +642,7 @@ typedef struct ViewOptions
  */
 #define 

Re: Single transaction in the tablesync worker?

2021-01-12 Thread Peter Smith
On Wed, Jan 13, 2021 at 1:07 PM Hou, Zhijie  wrote:
>
> > Also PSA some detailed logging evidence of some test scenarios involving
> > Drop/AlterSubscription:
> > + Test-20210112-AlterSubscriptionRefresh-ok.txt =
> > AlterSubscription_refresh which successfully drops a tablesync slot
> > + Test-20210112-AlterSubscriptionRefresh-warning.txt =
> > AlterSubscription_refresh gives WARNING that it cannot drop the tablesync
> > slot (which no longer exists)
> > + Test-20210112-DropSubscription-warning.txt = DropSubscription with a
> > disassociated slot_name gives a WARNING that it cannot drop the tablesync
> > slot (due to broken connection)
>
> Hi
>
> > * The AlterSubscription_refresh (v14+) is now more similar to 
> > DropSubscription w.r.t to stopping workers for any "removed" tables.
> I have an issue about the above feature.
>
> With the patch, it seems it does not stop the worker in the case of [1].
> I probably missed something, should we stop the worker in such case ?
>
> [1] 
> https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com
>

I am not exactly sure of the concern. (If the extra info below does
not help can you please describe your concern with more details).

This [v14] patch code/feature is only referring to the immediate
stopping of only the *** "tablesync" *** worker (if any) for any/each
table being removed from the subscription. It has nothing to say about
the "apply" worker of the subscription, which continues replicating as
before.

OTOH, I think the other mail problem is not really related to the
"tablesync" workers. As you can see (e.g. steps 7,8,9,10 of [2]), that
problem is described as continuing over multiple transactions to
replicate unexpected rows - I think this could only be done by the
subscription "apply" worker, and is after the "tablesync" worker has
gone away.

So AFAIK these are 2 quite unrelated problems, and would be solved
independently.

It just happens that they are both exposed using ALTER SUBSCRIPTION
... REFRESH PUBLICATION;


[v14] = 
https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[2] = 
https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: remove unneeded pstrdup in fetch_table_list

2021-01-12 Thread Amit Kapila
On Wed, Jan 13, 2021 at 8:11 AM Hou, Zhijie  wrote:
>
> Hi
>
> In the function fetch_table_list, we get the table names from the publisher
> and return a list of table names.
> When appending a name to the list, it uses the following code:
>
> **
> nspname = TextDatumGetCString(slot_getattr(slot, 1, &isnull));
> Assert(!isnull);
> relname = TextDatumGetCString(slot_getattr(slot, 2, &isnull));
> rv = makeRangeVar(pstrdup(nspname), pstrdup(relname), -1);
> tablelist = lappend(tablelist, rv);
> **
>
> The nspname and relname will be copied, which seems unnecessary,
> because nspname and relname are obtained from TextDatumGetCString.
> IMO, TextDatumGetCString returns a newly palloc'd string.
>
> **
> result = (char *) palloc(len + 1);
> memcpy(result, VARDATA_ANY(tunpacked), len);
> result[len] = '\0';
>
> if (tunpacked != t)
> pfree(tunpacked);
>
> return result;
> **
>

Your observation seems correct to me, though I have not tried to test
your patch.

-- 
With Regards,
Amit Kapila.




RE: Disable WAL logging to speed up data loading

2021-01-12 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> XLogBeginInsert();
> XLogSetRecordFlags(XLOG_MARK_ESSENTIAL);  # new flag value
> XLogInsert();

Oh, sounds like a nice idea.  That's more flexible by allowing WAL-emitting 
modules to specify which WAL records are mandatory even when wal_level is none.

For example, gistXLogAssignLSN() adds the above flag like this:

XLogBeginInsert();
XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT | XLOG_MARK_ESSENTIAL);
XLogRegisterData((char *) &dummy, sizeof(dummy));

(Here's a word play - unimportant but essential, what's that?)

And the filter in XLogInsert() becomes:

+   if (wal_level == WAL_LEVEL_NONE &&
+   !((rmid == RM_XLOG_ID && info == XLOG_CHECKPOINT_SHUTDOWN) ||
+ (rmid == RM_XLOG_ID && info == XLOG_PARAMETER_CHANGE) ||
+ (rmid == RM_XACT_ID && info == XLOG_XACT_PREPARE) ||
+ (curinsert_flags & XLOG_MARK_ESSENTIAL)))

Or,

+   if (wal_level == WAL_LEVEL_NONE &&
+!(curinsert_flags & XLOG_MARK_ESSENTIAL))

and add the new flag when emitting XLOG_CHECKPOINT_ONLINE, 
XLOG_PARAMETER_CHANGE and XLOG_PREPARE records.  I think both have good 
reasons: the former centralizes the handling of XACT and XLOG RM WAL records 
(as the current XLOG module does already), and the latter delegates the 
decision to each module.  Which would you prefer?  (I kind of like the former, 
but this is a weak opinion.)


Regards
Takayuki Tsunakawa







Re: [DOC] Document concurrent index builds waiting on each other

2021-01-12 Thread Michael Paquier
On Tue, Jan 12, 2021 at 04:51:39PM -0300, Alvaro Herrera wrote:
> I looked into this again, and I didn't like what I had added to
> maintenance.sgml at all.  It seems out of place where I put it; and I
> couldn't find any great spots.  Going back to your original proposal,
> what about something like this?  It's just one more para in the "notes"
> section in CREATE INDEX and REINDEX pages, without any additions to the
> VACUUM pages.

+1.
--
Michael


signature.asc
Description: PGP signature


Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Amit Kapila
On Wed, Jan 13, 2021 at 11:08 AM Bharath Rupireddy
 wrote:
>
> On Wed, Jan 13, 2021 at 10:33 AM Amit Kapila  wrote:
> >
> > On Tue, Jan 12, 2021 at 5:23 PM japin  wrote:
> > >
> > > On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
> > > > On Tue, Jan 12, 2021 at 4:47 PM Li Japin  wrote:
> > > >> IIUC the logical replication only replicate the tables in publication, 
> > > >> I think
> > > >> when the tables that aren't in publication should not be replicated.
> > > >>
> > > >> Attached the patch that fixes it.  Thought?
> > > >
> > > > With that change, we don't get the behaviour that's stated in the
> > > > document - "The ADD TABLE and DROP TABLE clauses will add and remove
> > > > one or more tables from the publication. Note that adding tables to a
> > > > publication that is already subscribed to will require a ALTER
> > > > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
> > > > order to become effective" -
> > > > https://www.postgresql.org/docs/devel/sql-alterpublication.html.
> > > >
> > >
> > > The documentation only emphasize adding tables to a publication, not
> > > include dropping tables from a publication.
> > >
> >
> > Right and I think that is ensured by the subscriber by calling
> > should_apply_changes_for_rel() which won't return true unless the
> > newly added relation is not synced via Refresh Publication. So, this
> > means with or without this patch we should be sending the changes of
> > the newly published table from the publisher?
>
> Oh, my bad, alter subscription...refresh publication is required only when 
> the tables are added to the publisher. The patch by japin makes the walsender 
> process stop sending the data to the subscriber/apply worker. The patch is 
> based on the idea of looking at the PUBLICATIONRELMAP in get_rel_sync_entry 
> when the entries have been invalidated in rel_sync_cache_publication_cb 
> because of alter publication...drop table.
> ,
> When the alter subscription...refresh publication is run on the subscriber, 
> the SUBSCRIPTIONRELMAP catalogue gets invalidated but the corresponding cache 
> entries in the LogicalRepRelMap which is used by logicalrep_rel_open are not 
> invalidated. LogicalRepRelMap is used to know the relations that are 
> associated with the subscription. But we seem to have not taken care of 
> invalidating those entries, though we have the invalidation callback 
> invalidate_syncing_table_states registered for SUBSCRIPTIONRELMAP in 
> ApplyWorkerMain. So, we miss to work on updating the entries in 
> LogicalRepRelMap.
>
> IMO, the ideal way to fix this issue is 1) stop the walsender sending the 
> changes to dropped tables, for this japin patch works 2) we must mark all the 
> LogicalRepRelMap entries as invalid in invalidate_syncing_table_states so 
> that in the next logicalrep_rel_open, if the entry is invalidated, then we 
> have to call GetSubscriptionRelState to get the latest state, as shown in 
> [1]. Likewise, we might also have to mark the cache entries invalid in  
> subscription_change_cb which is invalidation callback for pg_subscription
>

Is the second point you described here related to the originally reported
bug, or will it cause some other problem?

> Thoughts?
>
> [1] -
> if (entry->state != SUBREL_STATE_READY || entry->invalid)
> entry->state = GetSubscriptionRelState(MySubscription->oid,
>entry->localreloid,
> &entry->statelsn);
>
>if (entry->invalid)
>entry->invalid = false;
>
> return entry;
>
> > I have another question on your patch which is why in some cases like
> > when we have not inserted in step-5 (as mentioned by you) the
> > following insertions are not sent. Is somehow we are setting the
> > pubactions as false in that case, if so, how?
>
> The reason is that the issue reported in this thread occurs when we have 
> the walsender process running and RelationSyncCache initialized, we have inserted 
> some data into the table that's sent to the subscriber, and the table is then 
> dropped: we fail to set the pubactions to false in get_rel_sync_entry, though 
> the cache entries have been invalidated.
>
> In some cases, it works properly because the old walsender process was 
> stopped and when a new walsender is started, then the cache RelationSyncCache 
> gets initialized again and the pubactions will be set to false in 
> get_rel_sync_entry.
>

Why was the walsender process getting stopped in one case but not in the other?

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2021-01-12 Thread Peter Smith
On Mon, Jan 4, 2021 at 10:48 PM Amit Kapila  wrote:
> 7.
> @@ -905,7 +905,7 @@ replorigin_advance(RepOriginId node,
>   LWLockAcquire(&replication_state->lock, LW_EXCLUSIVE);
>
>   /* Make sure it's not used by somebody else */
> - if (replication_state->acquired_by != 0)
> + if (replication_state->acquired_by != 0 &&
> replication_state->acquired_by != MyProcPid)
>   {
>
> I think you won't need this change if you do replorigin_advance before
> replorigin_session_setup in your patch.
>

As you know the replorigin_session_setup sets the
replication_state->acquired_by to be the current PID. So without this
change the replorigin_advance rejects that same slot state thinking
that it is already active for a different process. Root problem is
that the same process/PID calling both functions would hang. So this
patch change allows replorigin_advance code to be called by self.

IIUC that acquired_by check condition is like a sanity check for the
originid passed. The patched code does just what the comment
says:
"/* Make sure it's not used by somebody else */"
Doesn't "somebody else" mean "anyone but me" (i.e. anyone but MyProcPid)?

Also, “setup” of a thing generally comes before usage of that thing,
so won't it seem strange to do (like the suggestion) and deliberately
call the "setup" function 2nd instead of 1st?

Can you please explain why is it better to do it the suggested way
(switch the calls around) than keep the patch code? Probably there is
a good reason but I am just not understanding it.


Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Bharath Rupireddy
On Wed, Jan 13, 2021 at 10:33 AM Amit Kapila 
wrote:
>
> On Tue, Jan 12, 2021 at 5:23 PM japin  wrote:
> >
> > On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
> > > On Tue, Jan 12, 2021 at 4:47 PM Li Japin  wrote:
> > >> IIUC the logical replication only replicate the tables in
publication, I think
> > >> when the tables that aren't in publication should not be replicated.
> > >>
> > >> Attached the patch that fixes it.  Thought?
> > >
> > > With that change, we don't get the behaviour that's stated in the
> > > document - "The ADD TABLE and DROP TABLE clauses will add and remove
> > > one or more tables from the publication. Note that adding tables to a
> > > publication that is already subscribed to will require a ALTER
> > > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
> > > order to become effective" -
> > > https://www.postgresql.org/docs/devel/sql-alterpublication.html.
> > >
> >
> > The documentation only emphasize adding tables to a publication, not
> > include dropping tables from a publication.
> >
>
> Right and I think that is ensured by the subscriber by calling
> should_apply_changes_for_rel() which won't return true unless the
> newly added relation is not synced via Refresh Publication. So, this
> means with or without this patch we should be sending the changes of
> the newly published table from the publisher?

Oh, my bad, alter subscription...refresh publication is required only when
the tables are added to the publisher. The patch by japin makes the walsender
process stop sending the data to the subscriber/apply worker. The patch
is based on the idea of looking at the PUBLICATIONRELMAP in
get_rel_sync_entry when the entries have been invalidated in
rel_sync_cache_publication_cb because of alter publication...drop table.
When the alter subscription...refresh publication is run on the subscriber,
the SUBSCRIPTIONRELMAP catalogue gets invalidated but the corresponding
cache entries in the LogicalRepRelMap which is used by logicalrep_rel_open
are not invalidated. LogicalRepRelMap is used to know the relations that
are associated with the subscription. But we seem to have not taken care of
invalidating those entries, though we have the invalidation callback
invalidate_syncing_table_states registered for SUBSCRIPTIONRELMAP in
ApplyWorkerMain. So, we miss to work on updating the entries in
LogicalRepRelMap.

IMO, the ideal way to fix this issue is 1) stop the walsender sending the
changes to dropped tables, for this japin patch works 2) we must mark all
the LogicalRepRelMap entries as invalid in invalidate_syncing_table_states
so that in the next logicalrep_rel_open, if the entry is invalidated, then
we have to call GetSubscriptionRelState to get the latest state, as shown
in [1]. Likewise, we might also have to mark the cache entries invalid in
 subscription_change_cb which is invalidation callback for pg_subscription

Thoughts?

[1] -
if (entry->state != SUBREL_STATE_READY *|| entry->invalid*)
entry->state = GetSubscriptionRelState(MySubscription->oid,
   entry->localreloid,
   &entry->statelsn);

   if (entry->invalid)
   entry->invalid = false;

return entry;

> I have another question on your patch which is why in some cases like
> when we have not inserted in step-5 (as mentioned by you) the
> following insertions are not sent. Is somehow we are setting the
> pubactions as false in that case, if so, how?

The reason is that the issue reported in this thread occurs when we have
the walsender process running and RelationSyncCache initialized, we have
inserted some data into the table that's sent to the subscriber, and the
table is then dropped: we fail to set the pubactions to false in
get_rel_sync_entry, though the cache entries have been invalidated.

In some cases, it works properly because the old walsender process was
stopped and when a new walsender is started, then the cache
RelationSyncCache gets initialized again and the pubactions will be set to
false in get_rel_sync_entry.

if (!found)
{
/* immediately make a new entry valid enough to satisfy callbacks */
entry->schema_sent = false;
entry->streamed_txns = NIL;
entry->replicate_valid = false;
entry->pubactions.pubinsert = entry->pubactions.pubupdate =
entry->pubactions.pubdelete = entry->pubactions.pubtruncate =
false;
entry->publish_as_relid = InvalidOid;
}

Hope that clarifies why the issue happens in some cases and not in other
cases.

> > > The publisher stops sending the tuples whenever the relation gets
> > > dropped from the publication, not waiting until alter subscription ...
> > > refresh publication on the subscriber.
> > >
> >
> > If we want to wait the subscriber executing alter subscription ...
refresh publication,
> > maybe we should send some feedback to walsender.  How can we send this
feedback to
> > walsender in 

Re: Movement of restart_lsn position movement of logical replication slots is very slow

2021-01-12 Thread Amit Kapila
On Tue, Jan 12, 2021 at 9:15 AM Jammie  wrote:
>
> Hi Amit,
> Thanks for the response .
> Can you please let me know what pg_current_wal_lsn returns ?
>
> is this position the LSN of the next log record to be created, or is it the 
> LSN of the last log record already created and inserted in the log?
>

This is the position up to which we have already written the WAL to
the kernel but not yet flushed to disk.
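
For reference, the function is essentially a thin wrapper around the write
pointer; roughly (from memory):

Datum
pg_current_wal_lsn(PG_FUNCTION_ARGS)
{
	XLogRecPtr	current_recptr;

	if (RecoveryInProgress())
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("recovery is in progress"),
				 errhint("WAL control functions cannot be executed during recovery.")));

	/* position written out to the kernel, not necessarily flushed */
	current_recptr = GetXLogWriteRecPtr();

	PG_RETURN_LSN(current_recptr);
}

If you need the position guaranteed to be flushed to disk, use
pg_current_wal_flush_lsn(); pg_current_wal_insert_lsn() gives the current
insert position instead.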

-- 
With Regards,
Amit Kapila.




RE: [Patch] Optimize dropping of relation buffers using dlist

2021-01-12 Thread k.jami...@fujitsu.com
On Wed, January 13, 2021 2:15 PM (JST), Amit Kapila wrote:
> On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila
> >  wrote in
> > > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
> > >  wrote:
> > > >
> > > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com"
>  wrote in:
> > > > > > Thanks for the detailed tests. NBuffers/32 seems like an
> > > > > > appropriate value for the threshold based on these results. I
> > > > > > would like to slightly modify part of the commit message in
> > > > > > the first patch as below [1], otherwise, I am fine with the
> > > > > > changes. Unless you or anyone else has any more comments, I am
> > > > > > planning to push the 0001 and 0002 sometime next week.
> > > > > >
> > > > > > [1]
> > > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so
> > > > > > that scanning of the whole buffer pool can be avoided when the
> > > > > > number of blocks to be truncated in a relation is below a
> > > > > > certain threshold. For such cases, we find the buffers by doing
> lookups in BufMapping table.
> > > > > > This improves the performance by more than 100 times in many
> > > > > > cases when several small tables (tested with 1000 relations)
> > > > > > are truncated and where the server is configured with a large
> > > > > > value of shared buffers (greater than 100GB)."
> > > > >
> > > > > Thank you for taking a look at the results of the tests. And
> > > > > it's also consistent with the results from Tang too.
> > > > > The commit message LGTM.
> > > >
> > > > +1.
> > > >
> > >
> > > I have pushed the 0001.
> >
> > Thank you for committing this.
> >
> 
> Pushed 0002 as well.
> 

Thank you very much for committing those two patches, and thanks to everyone here
who contributed to simplifying the approaches, code reviews, testing, etc.

I compiled with --enable-coverage and checked whether the newly-added code and
updated parts were covered by tests.
Yes, the lines were hit, including the updated lines of DropRelFileNodeBuffers(),
DropRelFileNodesAllBuffers(), smgrdounlinkall(), and smgrnblocks().
Newly added APIs were covered too: FindAndDropRelFileNodeBuffers() and
smgrnblocks_cached().
However, the parts where UnlockBufHdr(bufHdr, buf_state) is called are not hit,
but I noticed that the same is true in previously existing functions in bufmgr.c.

Thank you very much again.

Regards,
Kirk Jamison



Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Amit Kapila
On Tue, Jan 12, 2021 at 4:59 PM Bharath Rupireddy
 wrote:
>
> On Tue, Jan 12, 2021 at 12:06 PM Amit Kapila  wrote:
> > > Here's my analysis:
> > > 1) in the publisher, alter publication drop table successfully
> > > removes(PublicationDropTables) the table from the catalogue
> > > pg_publication_rel
> > > 2) in the subscriber, alter subscription refresh publication
> > > successfully removes the table from the catalogue pg_subscription_rel
> > > (AlterSubscription_refresh->RemoveSubscriptionRel)
> > > so far so good
> > >
> >
> > Here, it should register the worker to stop on commit, and then on
> > commit it should call AtEOXact_ApplyLauncher to stop the apply worker.
> > Once the apply worker is stopped, the corresponding WALSender will
> > also be stopped. Something here is not happening as per expected
> > behavior.
>
> On the subscriber, an entry for worker stop is created in 
> AlterSubscription_refresh --> logicalrep_worker_stop_at_commit. At the end of 
> txn, in AtEOXact_ApplyLauncher, we try to stop that worker, but it cannot be 
> stopped because logicalrep_worker_find returns null (AtEOXact_ApplyLauncher 
> --> logicalrep_worker_stop --> logicalrep_worker_find). The worker entry for 
> that subscriber is having relid as 0 [1], due to which the following if 
> condition will not be hit. The apply worker on the subscriber related to the 
> subscription on which refresh publication was run is not closed. It looks 
> like relid 0 is valid because it will be applicable only during the table 
> sync phase, the comment in the LogicalRepWorker structure says that.
>
> And also, I think, expecting the apply worker to be closed this way doesn't 
> make sense because the apply worker is a per-subscription base, and the 
> subscription can have other tables too.
>

Okay, that makes sense. As responded to Li Japin, let's focus on
figuring out why we are sending the changes from the publisher node in
some cases and not in other cases.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2021-01-12 Thread Amit Kapila
On Wed, Jan 13, 2021 at 7:39 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila  
> wrote in
> > On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
> >  wrote:
> > >
> > > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" 
> > >  wrote in:
> > > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> > > > > value for the threshold based on these results. I would like to
> > > > > slightly modify part of the commit message in the first patch as below
> > > > > [1], otherwise, I am fine with the changes. Unless you or anyone else
> > > > > has any more comments, I am planning to push the 0001 and 0002
> > > > > sometime next week.
> > > > >
> > > > > [1]
> > > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that
> > > > > scanning of the whole buffer pool can be avoided when the number of
> > > > > blocks to be truncated in a relation is below a certain threshold. For
> > > > > such cases, we find the buffers by doing lookups in BufMapping table.
> > > > > This improves the performance by more than 100 times in many cases
> > > > > when several small tables (tested with 1000 relations) are truncated
> > > > > and where the server is configured with a large value of shared
> > > > > buffers (greater than 100GB)."
> > > >
> > > > Thank you for taking a look at the results of the tests. And it's also
> > > > consistent with the results from Tang too.
> > > > The commit message LGTM.
> > >
> > > +1.
> > >
> >
> > I have pushed the 0001.
>
> Thank you for committing this.
>

Pushed 0002 as well.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Amit Kapila
On Tue, Jan 12, 2021 at 5:23 PM japin  wrote:
>
> On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
> > On Tue, Jan 12, 2021 at 4:47 PM Li Japin  wrote:
> >> IIUC the logical replication only replicate the tables in publication, I 
> >> think
> >> when the tables that aren't in publication should not be replicated.
> >>
> >> Attached the patch that fixes it.  Thought?
> >
> > With that change, we don't get the behaviour that's stated in the
> > document - "The ADD TABLE and DROP TABLE clauses will add and remove
> > one or more tables from the publication. Note that adding tables to a
> > publication that is already subscribed to will require a ALTER
> > SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
> > order to become effective" -
> > https://www.postgresql.org/docs/devel/sql-alterpublication.html.
> >
>
> The documentation only emphasize adding tables to a publication, not
> include dropping tables from a publication.
>

Right and I think that is ensured by the subscriber by calling
should_apply_changes_for_rel() which won't return true unless the
newly added relation is not synced via Refresh Publication. So, this
means with or without this patch we should be sending the changes of
the newly published table from the publisher?

I have another question on your patch which is why in some cases like
when we have not inserted in step-5 (as mentioned by you) the
following insertions are not sent. Is somehow we are setting the
pubactions as false in that case, if so, how?

> > The publisher stops sending the tuples whenever the relation gets
> > dropped from the publication, not waiting until alter subscription ...
> > refresh publication on the subscriber.
> >
>
> If we want to wait for the subscriber to execute alter subscription ... refresh 
> publication,
> maybe we should send some feedback to the walsender.  How can we send this 
> feedback to
> the walsender in a non-walreceiver process?
>

I don't think we need this if what I said above is correct.

-- 
With Regards,
Amit Kapila.




Re: [PATCH] Feature improvement for CLOSE, FETCH, MOVE tab completion

2021-01-12 Thread Fujii Masao
On Tue, Jan 12, 2021 at 11:09 AM Fujii Masao  wrote:
>
> On Tue, Jan 12, 2021 at 10:00 AM Masahiko Sawada  
> wrote:
> >
> > On Mon, Jan 11, 2021 at 11:00 PM Peter Eisentraut
> >  wrote:
> > >
> > > On 2021-01-05 10:56, Masahiko Sawada wrote:
> > > > BTW according to the documentation, the options of DECLARE statement
> > > > (BINARY, INSENSITIVE, SCROLL, and NO SCROLL) are order-sensitive.
> > > >
> > > > DECLARE name [ BINARY ] [ INSENSITIVE ] [ [ NO ] SCROLL ]
> > > >  CURSOR [ { WITH | WITHOUT } HOLD ] FOR query
> > > >
> > > > But I realized that these options are actually order-insensitive. For
> > > > instance, we can declare a cursor like:
> > > >
> > > > =# declare abc scroll binary cursor for select * from pg_class;
> > > > DECLARE CURSOR
> > > >
> > > > The both parser code and documentation has been unchanged from 2003.
> > > > Is it a documentation bug?
> > >
> > > According to the SQL standard, the ordering of the cursor properties is
> > > fixed.  Even if the PostgreSQL parser offers more flexibility, I think
> > > we should continue to encourage writing the clauses in the standard order.
> >
> > Thanks for your comment. Agreed.
> >
> > So regarding the tab completion for DECLARE statement, perhaps it
> > would be better to follow the documentation?
>
> IMO yes because it's less confusing to make the document and
> tab-completion consistent.

I updated the patch that way. Could you review this version?

Regards,

-- 
Fujii Masao


fix_tab_complete_close_fetch_move_v6.patch
Description: Binary data


RE: Terminate the idle sessions

2021-01-12 Thread kuroda.hay...@fujitsu.com
Dear Tom,

> So I propose to change the new ERRCODE_IDLE_SESSION_TIMEOUT to be in
> class 57 and call it good.

I agree with your suggestion, and I confirmed your commit.
Thanks!

Hayato Kuroda
FUJITSU LIMITED





Re: Disable WAL logging to speed up data loading

2021-01-12 Thread movead li
I read the patch and have two points:

1. I did a basebackup of the database, then switched wal_level from logical to
none and back to logical, and of course I archived the WAL segments. Next I did
PITR based on the basebackup; as a result it started up successfully with a
warning saying that data may be missing.

Because the 'none' level is meant for bulk-loading data, do you think it is good
that we still support recovery from a 'none' wal_level?

2. I just set wal_level to 'none' but the server failed to start up; it succeeded
after I dropped the publication and its subscription, set max_wal_senders to 0,
and dropped the replication slot. I think it is worth documenting how to start
up a database that has run with wal_level 'none'.

Re: Logical decoding without slots: decoding in lockstep with recovery

2021-01-12 Thread Craig Ringer
On Sat, 26 Dec 2020 at 06:51, Andres Freund  wrote:

> Hi,
>
> On 2020-12-23 14:56:07 +0800, Craig Ringer wrote:
> > I want to share an idea I've looked at a few times where I've run into
> > situations where logical slots were inadvertently dropped, or where it
> > became necessary to decode changes in the past on a slot.
> >
> > As most of you will know you can't just create a logical slot in the
> past.
> > Even if it was permitted, it'd be unsafe due to catalog_xmin retention
> > requirements and missing WAL.
> >
> > But if we can arrange a physical replica to replay the WAL of interest
> and
> > decode each commit as soon as it's replayed by the startup process, we
> know
> > the needed catalog rows must all exist, so it's safe to decode the
> change.
> >
> > So it should be feasible to run logical decoding in standby, even
> without a
> > replication slot, so long as we:
> >
> > * pause startup process after each xl_xact_commit
> > * wake the walsender running logical decoding
> > * decode and process until ReorderBufferCommit for the just-committed
> xact
> > returns
> > * wake the startup process to decode the up to the next commit
>
> I don't think it's safe to just do this for each xl_xact_commit - we can
> remove needed rows at quite a few places, not just around transaction
> commit.


Good point.

I vaguely recall spotting a possible decoding-on-standby issue with eager
removal of rows that are still ahead of the global xmin if the primary
"knows" they can't be needed based on info about currently running backends. But
when looking over code related to HOT, visibility, and vacuum now I can't
for the life of me remember exactly what it was or find it. Hopefully I
just misunderstood at the time or was getting confused between decoding on
standby and xact streaming.


> Rows needed to correctly decode rows earlier in the transaction
> might not be available by the time the commit record was logged.
>

When can that happen?

I think you'd basically have to run logical decoding in lockstep with
> WAL replay, i.e. replay one record, then call logical decoding for that
> record, replay the next record, ...
>

That sounds likely to be unusably slow. The only way I can see it having
any hope of moving at a reasonable rate would be to run a decoding session
inside the startup process itself so we don't have to switch back/forth for
each record. But I imagine that'd probably cause a whole other set of
problems.

> Can anyone see any obvious problem with this?
>
> The patch for logical decoding on the standby
> https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de
> should provide some of the infrastructure to do this properly. Should
> really commit it. /me adds an entry to the top of the todo list.
>

That would certainly be helpful for quite a number of things.


Re: A failure of standby to follow timeline switch

2021-01-12 Thread Fujii Masao
On Wed, Jan 13, 2021 at 10:48 AM Kyotaro Horiguchi
 wrote:
>
> At Tue, 12 Jan 2021 10:47:21 +0900, Fujii Masao  wrote 
> in
> > On Sat, Jan 9, 2021 at 5:08 AM Alvaro Herrera  
> > wrote:
> > >
> > > Masao-san: Are you intending to act as committer for these?  Since the
> > > bug is mine I can look into it, but since you already did all the
> > > reviewing work, I'm good with you giving it the final push.
> >
> > Thanks! I'm thinking to push the patch.
> >
> >
> > > 0001 looks good to me; let's get that one committed quickly so that we
> > > can focus on the interesting stuff.  While the implementation of
> > > find_in_log is quite dumb (not this patch's fault), it seems sufficient
> > > to deal with small log files.  We can improve the implementation later,
> > > if needed, but we have to get the API right on the first try.
> > >
> > > 0003: The fix looks good to me.  I verified that the test fails without
> > > the fix, and it passes with the fix.
> >
> > Yes.
> >
> >
> > > The test added in 0002 is a bit optimistic regarding timing, as well as
> > > potentially slow; it loops 1000 times and sleeps 100 milliseconds each
> > > time.  In a very slow server (valgrind or clobber_cache animals) this
> > > could not be sufficient time, while on fast servers it may end up
> > > waiting longer than needed.  Maybe we can do something like this:
> >
> > On second thought, I think that the regression test should be in
> > 004_timeline_switch.pl instead of 001_stream_rep.pl because it's
>
> Agreed. It's definitely the right place.
>
> > the test about timeline switch. Also I'm thinking that it's better to
> > test the timeline switch by checking whether some data is successfully
> > replicatead like the existing regression test for timeline switch in
> > 004_timeline_switch.pl does, instead of finding the specific message
> > in the log file. I attached the POC patch. Thought?
>
> It's practically a check on this issue, and looks better. The 180s
> timeout in the failure case seems a bit annoying, but that's the way all
> tests of this kind work.

Yes.

>
> The last check on table content is actually useless, but it might make
> sense to confirm that replication is actually working. However, I
> don't think the test needs to insert as many as 1000 tuples. Just
> a single tuple would suffice.

Thanks for the review!
I'm ok with this change (i.e., insert only single row).
Attached is the updated version of the patch.

Regards,

-- 
Fujii Masao


v6_follow_timeline_switch.patch
Description: Binary data


remove unneeded pstrdup in fetch_table_list

2021-01-12 Thread Hou, Zhijie
Hi

In the function fetch_table_list, it gets the table names from the publisher and
returns a list of table names.
When appending a name to the list, it uses the following code:

**
nspname = TextDatumGetCString(slot_getattr(slot, 1, &isnull));
Assert(!isnull);
relname = TextDatumGetCString(slot_getattr(slot, 2, &isnull));
rv = makeRangeVar(pstrdup(nspname), pstrdup(relname), -1);
tablelist = lappend(tablelist, rv);
**

The nspname and relname will be copied, which seems unnecessary,
because nspname and relname are obtained from TextDatumGetCString.
IMO, TextDatumGetCString returns a newly palloc'd string:

**
result = (char *) palloc(len + 1);
memcpy(result, VARDATA_ANY(tunpacked), len);
result[len] = '\0';

if (tunpacked != t)
pfree(tunpacked);

return result;
**

This may hurt performance when a lot of tables are replicated.
So I tried to fix this.
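
For illustration, the change boils down to something like this (just a sketch;
the attached patch is the authoritative version):

**
nspname = TextDatumGetCString(slot_getattr(slot, 1, &isnull));
Assert(!isnull);
relname = TextDatumGetCString(slot_getattr(slot, 2, &isnull));
/* no pstrdup: TextDatumGetCString() already returned freshly palloc'd copies */
rv = makeRangeVar(nspname, relname, -1);
tablelist = lappend(tablelist, rv);
**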

Best regards,
houzj








0001-remove-unneeded-pstrdup.patch
Description: 0001-remove-unneeded-pstrdup.patch


Re: Tid scan improvements

2021-01-12 Thread Edmund Horner
On Fri, 1 Jan 2021 at 14:30, David Fetter  wrote:

> On Sun, Dec 01, 2019 at 11:34:16AM +0900, Michael Paquier wrote:
> > Okay, still nothing has happened after two months.  Once this is
> > solved a new patch submission could be done.  For now I have marked
> > the entry as returned with feedback.
>
> I dusted off the last version of the patch without implementing the
> suggestions in this thread between here and there.
>
> I think this is a capability worth having, as I was surprised when it
> turned out that it didn't exist when I was looking to make an
> improvement to pg_dump. My idea, which I'll get back to when this is
> in, was to use special knowledge of heap AM tables to make it possible
> to parallelize dumps of large tables by working separately on each
> underlying file, of which there could be quite a few for a large one.
>
> Will try to understand the suggestions upthread better and implement
> same.
>

Hi David,

Thanks for updating the patch.  I'd be very happy if this got picked up
again, and I'd certainly follow along and do some review.

+   splan->tidrangequals =
+       fix_scan_list(root, splan->tidrangequals,
+                     rtoffset, 1);   /* v9_tid XXX Not sure this is right */

I'm pretty sure the parameter num_exec = 1 is fine; it seems to only affect
correlated subselects, which shouldn't really make their way into the
tidrangequals expressions.  It's more or less the same situation as
tidquals for TidPath, anyway.  We could put a little comment:  /*
correlated subselects shouldn't get into tidquals/tidrangequals */ or
something to that effect.

Edmund


Re: Key management with tests

2021-01-12 Thread Neil Chen
Thank you for your reply,

On Wed, Jan 13, 2021 at 12:08 AM Stephen Frost  wrote:

>
> No, we can't 'modify the page format as we wish'- if we change away from
> using a C structure then we're going to be modifying quite a bit of
> code which otherwise doesn't need to be changed.  The proposed flag
> doesn't actually make a different page format work, the only thing it
> would do would be to allow some parts of the cluster to be encrypted and
> other parts not be, but I don't know that that's actually a useful
> capability or a good reason to use one of those bits.  Having it handled
> on a cluster level, at initdb time through pg_control, seems like it'd
> work just fine.
>
>
Yes, I realized that for cluster-level encryption, it would be unwise to
flag a single page (unless we want to do it at the relation level). Forgive me
for not describing it clearly: by 'modify the page' I meant the method
you mentioned, not modifying the C structure. My original motivation was to
avoid storing data in an unconventional format without a description in the C
structure. However, as I just said, it seems that we should not set
the flag for a single page. Maybe it's enough to just add a descriptive
comment?


Re: Moving other hex functions to /common

2021-01-12 Thread Michael Paquier
On Tue, Jan 12, 2021 at 01:13:00PM -0500, Bruce Momjian wrote:
> Thanks for your work on this.  Looks good.

I have been looking again at this patch again for a couple of hours
this morning to double-check if I have not missed anything, and I
think that we should be in good shape.  This still needs a pgindent
run but I'll take care of it.  Let's wait a bit and see if others have
any comments or objections.  If there is nothing, I'll commit what we
have here.

> I posted your patch under my key management patch and the cfbot
> reports all green:
> 
>   http://cfbot.cputube.org/index.html
> 
> The key management patch calls the src/common hex functions from
> src/backend/crypto, pg_alterckey, and the crypto tests, and these are
> all tested by make check-world.

Ah, OK.  Nice.
--
Michael


signature.asc
Description: PGP signature


Re: [Patch] Optimize dropping of relation buffers using dlist

2021-01-12 Thread Kyotaro Horiguchi
At Tue, 12 Jan 2021 08:49:53 +0530, Amit Kapila  wrote 
in 
> On Fri, Jan 8, 2021 at 7:03 AM Kyotaro Horiguchi
>  wrote:
> >
> > At Thu, 7 Jan 2021 09:25:22 +, "k.jami...@fujitsu.com" 
> >  wrote in:
> > > > Thanks for the detailed tests. NBuffers/32 seems like an appropriate
> > > > value for the threshold based on these results. I would like to
> > > > slightly modify part of the commit message in the first patch as below
> > > > [1], otherwise, I am fine with the changes. Unless you or anyone else
> > > > has any more comments, I am planning to push the 0001 and 0002
> > > > sometime next week.
> > > >
> > > > [1]
> > > > "The recovery path of DropRelFileNodeBuffers() is optimized so that
> > > > scanning of the whole buffer pool can be avoided when the number of
> > > > blocks to be truncated in a relation is below a certain threshold. For
> > > > such cases, we find the buffers by doing lookups in BufMapping table.
> > > > This improves the performance by more than 100 times in many cases
> > > > when several small tables (tested with 1000 relations) are truncated
> > > > and where the server is configured with a large value of shared
> > > > buffers (greater than 100GB)."
> > >
> > > Thank you for taking a look at the results of the tests. And it's also
> > > consistent with the results from Tang too.
> > > The commit message LGTM.
> >
> > +1.
> >
> 
> I have pushed the 0001.

Thank you for committing this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Disable WAL logging to speed up data loading

2021-01-12 Thread Kyotaro Horiguchi
At Tue, 12 Jan 2021 07:09:28 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> On Tuesday, January 12, 2021 12:52 PM Takayuki/綱川 貴之 
>  wrote:
> > From: Osumi, Takamichi/大墨 昂道 
> > > I updated the patch following this discussion, and fixed the
> > > documentation as well.
> > 
> > 
> > + (rmid == RM_GIST_ID && info == RM_GIST_ID) ||
> > + (rmid == RM_GENERIC_ID)))
> > 
> > info is wrong for GiST, and one pair of parenthesis is unnecessary.  The 
> > above
> > would be:
> > 
> > + (rmid == RM_GIST_ID && info ==
> > XLOG_GIST_ASSIGN_LSN) ||
> > + rmid == RM_GENERIC_ID))
> Oops, sorry for this careless mistake.
> Fixed. And again, regression test produces no failure.

@@ -449,6 +450,18 @@ XLogInsert(RmgrId rmid, uint8 info)
return EndPos;
}
 
+   /* Issues only limited types of WAL when wal logging is disabled */
+   if (wal_level == WAL_LEVEL_NONE &&
+   !((rmid == RM_XLOG_ID && info == XLOG_CHECKPOINT_SHUTDOWN) ||
+ (rmid == RM_XLOG_ID && info == XLOG_PARAMETER_CHANGE) ||
+ (rmid == RM_XACT_ID && info == XLOG_XACT_PREPARE) ||
+ (rmid == RM_GIST_ID && info == XLOG_GIST_ASSIGN_LSN) ||
+ rmid == RM_GENERIC_ID))

As Sawada-san mentioned, this prevents future index AMs from being
pluggable.  Providing an interface would work but seems a bit too
invasive.  The record insertion flags are there for this very purpose.

XLogBeginInsert();
XLogSetRecordFlags(XLOG_MARK_ESSENTIAL);  /* new flag value */
XLogInsert();
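
Just to sketch what the check inside XLogInsert() might then look like (an
illustration only, not a patch: XLOG_MARK_ESSENTIAL is the proposed new flag,
curinsert_flags is the variable XLogSetRecordFlags() already updates, and the
skip path simply mirrors the existing bootstrap-mode shortcut):

	if (wal_level == WAL_LEVEL_NONE &&
		(curinsert_flags & XLOG_MARK_ESSENTIAL) == 0)
	{
		/* not an essential record, so skip it entirely */
		XLogResetInsertion();
		EndPos = SizeOfXLogLongPHD;		/* return a phony record pointer */
		return EndPos;
	}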

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Single transaction in the tablesync worker?

2021-01-12 Thread Hou, Zhijie
> Also PSA some detailed logging evidence of some test scenarios involving
> Drop/AlterSubscription:
> + Test-20210112-AlterSubscriptionRefresh-ok.txt =
> AlterSubscription_refresh which successfully drops a tablesync slot
> + Test-20210112-AlterSubscriptionRefresh-warning.txt =
> AlterSubscription_refresh gives WARNING that it cannot drop the tablesync
> slot (which no longer exists)
> + Test-20210112-DropSubscription-warning.txt = DropSubscription with a
> disassociated slot_name gives a WARNING that it cannot drop the tablesync
> slot (due to broken connection)

Hi

> * The AlterSubscription_refresh (v14+) is now more similar to 
> DropSubscription w.r.t. stopping workers for any "removed" tables.
I have an issue about the above feature.

With the patch, it seems it does not stop the worker in the case of [1].
I probably missed something; should we stop the worker in such a case?

[1] 
https://www.postgresql.org/message-id/CALj2ACV%2B0UFpcZs5czYgBpujM9p0Hg1qdOZai_43OU7bqHU_xw%40mail.gmail.com

Best regards,
houzj





RE: ResourceOwner refactoring

2021-01-12 Thread kuroda.hay...@fujitsu.com
Dear Heikki, 

I'm also interested in this patch, but it cannot be applied to the current 
HEAD...

$ git apply ~/v2-0001-Make-resowners-more-easily-extensible.patch
error: patch failed: src/common/cryptohash_openssl.c:57
error: src/common/cryptohash_openssl.c: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:1
error: src/include/utils/resowner_private.h: patch does not apply

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: A failure of standby to follow timeline switch

2021-01-12 Thread Kyotaro Horiguchi
At Tue, 12 Jan 2021 10:47:21 +0900, Fujii Masao  wrote 
in 
> On Sat, Jan 9, 2021 at 5:08 AM Alvaro Herrera  wrote:
> >
> > Masao-san: Are you intending to act as committer for these?  Since the
> > bug is mine I can look into it, but since you already did all the
> > reviewing work, I'm good with you giving it the final push.
> 
> Thanks! I'm thinking to push the patch.
> 
> 
> > 0001 looks good to me; let's get that one committed quickly so that we
> > can focus on the interesting stuff.  While the implementation of
> > find_in_log is quite dumb (not this patch's fault), it seems sufficient
> > to deal with small log files.  We can improve the implementation later,
> > if needed, but we have to get the API right on the first try.
> >
> > 0003: The fix looks good to me.  I verified that the test fails without
> > the fix, and it passes with the fix.
> 
> Yes.
> 
> 
> > The test added in 0002 is a bit optimistic regarding timing, as well as
> > potentially slow; it loops 1000 times and sleeps 100 milliseconds each
> > time.  In a very slow server (valgrind or clobber_cache animals) this
> > could not be sufficient time, while on fast servers it may end up
> > waiting longer than needed.  Maybe we can do something like this:
> 
> On second thought, I think that the regression test should be in
> 004_timeline_switch.pl instead of 001_stream_rep.pl because it's

Agreed. It's definitely the right place.

> the test about timeline switch. Also I'm thinking that it's better to
> test the timeline switch by checking whether some data is successfully
> replicated like the existing regression test for timeline switch in
> 004_timeline_switch.pl does, instead of finding the specific message
> in the log file. I attached the POC patch. Thought?

It's practically a check on this issue, and looks better. The 180s
timeout in the failure case seems a bit annoying, but that's the way all
tests of this kind work.

The last check on table content is actually useless, but it might make
sense to confirm that replication is actually working. However, I
don't think the test needs to insert as many as 1000 tuples. Just
a single tuple would suffice.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: outdated references to replication timeout

2021-01-12 Thread Fujii Masao
On Wed, Jan 13, 2021 at 5:39 AM John Naylor
 wrote:
>
> Hi,
>
> The parameter replication_timeout was retired in commit 6f60fdd701 back in 
> 2012, but some comments and error messages seem to refer to that old setting 
> instead of wal_sender_timeout or wal_receiver_timeout. The attached patch 
> replaces the old language with more specific references.

Thanks for the patch! I think this change makes sense.

-   (errmsg("terminating walsender process
due to replication timeout")));
+   (errmsg("terminating walsender process
due to WAL sender timeout")));

Isn't it a bit strange to include different expressions "walsender" and
"WAL sender" for the same thing in one message?


This is a bit related, but different topic, though. If we change the above
message about walsender timeout, I also want to change the message about
walreceiver timeout, so as to make them more consistent. For example,

- (errmsg("terminating walreceiver due to timeout")));
+ (errmsg("terminating WAL receiver process due to WAL receiver timeout")));

Regards,

-- 
Fujii Masao




Re: Fix a typo in SearchCatCache function comment

2021-01-12 Thread Michael Paquier
On Tue, Jan 12, 2021 at 04:21:18PM +0900, Michael Paquier wrote:
> Good catch.  I'll go fix it tomorrow if nobody objects.

And applied.  Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: pg_preadv() and pg_pwritev()

2021-01-12 Thread Thomas Munro
On Mon, Jan 11, 2021 at 3:59 PM Thomas Munro  wrote:
> On Mon, Jan 11, 2021 at 3:34 PM Thomas Munro  wrote:
> > I pushed it with that name, and a couple more cosmetic changes.  I'll
> > keep an eye on the build farm.
>
> Since only sifaka has managed to return a result so far (nice CPU), I
> had plenty of time to notice that macOS Big Sur has introduced
> preadv/pwritev.  They were missing on Catalina[1].

The rest of buildfarm was OK with it too, but I learned of a small
problem through CI testing of another patch: it's not OK for
src/port/pwrite.c to do this:

if (part > 0)
elog(ERROR, "unexpectedly wrote more than requested");

... because now when I try to use pg_pwrite() in pg_test_fsync,
Windows fails to link:

libpgport.lib(pwrite.obj) : error LNK2019: unresolved external symbol
errstart referenced in function pg_pwritev_with_retry
[C:\projects\postgresql\pg_test_fsync.vcxproj]

I'll go and replace that with an assertion.




Re: pg_dump INDEX ATTACH versus --clean option

2021-01-12 Thread Alvaro Herrera
On 2021-Jan-12, Tom Lane wrote:

> I think actually the cleanest fix would be to invent ALTER INDEX DETACH
> PARTITION and use that as the dropStmt for the INDEX ATTACH object.
> No idea how painful that would be to do, though.  I suppose it'd involve
> reverting the parent index back to an invalid state.

Right.  The initially submitted patch did have DETACH, and on review we were kinda
happy that we were able to remove that and avoid indexes that can revert from
a valid to an invalid state.  I don't recall the precise reason, but it can probably
be found in the archives ... perhaps starting at
https://postgr.es/m/flat/cakjs1f9g6hnahjpolahavrkt0upyyzncei2rq__klqcrge_...@mail.gmail.com

As far as the code goes, DETACH was already in some version older than
what got committed; I suppose we could easily crib stuff from there.
It had a new alter table subcommand, so it'd not be a backpatchable fix
in that way; we'd need some different parse node representation, I
think.  One problem that was definitely not solved, is that in
multi-level partitioning setups, we would have to lock relations from
the top down.

-- 
Álvaro Herrera   Valdivia, Chile




Re: WIP: System Versioned Temporal Table

2021-01-12 Thread Ryan Lambert
On Mon, Jan 11, 2021 at 7:02 AM Simon Riggs 
wrote:

> On Sat, Jan 9, 2021 at 10:39 AM Simon Riggs
>  wrote:
> >
> > On Fri, Jan 8, 2021 at 9:19 PM Ryan Lambert 
> wrote:
> >
> > >> Updated v11 with additional docs and some rewording of messages/tests
> > >> to use "system versioning" correctly.
> > >>
> > >> No changes on the points previously raised.
> > >>
> > > Thank you!  The v11 applies and installs.  I tried a simple test,
> unfortunately it appears the versioning is not working. The initial value
> is not preserved through an update and a new row does not appear to be
> created.
> >
> > Agreed. I already noted this in my earlier review comments.
>
> I'm pleased to note that UPDATE-not-working was a glitch, possibly in
> an earlier patch merge. That now works as advertised.
>

It is working as expected now, Thank you!


> I've added fairly clear SGML docs to explain how the current patch
> works, which should assist wider review.
>

The default column names changed to start_timestamp and end_timestamp.  A
number of places in the docs still refer to StartTime and EndTime.  I
prefer the new names without MixedCase.


>
> Also moved test SQL around a bit, renamed some things in code for
> readability, but not done any structural changes.
>
> This is looking much better now... with the TODO/issues list now
> looking like this...
>
> * Anomalies around use of CURRENT_TIMESTAMP are not discussed or resolved.
> Probably need to add a test that end_timestamp > start_timestamp or ERROR,
> which effectively enforces serializability.
>
> * No discussion, comments or tests around freezing and whether that
> causes issues here
>
> * What happens if you ask for a future time?
> It will give an inconsistent result as it scans, so we should refuse a
> query for time > current_timestamp.

> * ALTER TABLE needs some work, it's a bit klugey at the moment and
> needs extra tests.
> Should refuse DROP COLUMN on a system time column, but currently doesn't
>
> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently
> doesn't
>
> * Do StartTime and EndTime show in SELECT *? Currently, yes. Would
> guess we wouldn't want them to, not sure what standard says.
>
> From here, the plan would be to set this to "Ready For Committer" in
> about a week. That is not the same thing as me saying it is
> ready-for-commit, but we need some more eyes on this patch to decide
> if it is something we want and, if so, are the code changes cool.
>
>
Should I invest time now into further testing with more production-like
scenarios on this patch?  Or would it be better to wait on putting effort
into that until it has had more review?  I don't have much to offer for
help on your current todo list.

Thanks,

Ryan Lambert


Re: list of extended statistics on psql

2021-01-12 Thread Tatsuro Yamada

Hi Tomas,

On 2021/01/12 20:08, Tomas Vondra wrote:


On 1/12/21 2:57 AM, Tatsuro Yamada wrote:

Hi Tomas,

On 2021/01/09 9:01, Tomas Vondra wrote:

...>

While working on that, I realized that 'defined' might be a bit
ambiguous, I initially thought it means 'NOT NULL' (which it does not).
I propose to change it to 'requested' instead. Tatsuro, do you agree, or
do you think 'defined' is better?


Regarding the status of extended stats, I think the following:

  - "defined": it only shows that the extended stats are defined. We can't know
    whether they still need to be analyzed or not. I agree this name was
    ambiguous, so we should replace it with a more suitable
    name.
  - "requested": it shows that the extended stats need something. Of course,
    we know it needs ANALYZE because we created the patch.
    However, I feel there is a little ambiguity for a DBA.
    To solve this, it would be better to write an explanation of
    the status in the documentation. For example,

==
The column for each kind of extended stats (e.g. Ndistinct) shows its status.
"requested" means that it needs to gather data by ANALYZE. "built" means ANALYZE
was finished, and the planner can use it. NULL means that it doesn't exist.
==

What do you think? :-D



Yes, that seems reasonable to me. Will you provide an updated patch?



Sounds good. I'll send the updated patch today.


Thanks,
Tatsuro Yamada







Re: pgbench and timestamps (bounced)

2021-01-12 Thread Tom Lane
Fabien COELHO  writes:
>> Hi, this entry is "Waiting on Author" and the thread was inactive for a 
>> while. I see this discussion still has some open questions. Are you 
>> going to continue working on it, or should I mark it as "returned with 
>> feedback" until a better time?

> IMHO the proposed fix is reasonable and addresses the issue.
> I have responded to Tom's remarks about it, and it is waiting for his 
> answer to my counter arguments. So ISTM that the patch is currently still 
> waiting for some feedback.

It looks like my unhappiness with injecting a pthread dependency into
pgbench is going to be overtaken by events in the "option delaying
queries" thread [1].  However, by the same token there are some conflicts
between the two patchsets, and also I prefer the other thread's approach
to portability (i.e. do it honestly, not with a private portability layer
in pgbench.c).  So I'm inclined to put the parts of this patch that
require pthreads on hold till that lands.

What remains that we could do now, and perhaps back-patch, is point (2)
i.e. disallow digits as the first character of a pgbench variable name.
That would be enough to "solve" the original bug report, and it does seem
like it could be back-patched, while we're certainly not going to risk
back-patching anything as portability-fraught as introducing mutexes.
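
For reference, the point (2) change could be as small as teaching pgbench's
variable-name validation to reject a leading digit.  A rough sketch, assuming
the existing valid_variable_name() helper (the real patch may differ in
detail, e.g. in how non-ASCII characters are treated):

	static bool
	valid_variable_name(const char *name)
	{
		const unsigned char *ptr = (const unsigned char *) name;

		/* new rule: an empty name or a leading digit is not allowed */
		if (*ptr == '\0' || isdigit(*ptr))
			return false;

		while (*ptr)
		{
			if (!isalnum(*ptr) && *ptr != '_')
				return false;
			ptr++;
		}
		return true;
	}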

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/20200227180100.zyvjwzcpiokfsqm2%40alap3.anarazel.de




Re: pg_dump INDEX ATTACH versus --clean option

2021-01-12 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Jan-12, Tom Lane wrote:
>> Since there's no ALTER INDEX DETACH PARTITION, it's not entirely
>> clear what to do about this.  We could possibly not emit any
>> dropStmt for partition child indexes, but that seems very likely
>> to cause problems for partial-restore scenarios.

> Yeah, it would break the case of restoring a partition that already
> exists under --clean.  (Of course, if the partition doesn't exist
> already, there's no problem, since nothing is to be dropped anyway.)

> About the only thing I can think of, is to make the dropStmt use a
> plpgsql DO block that drops conditionally (only if not an index
> partition).

Don't much like that :-(.  Aside from the fact that we'd then be
requiring plpgsql to exist to do a restore, I think this would
cause big compatibility problems in the future, since the DO block
would have to do some more-or-less-weird catalog query to find out
if the index is a partition.  We'd be locked into guaranteeing that
that query works, pretty much forever.

I think actually the cleanest fix would be to invent ALTER INDEX DETACH
PARTITION and use that as the dropStmt for the INDEX ATTACH object.
No idea how painful that would be to do, though.  I suppose it'd involve
reverting the parent index back to an invalid state.

regards, tom lane




Re: pg_upgrade test for binary compatibility of core data types

2021-01-12 Thread Andrew Dunstan


On 1/12/21 12:53 PM, Bruce Momjian wrote:
> On Tue, Jan 12, 2021 at 11:27:53AM -0600, Justin Pryzby wrote:
>> On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
>>> Uh, what exactly is missing from the beta checklist?  I read the patch
>>> and commit message but don't understand it.
>> Did you try to use test.sh to upgrade from a prior release ?
>>
>> Evidently it's frequently forgotten, as evidenced by all the "deferred
>> maintenance" I had to do to allow testing the main patch (currently 0003).
>>
>> See also:
>>
>> commit 5bab1985dfc25eecf4b098145789955c0b246160
>> Author: Tom Lane 
>> Date:   Thu Jun 8 13:48:27 2017 -0400
>>
>> Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
>> 
>> Doing a cross-version upgrade test with test.sh evidently hasn't been
>> tested since circa 9.2, because the script lacked case branches for
>> old-version servers newer than 9.1.  Future-proof that a bit, and
>> clean up breakage induced by our recent drop of V0 function call
>> protocol (namely that oldstyle_length() isn't in the regression
>> suite anymore).
> Oh, that is odd.  I thought that was regularly run.  I have my own test
> infrastructure that I run for every major release so I never have run
> the built-in one, except for make check-world.
>

Cross version pg_upgrade is tested regularly in the buildfarm, but not
using test.sh. Instead it uses the saved data repository from a previous
run of the buildfarm client for the source branch, and tries to upgrade
that to the target branch.


cheers


andrew



--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: pg_dump INDEX ATTACH versus --clean option

2021-01-12 Thread Alvaro Herrera
On 2021-Jan-12, Tom Lane wrote:

> then "pg_dump -n s1 -c mydb >mydb.dump" will emit
> 
> ALTER TABLE ONLY s1.at12 DROP CONSTRAINT at12_pkey;
> ALTER TABLE ONLY s1.at11 DROP CONSTRAINT at11_pkey;
> ALTER TABLE ONLY s1.at1 DROP CONSTRAINT at1_pkey;
> DROP TABLE s1.at12;
> DROP TABLE s1.at11;
> DROP TABLE s1.at1;
> DROP SCHEMA s1;
> ... then create the objects ...
> 
> which naturally results in
> 
> psql:mydb.dump:19: ERROR:  cannot drop inherited constraint "at12_pkey" of 
> relation "at12"
> psql:mydb.dump:20: ERROR:  cannot drop inherited constraint "at11_pkey" of 
> relation "at11"

> That's not really okay, since it'd break a single-transaction
> restore.

Hmm.  You complained about a related case in
3170626.1594842...@sss.pgh.pa.us and I posted a patch:
https://www.postgresql.org/message-id/20200812214918.GA30353@alvherre.pgsql

I suggested there to make the dropStmt empty, but ended up not pushing
that patch.  That would solve this problem also.

> Since there's no ALTER INDEX DETACH PARTITION, it's not entirely
> clear what to do about this.  We could possibly not emit any
> dropStmt for partition child indexes, but that seems very likely
> to cause problems for partial-restore scenarios.

Yeah, it would break the case of restoring a partition that already
exists under --clean.  (Of course, if the partition doesn't exist
already, there's no problem, since nothing is to be dropped anyway.)

About the only thing I can think of, is to make the dropStmt use a
plpgsql DO block that drops conditionally (only if not an index
partition).

-- 
Álvaro Herrera                39°49'30"S 73°17'W




outdated references to replication timeout

2021-01-12 Thread John Naylor
Hi,

The parameter replication_timeout was retired in commit 6f60fdd701 back in
2012, but some comments and error messages seem to refer to that old
setting instead of wal_sender_timeout or wal_receiver_timeout. The attached
patch replaces the old language with more specific references.

-- 
John Naylor
EDB: http://www.enterprisedb.com


stray-repl-timeout.patch
Description: Binary data


Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-12 Thread Álvaro Herrera
On 2021-Jan-12, Álvaro Herrera wrote:

> > For the 0001 patch, since ReindexIndexInfo is used only within
> > ReindexRelationConcurrently() I think it’s a function-local structure
> > type. So we can declare it within the function. What do you think?
> 
> That's a good idea.  Pushed with that change, thanks.

Here's the other patch, with comments fixed per reviews, and rebased to
account for the scope change.

-- 
Álvaro Herrera   Valdivia, Chile
Tulio: Oh, what could this button be for, Juan Carlos?
Policarpo: No, stay back, don't touch the console!
Juan Carlos: I'll press it again and again.
>From f2476636a358b77aca4b7adb36ca8e383e78af9b Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 30 Nov 2020 16:01:46 -0300
Subject: [PATCH v2] set PROC_IN_SAFE_IC during REINDEX CONCURRENTLY

---
 src/backend/commands/indexcmds.c | 56 +---
 src/include/storage/proc.h   |  1 +
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8c9c39a467..35c3d20eae 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -385,7 +385,7 @@ CompareOpclassOptions(Datum *opts1, Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY
+ * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
  * on indexes that are neither expressional nor partial are also safe to
  * ignore, since we know that those processes won't examine any data
  * outside the table they're indexing.
@@ -1566,9 +1566,11 @@ DefineIndex(Oid relationId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+	/*
+	 * This transaction doesn't need to set the PROC_IN_SAFE_IC flag, because
+	 * it only acquires an Xid to do some catalog manipulations, after the
+	 * wait is over.
+	 */
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -3066,6 +3068,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		Oid			indexId;
 		Oid			tableId;
 		Oid			amId;
+		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3377,6 +3380,9 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		heapRel = table_open(indexRel->rd_index->indrelid,
 			 ShareUpdateExclusiveLock);
 
+		/* determine safety of this index for set_indexsafe_procflags */
+		idx->safe = (indexRel->rd_indexprs == NIL &&
+	 indexRel->rd_indpred == NIL);
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -3418,6 +3424,7 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 
 		newidx = palloc(sizeof(ReindexIndexInfo));
 		newidx->indexId = newIndexId;
+		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -3485,6 +3492,11 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 */
+
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -3514,6 +3526,10 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 */
 		CHECK_FOR_INTERRUPTS();
 
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
 		/* Set ActiveSnapshot since functions in the indexes may need it */
 		PushActiveSnapshot(GetTransactionSnapshot());
 
@@ -3534,8 +3550,14 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
+
 	StartTransactionCommand();
 
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 */
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -3564,6 +3586,10 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 */
 		CHECK_FOR_INTERRUPTS();
 
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
 		/*
 		 * Take the "reference snapshot" that will be used by validate_index()
 		 * to filter candidate tuples.
@@ -3607,6 +3633,9 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the reference snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
+		 *
+		 * Because we don't take a snapshot in this transaction, there's no
+		 * need to set the 

Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-12 Thread Álvaro Herrera
On 2021-Jan-04, Masahiko Sawada wrote:

> On Tue, Dec 8, 2020 at 5:18 PM Hamid Akhtar  wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world:  tested, passed
> > Implements feature:   not tested
> > Spec compliant:   not tested
> > Documentation:not tested
> >
> > The patch looks good to me. With regards to the code comments, I had pretty 
> > similar concerns as already raised by Dmitry; and those have already been 
> > answered by you. So this patch is good to go from my side once you have 
> > updated the comments per your last email.
> 
> For the 0001 patch, since ReindexIndexInfo is used only within
> ReindexRelationConcurrently() I think it’s a function-local structure
> type. So we can declare it within the function. What do you think?

That's a good idea.  Pushed with that change, thanks.

Thanks also to Dmitry and Hamid for the reviews.

-- 
Álvaro Herrera                39°49'30"S 73°17'W




Re: [DOC] Document concurrent index builds waiting on each other

2021-01-12 Thread Alvaro Herrera
On 2020-Dec-01, James Coleman wrote:

> On Tue, Dec 1, 2020 at 6:51 PM Alvaro Herrera  wrote:

> > Makes sense.  ISTM that if we want to have a cautionary blurb in the CIC docs,
> > it should go in REINDEX CONCURRENTLY as well.
> 
> Agreed. Or, alternatively, a blurb something like "Please note how CIC
> interacts with VACUUM ...", and then the primary language in
> maintenance.sgml. That would have the benefit of maintaining the core
> language in only one place.

I looked into this again, and I didn't like what I had added to
maintenance.sgml at all.  It seems out of place where I put it; and I
couldn't find any great spots.  Going back to your original proposal,
what about something like this?  It's just one more para in the "notes"
section in CREATE INDEX and REINDEX pages, without any additions to the
VACUUM pages.

-- 
Álvaro Herrera                39°49'30"S 73°17'W
>From 8b0aa591f5d3f34ccdc2cd0fc6031dd421f229c2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 30 Nov 2020 18:50:06 -0300
Subject: [PATCH v6] Highlight vacuum consideration in create index docs

Per James Coleman
---
 doc/src/sgml/ref/create_index.sgml | 5 +
 doc/src/sgml/ref/reindex.sgml  | 5 +
 2 files changed, 10 insertions(+)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 2054d5d943..d951f14b09 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -829,6 +829,11 @@ Indexes:
to remove an index.
   
 
+  
+   Like any long-running transaction, CREATE INDEX can
+   affect which tuples can be removed by concurrent VACUUM.
+  
+
   
Prior releases of PostgreSQL also had an
R-tree index method.  This method has been removed because
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6e1cf06713..ef553f6481 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -436,6 +436,11 @@ Indexes:
 CONCURRENTLY cannot.

 
+   
+Like any long-running transaction, REINDEX can
+affect which tuples can be removed by concurrent VACUUM.
+   
+

 REINDEX SYSTEM does not support
 CONCURRENTLY since system catalogs cannot be reindexed
-- 
2.20.1



pg_dump INDEX ATTACH versus --clean option

2021-01-12 Thread Tom Lane
I noticed that pg_dump --clean does not work with partitioned
indexes.  Given for instance

create schema s1;
create table s1.at1 (f1 int, f2 int, primary key(f1,f2)) partition by list(f1);
create table s1.at11 partition of s1.at1 for values in(11);
create table s1.at12 partition of s1.at1 for values in(12);

then "pg_dump -n s1 -c mydb >mydb.dump" will emit

ALTER TABLE ONLY s1.at12 DROP CONSTRAINT at12_pkey;
ALTER TABLE ONLY s1.at11 DROP CONSTRAINT at11_pkey;
ALTER TABLE ONLY s1.at1 DROP CONSTRAINT at1_pkey;
DROP TABLE s1.at12;
DROP TABLE s1.at11;
DROP TABLE s1.at1;
DROP SCHEMA s1;
... then create the objects ...

which naturally results in

psql:mydb.dump:19: ERROR:  cannot drop inherited constraint "at12_pkey" of 
relation "at12"
psql:mydb.dump:20: ERROR:  cannot drop inherited constraint "at11_pkey" of 
relation "at11"
ALTER TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP SCHEMA

That's not really okay, since it'd break a single-transaction
restore.

Since there's no ALTER INDEX DETACH PARTITION, it's not entirely
clear what to do about this.  We could possibly not emit any
dropStmt for partition child indexes, but that seems very likely
to cause problems for partial-restore scenarios.

regards, tom lane




Re: Key management with tests

2021-01-12 Thread Andres Freund
Hi,

On 2021-01-11 20:12:00 +0900, Masahiko Sawada wrote:

> diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
> index 32b5d62e1f..d474af753c 100644
> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -177,6 +177,7 @@ blbuildempty(Relation index)
>* XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
>* this even when wal_level=minimal.
>*/
> + PageEncryptInplace(metapage, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO);
>   PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
>   smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> (char *) metapage, true);

There are quite a few places doing encryption + checksum + smgrwrite now. I
strongly suggest splitting that off into a helper routine in a
preparatory patch.
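
Something along these lines, perhaps (a sketch only -- the helper's name and
signature are made up here; PageEncryptInplace() is the function added by the
patch, the other calls are existing routines):

static void
PageEncryptChecksumWrite(SMgrRelation reln, ForkNumber forknum,
						 BlockNumber blkno, Page page, bool skipFsync)
{
	PageEncryptInplace(page, forknum, blkno);
	PageSetChecksumInplace(page, blkno);
	smgrwrite(reln, forknum, blkno, (char *) page, skipFsync);
}

Callers like blbuildempty() would then make one call instead of repeating the
three steps.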


> @@ -528,6 +529,8 @@ BootstrapModeMain(void)
>  
>   InitPostgres(NULL, InvalidOid, NULL, InvalidOid, NULL, false);
>  
> + InitializeBufferEncryption();
> +
>   /* Initialize stuff for bootstrap-file processing */
>   for (i = 0; i < MAXATTR; i++)
>   {

Why are we initializing this here instead of in the postmaster? As far as I can
tell that just leads to redundant work instead of doing it once?


> +/*-
> + * We use both page LSN and page number to create a nonce for each page. Page
> + * LSN is 8 byte, page number is 4 byte, and the maximum required counter for
> + * AES-CTR is 2048, which fits in 3 byte. Since the length of IV is 16 byte
> + * it's fine. Using the LSN and page number as part of the nonce has
> + * three benefits:
> + *
> + * 1. We don't need to decrypt/re-encrypt during CREATE DATABASE since the 
> page
> + * contents are the same in both places, and once one database changes its 
> pages,
> + * it gets a new LSN, and hence a new nonce.
> + * 2. For each change of an 8k page, we get a new nonce, so we are not 
> encrypting
> + * different data with the same nonce/IV.
> + * 3. We avoid requiring pg_upgrade to preserve database oids, tablespace 
> oids,
> + * relfilenodes.

I think 3) also has a few minor downsides - by not including information
identifying the relation, a potential attacker with access to the data
directory has more chances to get the database to decrypt data by,
e.g., switching relation files around.



> @@ -2792,12 +2793,15 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>*/
>   bufBlock = BufHdrGetBlock(buf);
>  
> + bufToWrite = PageEncryptCopy((Page) bufBlock, buf->tag.forkNum,
> +  buf->tag.blockNum);
> +
>   /*
>* Update page checksum if desired.  Since we have only shared lock on 
> the
>* buffer, other processes might be updating hint bits in it, so we must
>* copy the page to private storage if we do checksumming.
>*/
> - bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
> + bufToWrite = PageSetChecksumCopy((Page) bufToWrite, buf->tag.blockNum);
>  
>   if (track_io_timing)
>   INSTR_TIME_SET_CURRENT(io_start);

So now we copy the page twice, not just once, if both checksums and
encryption are enabled? That doesn't seem right.
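
One possible shape for avoiding the double copy (a sketch only, names made up;
it folds the PageSetChecksumCopy() pattern together with the patch's
PageEncryptInplace() so the shared buffer is copied into private storage
exactly once):

static char *
PageEncryptChecksumCopy(Page page, ForkNumber forknum, BlockNumber blkno)
{
	static char *pageCopy = NULL;

	if (pageCopy == NULL)
		pageCopy = MemoryContextAlloc(TopMemoryContext, BLCKSZ);

	memcpy(pageCopy, (char *) page, BLCKSZ);	/* single private copy */
	PageEncryptInplace((Page) pageCopy, forknum, blkno);
	PageSetChecksumInplace((Page) pageCopy, blkno);
	return pageCopy;
}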


> @@ -3677,6 +3683,21 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
>   {
>   dirtied = true; /* Means "will be dirtied by 
> this action" */
>  
> + /*
> +  * We will dirty the page but the page lsn is not 
> changed if we
> +  * doesn't write a backup block. We don't want to 
> encrypt the
> +  * different bits stream with the same combination of 
> nonce and key
> +  * since in buffer encryption the page lsn is a part of 
> nonce.
> +  * Therefore we WAL-log no-op record just to move page 
> lsn forward if
> +  * we doesn't write a backup block, even when this is 
> not the first
> +  * modification in this checkpoint round.
> +  */
> + if (XLogRecPtrIsInvalid(lsn) && DataEncryptionEnabled())
> + {
> + lsn = log_noop();
> + Assert(!XLogRecPtrIsInvalid(lsn));
> + }
> +

Aren't you doing a WAL record while holding the buffer header lock here?
You can't do things like WAL insertions while holding a spinlock.


I don't see how it is safe / correct to use a noop record here. A noop
record isn't associated with the page, so WAL replay isn't going to
perform the same LSN modification.

Also, why is it OK to modify the LSN without, if necessary, logging an FPI?



> +char *
> +PageEncryptCopy(Page page, ForkNumber forknum, BlockNumber blkno)
> +{
> + static char *pageCopy = NULL;
> +
> + /* If we don't need a checksum, just return the 

Re: Key management with tests

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 01:57:11PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (br...@momjian.us) wrote:
> > On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> > > * Bruce Momjian (br...@momjian.us) wrote:
> > > > Well, we have eight unused bits in the IV, so we could just increment
> > > > that for every hint bit change that uses the same LSN, and then force a
> > > > dummy WAL record when that 8-bit counter overflows --- that seems
> > > > simpler than logging hint bits.
> > > 
> > > Sure, as long as we have a place to store that information..  We need to
> > > have the full IV available when we go to decrypt the page.
> > 
> > Oh, yeah, we would need that counter recorded since previously the IV
> > was made up of already-recorded information, i.e., the page LSN and page
> > number.  However, the reason we don't always WAL-log hint bits is because
> > we can afford to lose them, but in this case, any counter we need to
> > store will need to be WAL logged since we can't afford to lose that
> > counter value for decryption --- that gets us back to WAL-logging
> > something during hint bit changes.  :-(
> 
> I don't think that's actually the case..?  The hole I'm talking about is
> there exclusively for post-encryption storage of the tag and maybe this
> part of the IV and would be zero'd out in the FPIs that actually go into
> the WAL (which would be encrypted with the WAL key, not the data key).
> All we would need to be confident of is that if the page with the hint
> bit update gets encrypted and written out that the IV counter gets
> incremented and also written out as part of that write.

OK, got it.  I have added this to the wiki:


https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements

wal_log_hints will be enabled automatically in encryption mode. However,
more than one hint bit change between checkpoints does not cause WAL
activity, which would cause the same LSN to be used for different page
images. This means we need a page-stored counter, to be used in the four
unused bytes of the IV. This prevents multiple page writes during the
same checkpoint interval from using the same IV. Counter changes do not
need to be WAL logged since we either get the page from the WAL (which
is only encrypted with the WAL data key), or from disk, which is
durable. 

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-12 Thread Tomas Vondra
Thanks. These patches seem to resolve the TOAST table issue, freezing it 
as expected. I think the code duplication is not an issue, but I wonder 
why heap_insert uses this condition:


/*
 * ...
 *
 * No need to update the visibilitymap if it had all_frozen bit set
 * before this insertion.
 */
if (all_frozen_set && ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0))

while heap_multi_insert only does this:

if (all_frozen_set) { ... }

I haven't looked at the details, but shouldn't both do the same thing?

I've done some benchmarks, comparing master and patched version on a 
bunch of combinations (logged/unlogged, no before-insert trigger, 
trigger filtering everything/nothing). On master, the results are:


group copy   vacuum
---
logged / no trigger   4672  162
logged / trigger (all)4914  162
logged / trigger (none)   1219   11
unlogged / no trigger 4132  156
unlogged / trigger (all)  4292  156
unlogged / trigger (none) 1275   11

and patched looks like this:

group copy   vacuum
---
logged / no trigger   4669   12
logged / trigger (all)4874   12
logged / trigger (none)   1233   11
unlogged / no trigger 4054   11
unlogged / trigger (all)  4185   12
unlogged / trigger (none) 1222   10

This looks pretty nice - there are no regressions, just speedups in the 
vacuum step. The SQL script used is attached.


However, I've also repeated the test counting all-frozen pages in both 
the main table and TOAST table, and I get this:



master
==

select count(*) from pg_visibility((select reltoastrelid from pg_class 
where relname = 't'));


 count 
-------
    10
(1 row)


select count(*) from pg_visibility((select reltoastrelid from pg_class 
where relname = 't')) where not all_visible;


 count 
-------
    10
(1 row)


patched
===

select count(*) from pg_visibility((select reltoastrelid from pg_class 
where relname = 't'));


 count 
-------
    12
(1 row)


select count(*) from pg_visibility((select reltoastrelid from pg_class 
where relname = 't')) where not all_visible;


 count 
-------
     0
(1 row)

That is - all TOAST pages are frozen (as expected, which is good). But 
now there are 12 pages, not just 10 pages. That is, we're now 
creating 2 extra pages, for some reason. I recall Pavan reported a similar 
issue with every 32768th page not being properly filled, but I'm not 
sure if that's the same issue.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


test.sql
Description: application/sql


bench.sql
Description: application/sql


Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-12 Thread Pavel Stehule
On Sun, Jan 10, 2021 at 19:52, Pavel Stehule 
wrote:

> Hi
>
>
>> I'm thinking of the update path as a kind of implicit schema. JSON is
>> intentionally not bound to any schema on creation, so I don't see a
>> failure to enforce another schema at runtime (and outside the WHERE
>> clause, at that) as an error exactly.
>>
>
> This concept is not consistent with other implemented behaviour.
>
> 1. The schema is dynamically enhanced - so although the path doesn't
> exist, it is created and the data are changed
>
> postgres=# create table foo(a jsonb);
> CREATE TABLE
> postgres=# insert into foo values('{}');
> INSERT 0 1
> postgres=# update foo set a['a']['a'][10] = '0';
> UPDATE 1
> postgres=# select * from foo;
> ┌───────────────────────────────────────────────────────────────────────────────┐
> │                                        a                                       │
> ╞═══════════════════════════════════════════════════════════════════════════════╡
> │ {"a": {"a": [null, null, null, null, null, null, null, null, null, null, 0]}}  │
> └───────────────────────────────────────────────────────────────────────────────┘
> (1 row)
>
> So although the path [a,a,10] did not exist, it was created.
>
> 2. this update fails (and it is correct)
>
> postgres=# update foo set a['a']['a']['b'] = '0';
> ERROR:  path element at position 3 is not an integer: "b"
>
> although the path [a,a,b] doesn't exist, it is not ignored.
>
> This implementation doesn't do only UPDATE (and then the analogy with the WHERE
> clause isn't fully adequate). It does MERGE. This is necessary, because
> without it, the behaviour would be pretty unfriendly - because there is no
> external schema. I think this is important - and it can be a little
> bit messy. I am not sure if I am using the correct technical terms - we try a
> LAX update in the first step, and if it is not successful, then we try to do a
> LAX insert. This is maybe correct from the JSON semantics - but for a developer
> it is unfriendly, because he has no way to detect whether the insert was or was
> not successful. In the special JSON functions I can control the behaviour and can
> specify LAX or STRICT as necessary. But in this interface
> (subscripting) this possibility is missing.
>
> I think there should be a final check (semantically) of whether the value was
> updated and whether the value was changed. If not, then an error should be raised.
> It should be very similar to an RLS update. I know and understand that there
> is more than one possible implementation, but only one is safe -
> after a successful update I want to see the new value inside, and when that is
> not possible, then I expect an exception. I think it is more practical too. I
> can control filtering with the WHERE clause. But I cannot control the MERGE
> process. A manual recheck after every update could be terribly slow.
>

I tested the behaviour and I didn't find anything other than the mentioned
issue.

Now I can check this feature from plpgsql, and it is working. Because there
is no special support in the plpgsql runtime, the update of jsonb is
significantly slower than an update of arrays, and it looks like the update of
jsonb has O(N^2) cost. I don't think that is important at this moment - more
important is the fact that I didn't find any memory problems.


postgres=# do $$
declare v int[] = array_fill(0, ARRAY[10,10,10,20]);
begin
  for i1 in 1..10 loop
for i2 in 1..10 loop
  for i3 in 1..10 loop
for i4 in 1..20 loop
  v[i1][i2][i3][i4] = 10;
  raise notice '% % % %', i1, i2, i3, i4;
end loop;
  end loop;
end loop;
  end loop;
end;
$$;

postgres=# do $$
declare v jsonb;
begin
  for i1 in 1..10 loop
for i2 in 1..10 loop
  for i3 in 1..10 loop
for i4 in 1..20 loop
  v[i1][i2][i3][i4] = '10'::jsonb;
  raise notice '% % % %', i1, i2, i3, i4;
end loop;
  end loop;
end loop;
  end loop;
end;
$$;

There are some unwanted white spaces

+   Jsonb   *jsonbSource = DatumGetJsonbP(*op->resvalue);
+   sbsrefstate->prevvalue = jsonb_get_element(jsonbSource,
+  sbsrefstate->upperindex,
+  sbsrefstate->numupper,
+  &sbsrefstate->prevnull,
+  false);


+   workspace = palloc0(MAXALIGN(sizeof(JsonbSubWorkspace)) +
+   nupper * (sizeof(Datum) + sizeof(Oid)));
+   workspace->expectArray = false;
+   ptr = ((char *) workspace) + MAXALIGN(sizeof(JsonbSubWorkspace));

Regards

Pavel



> Regards
>
> Pavel
>
>
>
>> But I looked into the bulk case a little further, and "outside the
>> WHERE clause" cuts both ways. The server reports an update whether or
>> not the JSON could have been modified, which suggests triggers will
>> fire for no-op updates. That's more clearly a problem.
>>
>> insert into j (val) values
>>  ('{"a": 100}'),
>>  ('{"a": "200"}'),
>>  ('{"b": "300"}'),
>>  ('{"c": {"d": 400}}'),
>>  

Re: Key management with tests

2021-01-12 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> > * Bruce Momjian (br...@momjian.us) wrote:
> > > Well, we have eight unused bits in the IV, so we could just increment
> > > that for every hint bit change that uses the same LSN, and then force a
> > > dummy WAL record when that 8-bit counter overflows --- that seems
> > > simpler than logging hint bits.
> > 
> > Sure, as long as we have a place to store that information..  We need to
> > have the full IV available when we go to decrypt the page.
> 
> Oh, yeah, we would need that counter recorded since previously the IV
> was made up of already-recorded information, i.e., the page LSN and page
> number.  However, the reason we don't always WAL-log hint bits is because
> we can afford to lose them, but in this case, any counter we need to
> store will need to be WAL logged since we can't afford to lose that
> counter value for decryption --- that gets us back to WAL-logging
> something during hint bit changes.  :-(

I don't think that's actually the case..?  The hole I'm talking about is
there exclusively for post-encryption storage of the tag and maybe this
part of the IV and would be zero'd out in the FPIs that actually go into
the WAL (which would be encrypted with the WAL key, not the data key).
All we would need to be confident of is that if the page with the hint
bit update gets encrypted and written out that the IV counter gets
incremented and also written out as part of that write.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Key management with tests

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 01:44:05PM -0500, Stephen Frost wrote:
> * Bruce Momjian (br...@momjian.us) wrote:
> > Well, we have eight unused bits in the IV, so we could just increment
> > that for every hint bit change that uses the same LSN, and then force a
> > dummy WAL record when that 8-bit counter overflows --- that seems
> > simpler than logging hint bits.
> 
> Sure, as long as we have a place to store that information..  We need to
> have the full IV available when we go to decrypt the page.

Oh, yeah, we would need that counter recorded since previously the IV
was made up of already-recorded information, i.e., the page LSN and page
number.  However, the reason we don't always WAL-log hint bits is because
we can afford to lose them, but in this case, any counter we need to
store will need to be WAL logged since we can't afford to lose that
counter value for decryption --- that gets us back to WAL-logging
something during hint bit changes.  :-(

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 01:15:44PM -0500, Bruce Momjian wrote:
> On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > I don't think there's any doubt that we need to make sure that the IV is
> > distinct and advancing the LSN to get a new one when needed for this
> > case seems like it's probably the way to do that.  Hint bit change
> > visibility to users isn't really at issue here- we can't use the same IV
> > multiple times.  The two options that we have are to either not actually
> > update the hint bit in such a case, or to make sure to change the
> > LSN/IV.  Another option would be to, if we're able to make a hole to put
> > the GCM tag on to the page somewhere, further widen that hole to include
> > an additional space for a counter that would be mixed into the IV, to
> > avoid having to do an XLOG NOOP.
> 
> Well, we have eight unused bits in the IV, so we could just increment
> that for every hint bit change that uses the same LSN, and then force a
> dummy WAL record when that 8-bit counter overflows --- that seems
> simpler than logging hint bits.

Sorry, I was incorrect.  The IV is 16 bytes, made up of the LSN (8
bytes) and the page number (4 bytes).  That leaves 4 bytes unused, or
2^32 values for hint bit changes, before we have to generate a dummy LSN
record.
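
To illustrate the layout being discussed (a sketch only; the function name and
the use of the spare bytes for a counter are assumptions from this thread, not
code from the patch):

	static void
	page_encryption_iv(unsigned char iv[16], XLogRecPtr lsn,
					   BlockNumber blkno, uint32 counter)
	{
		memcpy(iv, &lsn, sizeof(lsn));				/* 8 bytes: page LSN */
		memcpy(iv + 8, &blkno, sizeof(blkno));		/* 4 bytes: block number */
		memcpy(iv + 12, &counter, sizeof(counter));	/* 4 spare bytes, e.g. a
													 * hint-bit-change counter */
	}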

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-12 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > > I think one big question is that, since we are using a streaming cipher,
> > > do we care about hint bit changes showing to users?  I actually don't
> > > know.  If we do, some kind of dummy LSN record might be required, as you
> > > suggested.
> > 
> > I don't think there's any doubt that we need to make sure that the IV is
> > distinct and advancing the LSN to get a new one when needed for this
> > case seems like it's probably the way to do that.  Hint bit change
> > visibility to users isn't really at issue here- we can't use the same IV
> > multiple times.  The two options that we have are to either not actually
> > update the hint bit in such a case, or to make sure to change the
> > LSN/IV.  Another option would be to, if we're able to make a hole to put
> > the GCM tag on to the page somewhere, further widen that hole to include
> > an additional space for a counter that would be mixed into the IV, to
> > avoid having to do an XLOG NOOP.
> 
> Well, we have eight unused bits in the IV, so we could just increment
> that for every hint bit change that uses the same LSN, and then force a
> dummy WAL record when that 8-bit counter overflows --- that seems
> simpler than logging hint bits.

Sure, as long as we have a place to store that information..  We need to
have the full IV available when we go to decrypt the page.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Key management with tests

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 01:11:29PM -0500, Stephen Frost wrote:
> > I think one big question is that, since we are using a streaming cipher,
> > do we care about hint bit changes showing to users?  I actually don't
> > know.  If we do, some kind of dummy LSN record might be required, as you
> > suggested.
> 
> I don't think there's any doubt that we need to make sure that the IV is
> distinct and advancing the LSN to get a new one when needed for this
> case seems like it's probably the way to do that.  Hint bit change
> visibility to users isn't really at issue here- we can't use the same IV
> multiple times.  The two options that we have are to either not actually
> update the hint bit in such a case, or to make sure to change the
> LSN/IV.  Another option would be to, if we're able to make a hole to put
> the GCM tag on to the page somewhere, further widen that hole to include
> an additional space for a counter that would be mixed into the IV, to
> avoid having to do an XLOG NOOP.

Well, we have eight unused bits in the IV, so we could just increment
that for every hint bit change that uses the same LSN, and then force a
dummy WAL record when that 8-bit counter overflows --- that seems
simpler than logging hint bits.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Moving other hex functions to /common

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 11:26:51AM +0900, Michael Paquier wrote:
> The two only things that were not present are the set of checks for
> overflows, and the adjustments for varlena.c.  The first point makes
> the code of encode.c safer, as previously the code would issue a FATAL
> *after* writing out-of-bound data.  Now it issues an ERROR before any
> overwrite is done, and I have added some assertions as an extra safety
> net.  For the second, I think that it makes the allocation pattern
> easier to follow, similarly to checksum manifests.

Thanks for your work on this.  Looks good.  I posted your patch under my
key management patch and the cfbot reports all green:

http://cfbot.cputube.org/index.html

The key management patch calls the src/common hex functions from
src/backend/crypto, pg_alterckey, and the crypto tests, and these are
all tested by make check-world.
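
Just as an illustration (not part of the patch itself), the SQL-level
entry points that go through encode.c's hex routines are encode() and
decode():

    SELECT encode('\x01ff'::bytea, 'hex');   -- returns '01ff'
    SELECT decode('01ff', 'hex');            -- returns '\x01ff'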

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-12 Thread Stephen Frost
Greetings,

* Bruce Momjian (br...@momjian.us) wrote:
> On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote:
> > > This says:
> > >
> > > 
> > > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> > >
> > > wal_log_hints will be enabled automatically in encryption mode.
> > >
> > > Does that help?
> > 
> > IIUC it helps but not enough. When wal_log_hints is enabled, we write
> > a full-page image when updating hint bits if it's the first time
> > change for the page since the last checkpoint. But I'm concerned that
> > what if we change hint bits again after the page is flushed. We would
> > mark the page as dirtied but not write any WAL, leaving the page lsn
> > as it is.
> 
> I updated the wiki to be:
> 
>   
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
>   
>   wal_log_hints will be enabled automatically in encryption mode. However,
>   more than one hint bit change between checkpoints does not cause WAL
>   activity, which would cause the same LSN to be used for different page
>   images. 
> 
> I think one big question is that, since we are using a streaming cipher,
> do we care about hint bit changes showing to users?  I actually don't
> know.  If we do, some kind of dummy LSN record might be required, as you
> suggested.

I don't think there's any doubt that we need to make sure that the IV is
distinct and advancing the LSN to get a new one when needed for this
case seems like it's probably the way to do that.  Hint bit change
visibility to users isn't really at issue here- we can't use the same IV
multiple times.  The two options that we have are to either not actually
update the hint bit in such a case, or to make sure to change the
LSN/IV.  Another option would be to, if we're able to make a hole to put
the GCM tag on to the page somewhere, further widen that hole to include
an additional space for a counter that would be mixed into the IV, to
avoid having to do an XLOG NOOP.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Key management with tests

2021-01-12 Thread Andres Freund
On 2021-01-12 13:03:14 -0500, Bruce Momjian wrote:
> I think one big question is that, since we are using a streaming cipher,
> do we care about hint bit changes showing to users?  I actually don't
> know.  If we do, some kind of dummy LSN record might be required, as you
> suggested.

That'd lead to a *massive* increase of WAL record volume. It's one thing
to WAL log hint bit writes once per page per checkpoint. It's another to
do so on every single hint bit write.




Re: Key management with tests

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 09:40:53PM +0900, Masahiko Sawada wrote:
> > This says:
> >
> > 
> > https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> >
> > wal_log_hints will be enabled automatically in encryption mode.
> >
> > Does that help?
> 
> IIUC it helps but not enough. When wal_log_hints is enabled, we write
> a full-page image when updating hint bits if it's the first time
> change for the page since the last checkpoint. But I'm concerned that
> what if we change hint bits again after the page is flushed. We would
> mark the page as dirtied but not write any WAL, leaving the page lsn
> as it is.

I updated the wiki to be:


https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements

wal_log_hints will be enabled automatically in encryption mode. However,
more than one hint bit change between checkpoints does not cause WAL
activity, which would cause the same LSN to be used for different page
images. 

I think one big question is that, since we are using a streaming cipher,
do we care about hint bit changes showing to users?  I actually don't
know.  If we do, some kind of dummy LSN record might be required, as you
suggested.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: pg_upgrade test for binary compatibility of core data types

2021-01-12 Thread Bruce Momjian
On Tue, Jan 12, 2021 at 11:27:53AM -0600, Justin Pryzby wrote:
> On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
> > Uh, what exactly is missing from the beta checklist?  I read the patch
> > and commit message but don't understand it.
> 
> Did you try to use test.sh to upgrade from a prior release ?
> 
> Evidently it's frequently forgotten, as evidenced by all the "deferred
> maintenance" I had to do to allow testing the main patch (currently 0003).
> 
> See also:
> 
> commit 5bab1985dfc25eecf4b098145789955c0b246160
> Author: Tom Lane 
> Date:   Thu Jun 8 13:48:27 2017 -0400
> 
> Fix bit-rot in pg_upgrade's test.sh, and improve documentation.
> 
> Doing a cross-version upgrade test with test.sh evidently hasn't been
> tested since circa 9.2, because the script lacked case branches for
> old-version servers newer than 9.1.  Future-proof that a bit, and
> clean up breakage induced by our recent drop of V0 function call
> protocol (namely that oldstyle_length() isn't in the regression
> suite anymore).

Oh, that is odd.  I thought that was regularly run.  I have my own test
infrastructure that I run for every major release so I never have run
the built-in one, except for make check-world.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: WIP: BRIN multi-range indexes

2021-01-12 Thread Tomas Vondra




On 1/12/21 6:28 PM, John Naylor wrote:
On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:

 > [12-20 version]

Hi Tomas,

The measurements look good. In case it fell through the cracks, my 
earlier review comments for Bloom BRIN indexes regarding minor details 
don't seem to have been addressed in this version. I'll point to earlier 
discussion for convenience:


https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com 



https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com 





Whooops :-( I'll go through those again, thanks for reminding me.


 > The improvements are fairly minor:
 >
 > 1) Rejecting bloom filters that are clearly too large (larger than page)
 > early. This is imperfect, as it works for individual index keys, not the
 > whole row. But per discussion it seems useful.

I think this is good enough.

 > So based on this I'm tempted to just use the version with two hashes, as
 > implemented in 0005. It's much simpler than the partitioning scheme,
 > does not need any of the logic to generate primes etc.

Sounds like the best engineering decision.

Circling back to multi-minmax build times, I ran a couple quick tests on 
bigger hardware, and found that not only is multi-minmax slower than 
minmax, which is to be expected, but also slower than btree. (unlogged 
table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)


btree          38.3s
minmax         26.2s
multi-minmax  110s

Since btree indexes are much larger, I imagine something algorithmic is 
involved. Is it worth digging further to see if some code path is taking 
more time than we would expect?




I suspect it's due to minmax having to decide which "ranges" to merge, 
which requires repeated sorting, etc. I certainly don't dare to claim 
the current algorithm is perfect. I wouldn't have expected such a big 
difference, though - so definitely worth investigating.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




pg_dump PublicationRelInfo objects need to be marked with owner

2021-01-12 Thread Tom Lane
Although pg_publication_rel entries don't have an owner per se,
it's still important for the ArchiveEntry created for one to be
marked with an appropriate owner, because that is what determines
which role will be used to run the ALTER PUBLICATION command when
doing a restore with --use-set-session-authorization.  As the
code stands, the session's original role will be used to run
ALTER PUBLICATION.  Admittedly, to have enough privileges to
create the publication, the original role probably has to be
superuser, so that generally this'd be moot.  Still, in principle
it's wrong, and maybe there are edge cases where you'd get a
failure that you shouldn't.

Accordingly, I propose the attached patch to ensure that the ALTER
is done as the owner of the publication.  While I was at it I rewrote
getPublicationTables to do just one query instead of one per table ---
the existing coding seems like it could be quite inefficient for lots
of tables.

Upon looking at the code, I notice that the ALTER PUBLICATION ref
page is lying about the permissions required.  Yes, you do have to
own the publication, but you *also* have to own the table(s) you
are adding.  So this raises the question of exactly which user
we should try to run the command as.  I judged that the publication
owner is more likely to be the right thing.
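
To illustrate the permission rule (a hedged sketch; the role, database
and table names are made up, not taken from the patch or docs):

    -- as superuser
    CREATE ROLE alice LOGIN;
    GRANT CREATE ON DATABASE postgres TO alice;  -- so alice may create publications
    CREATE TABLE t1 (a int);                     -- owned by the superuser
    SET ROLE alice;
    CREATE PUBLICATION p1;                       -- alice owns p1
    ALTER PUBLICATION p1 ADD TABLE t1;           -- fails: alice must own t1 too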

Thoughts?

regards, tom lane

diff --git a/src/bin/pg_dump/common.c b/src/bin/pg_dump/common.c
index 255f891432..b0f02bc1f6 100644
--- a/src/bin/pg_dump/common.c
+++ b/src/bin/pg_dump/common.c
@@ -52,6 +52,7 @@ static DumpableObject **oprinfoindex;
 static DumpableObject **collinfoindex;
 static DumpableObject **nspinfoindex;
 static DumpableObject **extinfoindex;
+static DumpableObject **pubinfoindex;
 static int	numTables;
 static int	numTypes;
 static int	numFuncs;
@@ -59,6 +60,7 @@ static int	numOperators;
 static int	numCollations;
 static int	numNamespaces;
 static int	numExtensions;
+static int	numPublications;
 
 /* This is an array of object identities, not actual DumpableObjects */
 static ExtensionMemberId *extmembers;
@@ -93,6 +95,7 @@ getSchemaData(Archive *fout, int *numTablesPtr)
 	CollInfo   *collinfo;
 	NamespaceInfo *nspinfo;
 	ExtensionInfo *extinfo;
+	PublicationInfo *pubinfo;
 	InhInfo*inhinfo;
 	int			numAggregates;
 	int			numInherits;
@@ -247,7 +250,9 @@ getSchemaData(Archive *fout, int *numTablesPtr)
 	getPolicies(fout, tblinfo, numTables);
 
 	pg_log_info("reading publications");
-	getPublications(fout);
+	pubinfo = getPublications(fout, &numPublications);
+	pubinfoindex = buildIndexArray(pubinfo, numPublications,
+   sizeof(PublicationInfo));
 
 	pg_log_info("reading publication membership");
 	getPublicationTables(fout, tblinfo, numTables);
@@ -937,6 +942,17 @@ findExtensionByOid(Oid oid)
 	return (ExtensionInfo *) findObjectByOid(oid, extinfoindex, numExtensions);
 }
 
+/*
+ * findPublicationByOid
+ *	  finds the entry (in pubinfo) of the publication with the given oid
+ *	  returns NULL if not found
+ */
+PublicationInfo *
+findPublicationByOid(Oid oid)
+{
+	return (PublicationInfo *) findObjectByOid(oid, pubinfoindex, numPublications);
+}
+
 /*
  * findIndexByOid
  *		find the entry of the index with the given oid
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index f2c88fa6b5..52973d9e9c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -3865,8 +3865,8 @@ dumpPolicy(Archive *fout, PolicyInfo *polinfo)
  * getPublications
  *	  get information about publications
  */
-void
-getPublications(Archive *fout)
+PublicationInfo *
+getPublications(Archive *fout, int *numPublications)
 {
 	DumpOptions *dopt = fout->dopt;
 	PQExpBuffer query;
@@ -3886,7 +3886,10 @@ getPublications(Archive *fout)
 ntups;
 
 	if (dopt->no_publications || fout->remoteVersion < 10)
-		return;
+	{
+		*numPublications = 0;
+		return NULL;
+	}
 
 	query = createPQExpBuffer();
 
@@ -3964,6 +3967,9 @@ getPublications(Archive *fout)
 	PQclear(res);
 
 	destroyPQExpBuffer(query);
+
+	*numPublications = ntups;
+	return pubinfo;
 }
 
 /*
@@ -4072,7 +4078,8 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], int numTables)
 	DumpOptions *dopt = fout->dopt;
 	int			i_tableoid;
 	int			i_oid;
-	int			i_pubname;
+	int			i_prpubid;
+	int			i_prrelid;
 	int			i,
 j,
 ntups;
@@ -4082,15 +4089,39 @@ getPublicationTables(Archive *fout, TableInfo tblinfo[], int numTables)
 
 	query = createPQExpBuffer();
 
-	for (i = 0; i < numTables; i++)
+	/* Collect all publication membership info. */
+	appendPQExpBufferStr(query,
+		 "SELECT tableoid, oid, prpubid, prrelid "
+		 "FROM pg_catalog.pg_publication_rel");
+	res = ExecuteSqlQuery(fout, query->data, PGRES_TUPLES_OK);
+
+	ntups = PQntuples(res);
+
+	i_tableoid = PQfnumber(res, "tableoid");
+	i_oid = PQfnumber(res, "oid");
+	i_prpubid = PQfnumber(res, "prpubid");
+	i_prrelid = PQfnumber(res, "prrelid");
+
+	/* this allocation may be more than we need */
+	pubrinfo = pg_malloc(ntups * 

Re: WIP: BRIN multi-range indexes

2021-01-12 Thread John Naylor
On Sat, Dec 19, 2020 at 8:15 PM Tomas Vondra 
wrote:
> [12-20 version]

Hi Tomas,

The measurements look good. In case it fell through the cracks, my earlier
review comments for Bloom BRIN indexes regarding minor details don't seem
to have been addressed in this version. I'll point to earlier discussion
for convenience:

https://www.postgresql.org/message-id/CACPNZCt%3Dx-fOL0CUJbjR3BFXKgcd9HMPaRUVY9cwRe58hmd8Xg%40mail.gmail.com

https://www.postgresql.org/message-id/CACPNZCuqpkCGt8%3DcywAk1kPu0OoV_TjPXeV-J639ABQWyViyug%40mail.gmail.com

> The improvements are fairly minor:
>
> 1) Rejecting bloom filters that are clearly too large (larger than page)
> early. This is imperfect, as it works for individual index keys, not the
> whole row. But per discussion it seems useful.

I think this is good enough.

> So based on this I'm tempted to just use the version with two hashes, as
> implemented in 0005. It's much simpler than the partitioning scheme,
> does not need any of the logic to generate primes etc.

Sounds like the best engineering decision.

Circling back to multi-minmax build times, I ran a couple quick tests on
bigger hardware, and found that not only is multi-minmax slower than
minmax, which is to be expected, but also slower than btree. (unlogged
table ~12GB in size, maintenance_work_mem = 1GB, median of three runs)

btree  38.3s
minmax 26.2s
multi-minmax  110s

Since btree indexes are much larger, I imagine something algorithmic is
involved. Is it worth digging further to see if some code path is taking
more time than we would expect?
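
For reference, the shape of the comparison was roughly as follows (a
hedged sketch only; the table, column and opclass names are my
assumptions, not the actual test script):

    CREATE UNLOGGED TABLE t (a bigint);
    INSERT INTO t SELECT i FROM generate_series(1, 100000000) i;
    SET maintenance_work_mem = '1GB';
    CREATE INDEX t_btree  ON t USING btree (a);
    CREATE INDEX t_minmax ON t USING brin (a int8_minmax_ops);
    CREATE INDEX t_multi  ON t USING brin (a int8_minmax_multi_ops);  -- opclass from the patch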

--
John Naylor
EDB: http://www.enterprisedb.com


Re: pg_upgrade test for binary compatibility of core data types

2021-01-12 Thread Justin Pryzby
On Tue, Jan 12, 2021 at 12:15:59PM -0500, Bruce Momjian wrote:
> On Mon, Jan 11, 2021 at 10:13:52PM -0600, Justin Pryzby wrote:
> > On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> > > I think these patches could use some in-place documentation of what they 
> > > are
> > > trying to achieve and how they do it.  The required information is spread
> > > over a lengthy thread.  No one wants to read that.  Add commit messages to
> > > the patches.
> > 
> 0001 patch fixes pg_upgrade/test.sh, which was dysfunctional.
> > Portions of the first patch were independently handled by commits 52202bb39,
> > fa744697c, 091866724.  So this is rebased on those.
> > I guess updating this script should be a part of a beta-checklist somewhere,
> > since I guess nobody will want to backpatch changes for testing older 
> > releases.
> 
> Uh, what exactly is missing from the beta checklist?  I read the patch
> and commit message but don't understand it.

Did you try to use test.sh to upgrade from a prior release ?

Evidently it's frequently forgotten, as evidenced by all the "deferred
maintenance" I had to do to allow testing the main patch (currently 0003).

See also:

commit 5bab1985dfc25eecf4b098145789955c0b246160
Author: Tom Lane 
Date:   Thu Jun 8 13:48:27 2017 -0400

Fix bit-rot in pg_upgrade's test.sh, and improve documentation.

Doing a cross-version upgrade test with test.sh evidently hasn't been
tested since circa 9.2, because the script lacked case branches for
old-version servers newer than 9.1.  Future-proof that a bit, and
clean up breakage induced by our recent drop of V0 function call
protocol (namely that oldstyle_length() isn't in the regression
suite anymore).

-- 
Justin




Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2021-01-12 Thread Anastasia Lubennikova

On 12.01.2021 00:51, Tomas Vondra wrote:



On 1/11/21 10:00 PM, Anastasia Lubennikova wrote:

On 11.01.2021 01:35, Tomas Vondra wrote:

Hi,

I started looking at this patch again, hoping to get it committed in 
this CF, but I think there's a regression in handling TOAST tables 
(compared to the v3 patch submitted by Pavan in February 2019).


The test I'm running is a very simple one (see test.sql):

1) start a transaction
2) create a table with a text column
3) copy freeze data into it
4) use pg_visibility to see how many blocks are all_visible both in the
   main table and it's TOAST table
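
In SQL terms the steps boil down to something like this (a hedged sketch
with made-up names; the real script is the attached test.sql):

    CREATE EXTENSION IF NOT EXISTS pg_visibility;
    BEGIN;
    CREATE TABLE t (a int, b text);
    COPY t FROM '/tmp/data.csv' WITH (FREEZE);
    COMMIT;
    SELECT count(*) AS pages,
           count(*) FILTER (WHERE NOT all_visible) AS not_all_visible
      FROM pg_visibility_map('t');
    -- plus the same query against the table's TOAST relation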

For v3 patch (applied on top of 278584b526 and 
s/hi_options/ti_options) I get this:


   pages   NOT all_visible
  --
  main   637 0
  toast    50001 3

There was some discussion about relcache invalidations causing a 
couple TOAST pages not be marked as all_visible, etc.


However, for this patch on master I get this

   pages   NOT all_visible
  --
  main   637 0
  toast    50001 50001

So no pages in TOAST are marked as all_visible. I haven't 
investigated what's causing this, but IMO that needs fixing to make 
this patch RFC.


Attached is the test script I'm using, and a v5 of the patch - 
rebased on current master, with some minor tweaks to comments etc.




Thank you for attaching the test script. I reproduced the problem. 
This regression occurs because TOAST internally uses heap_insert().

You have asked upthread about adding this optimization to heap_insert().

I wrote a quick fix, see the attached patch 0002. The TOAST test 
passes now, but I haven't tested performance or any other use-cases yet.

I'm going to test it properly in a couple of days and share results.



Thanks. I think it's important to make this work for TOAST tables - it 
often stores most of the data, and it was working in v3 of the patch. 
I haven't looked into the details, but if it's really just due to 
TOAST using heap_insert, I'd say it just confirms the importance of 
tweaking heap_insert too. 



I've tested performance. All tests were run on my laptop, latest master 
with and without patches, all default settings, except disabled 
autovacuum and installed pg_stat_statements extension.


The VACUUM is significantly faster with the patch, as it only checks the 
visibility map. COPY speed fluctuates a lot between tests, but I didn't 
notice any trend.
I would expect a minor slowdown with the patch, as we need to handle 
visibility map pages during COPY FREEZE. But in some runs, the patched 
version was faster than current master, so the impact of the patch is 
insignificant.


I ran 3 different tests:

1) Regular table (final size 5972 MB)

patched   |   master

COPY FREEZE data 3 times:

33384,544 ms   31135,037 ms
31666,226 ms   31158,294 ms
32802,783 ms   33599,234 ms

VACUUM
54,185 ms         48445,584 ms


2) Table with TOAST (final size 1207 MB where 1172 MB is in toast table)

patched |   master

COPY FREEZE data 3 times:

368166,743 ms    383231,077 ms
398695,018 ms    454572,630 ms
410174,682 ms    567847,288 ms

VACUUM
43,159 ms    6648,302 ms


3) Table with a trivial BEFORE INSERT trigger (final size 5972 MB)

patched |   master

COPY FREEZE data 3 times:

73544,225 ms  64967,802 ms
90960,584 ms  71826,362 ms
81356,025 ms  80023,041 ms

VACUUM
49,626 ms    40100,097 ms
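
For reference, the "trivial BEFORE INSERT trigger" in test 3 could look
roughly like this (a hedged sketch, not the actual test script):

    CREATE FUNCTION trg_noop() RETURNS trigger LANGUAGE plpgsql AS
    $$ BEGIN RETURN NEW; END $$;
    CREATE TRIGGER t_before_ins BEFORE INSERT ON t
        FOR EACH ROW EXECUTE FUNCTION trg_noop();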

I took another look at yesterday's patch and it looks fine to me. So 
now I am waiting for your review.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: pg_upgrade test for binary compatibility of core data types

2021-01-12 Thread Bruce Momjian
On Mon, Jan 11, 2021 at 10:13:52PM -0600, Justin Pryzby wrote:
> On Mon, Jan 11, 2021 at 03:28:08PM +0100, Peter Eisentraut wrote:
> > I think these patches could use some in-place documentation of what they are
> > trying to achieve and how they do it.  The required information is spread
> > over a lengthy thread.  No one wants to read that.  Add commit messages to
> > the patches.
> 
> 0001 patch fixes pg_upgrade/test.sh, which was dysfunctional.
> Portions of the first patch were independently handled by commits 52202bb39,
> fa744697c, 091866724.  So this is rebased on those.
> I guess updating this script should be a part of a beta-checklist somewhere,
> since I guess nobody will want to backpatch changes for testing older 
> releases.

Uh, what exactly is missing from the beta checklist?  I read the patch
and commit message but don't understand it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: libpq compression

2021-01-12 Thread Andrey Borodin



> On 12 Jan 2021, at 20:47, Konstantin Knizhnik wrote:
> 
>> I think we should come up with a minimal, preliminary 0001 patch which is
>> common between the 3 compression patches (or at least the two using zstd).  
>> The
>> ./configure changes and a compressionlibs struct would also be included.  I'm
>> planning to do something like this with the next revision of my patchset.
>> 
> It will be great to have such common preliminary patch including zstd support.

+1. I'd also rebase my WAL FPI patch on top of common code.

Best regards, Andrey Borodin.



Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2021-01-12 Thread Andrey Borodin



> On 12 Jan 2021, at 13:49, Noah Misch wrote:
> 
> What do you think of abandoning slru-truncate-t-insurance entirely?  As of
> https://postgr.es/m/20200330052809.gb2324...@rfd.leadboat.com I liked the idea
> behind it, despite its complicating the system for hackers and DBAs.  The
> TruncateMultiXact() interaction rendered it less appealing.  In v14+, commit
> cd5e822 mitigates the kind of bugs that slru-truncate-t-insurance mitigates,
> further reducing the latter's value.  slru-truncate-t-insurance does mitigate
> larger trespasses into unlink-eligible space, though.
It seems to me that not committing an insurance patch is not a mistake. Let's 
abandon slru-truncate-t-insurance for now.

>> Fix  certainly worth backpatching.
> 
> I'll push it on Saturday, probably.
Thanks!

Best regards, Andrey Borodin.



Re: Yet another fast GiST build

2021-01-12 Thread Andrey Borodin



> On 12 Jan 2021, at 18:49, Heikki Linnakangas wrote:
> 
>> PFA patch with implementation.
> 
> I did a bit of cleanup on the function signature. The .sql script claimed 
> that gist_page_items() took bytea as argument, but in reality it was a 
> relation name, as text. I changed it so that it takes a page image as 
> argument, instead of reading the block straight from the index. Mainly to 
> make it consistent with brin_page_items(), if it wasn't for that precedence I 
> might've gone either way on it.
bt_page_items() takes a relation name and block number; that was the reason for 
doing it that way. But all the other *_page_items() functions (heap, brin, hash) 
do it as in v4, so I think that's the more common way.

> 
> Fixed the docs accordingly, and ran pgindent. New patch version attached.

Thanks! Looks good to me.

One more question: will the bytea tests run correctly on 32-bit/different-endian 
systems?

Best regards, Andrey Borodin.



Re: Key management with tests

2021-01-12 Thread Stephen Frost
Greetings,

* Neil Chen (carpenter.nail...@gmail.com) wrote:
> On Tue, Jan 12, 2021 at 10:47 AM Stephen Frost  wrote:
> > This is an interesting question but ultimately I don't think we should
> > be looking at this from the perspective of allowing arbitrary changes to
> > the page format.  The challenge is that much of the page format, today,
> > is defined by a C struct and changing the way that works would require a
> > great deal of code to be modified and turn this into a massive effort,
> > assuming we wish to have the same compiled binary able to work with both
> > unencrypted and encrypted clusters, which I do believe is a requirement.
> >
> > The thought that I had was to, instead, try to figure out if we could
> > fudge some space by, say, putting a 128-bit 'hole' at the end of the
> > page and just move pd_special back, effectively making the page seem
> > 'smaller' to all of the code that uses it, except for the code that
> > knows how to do the decryption.  I ran into some trouble with that but
> > haven't quite sorted out what happened yet.  Other ideas would be to put
> > it before pd_special, or maybe somewhere else, but a lot depends on the
> > code's expectations.
>
> I agree that we should not make too many changes to affect the use of
> unencrypted clusters. But as a personal opinion only, I don't think it's a
> good idea to add some "implicit" tricks. To provide an inspiration, can we
> add a flag to mark whether the page format has been changed:

Sure, of course we could add such a flag, but I don't see how that would
actually help with the issue?

> In this way, I think it has little effect on the unencrypted cluster, and
> we can also modify the page format as we wish. Of course, it's also
> possible that I didn't understand your design correctly, or there's
> something wrong with my idea. :D

No, we can't 'modify the page format as we wish'- if we change away from
using a C structure then we're going to be modifying quite a bit of
code which otherwise doesn't need to be changed.  The proposed flag
doesn't actually make a different page format work, the only thing it
would do would be to allow some parts of the cluster to be encrypted and
other parts not be, but I don't know that that's actually a useful
capability or a good reason to use one of those bits.  Having it handled
on a cluster level, at initdb time through pg_control, seems like it'd
work just fine.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: libpq compression

2021-01-12 Thread Konstantin Knizhnik




On 12.01.2021 18:38, Justin Pryzby wrote:

On Tue, Jan 12, 2021 at 08:44:43AM +0300, Konstantin Knizhnik wrote:

On 11.01.2021 20:38, Tomas Vondra wrote:

1) Fixes the MSVC makefile. The list of files is sorted alphabetically,
so I've added the file at the end.

Thank you

This is still failing the windows build.

I think you need something like this, which I have in my zstd/pg_dump patch.

--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -307,6 +307,7 @@ sub GenerateFiles
 HAVE_LIBXML2=> undef,
 HAVE_LIBXSLT=> undef,
 HAVE_LIBZ   => $self->{options}->{zlib} ? 1 : undef,
+   HAVE_LIBZSTD=> $self->{options}->{zstd} ? 1 : undef,



Thank you.


I think we should come up with a minimal, preliminary 0001 patch which is
common between the 3 compression patches (or at least the two using zstd).  The
./configure changes and a compressionlibs struct would also be included.  I'm
planning to do something like this with the next revision of my patchset.

It will be great to have such common preliminary patch including zstd 
support.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: libpq compression

2021-01-12 Thread Justin Pryzby
On Tue, Jan 12, 2021 at 08:44:43AM +0300, Konstantin Knizhnik wrote:
> On 11.01.2021 20:38, Tomas Vondra wrote:
> > 1) Fixes the MSVC makefile. The list of files is sorted alphabetically,
> > so I've added the file at the end.
> Thank you

This is still failing the windows build.

I think you need something like this, which I have in my zstd/pg_dump patch.

--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -307,6 +307,7 @@ sub GenerateFiles
HAVE_LIBXML2=> undef,
HAVE_LIBXSLT=> undef,
HAVE_LIBZ   => $self->{options}->{zlib} ? 1 : undef,
+   HAVE_LIBZSTD=> $self->{options}->{zstd} ? 1 : undef,


I think we should come up with a minimal, preliminary 0001 patch which is
common between the 3 compression patches (or at least the two using zstd).  The
./configure changes and a compressionlibs struct would also be included.  I'm
planning to do something like this with the next revision of my patchset.

-- 
Justin




Re: Executing inet_server_addr/port() in parallel workers

2021-01-12 Thread Tom Lane
BTW, just for the archives' sake: inet_client_addr() and
inet_client_port() also access MyProcPort, so they'd also have
this issue, except that they're already marked parallel-restricted.
So somebody just missed the server equivalents when marking the
parallel safety of built-in functions.

contrib/sslinfo is also full of uses of MyProcPort, but all its
functions are properly marked parallel-restricted.

I did not see any other risky-looking uses of MyProcPort in a
quick grep.

regards, tom lane




Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-12 Thread Alvaro Herrera
On 2021-Jan-11, Thomas Munro wrote:

> I didn't check the migration history of this code but it seems that
> endlogSegNo doesn't currently have the right scoping to achieve the
> goal of that last comment, so checkpoints finish up repeatedly search
> for the next free slot, starting at the low end each time,

Apparently b2a5545bd63f changed; before that commit, that code
(including the quoted comment) was all in RemoveOldXlogFiles, and
endlogSegNo was calculated only once.  But ISTM that even with that
formulation it had the problem you point out.  The real problem is the
loop hidden inside InstallXLogFileSegment().

-- 
Álvaro Herrera




Re: Fix a typo in xlogreader.c

2021-01-12 Thread Alvaro Herrera
On 2021-Jan-12, Masahiko Sawada wrote:

> Hi,
> 
> The function comment of RestoreBlockImage() seems not correct since it
> returns a boolean, not the buffer number.

You're right -- this has always been wrong.  Pushed your patch.

Thank you!



-- 
Álvaro Herrera39°49'30"S 73°17'W
"It takes less than 2 seconds to get to 78% complete; that's a good sign.
A few seconds later it's at 90%, but it seems to have stuck there.  Did
somebody make percentages logarithmic while I wasn't looking?"
http://smylers.hates-software.com/2005/09/08/1995c749.html




Re: Executing inet_server_addr/port() in parallel workers

2021-01-12 Thread Tom Lane
Masahiko Sawada  writes:
> While investigating a customer issue it's turned out that if a
> parallel worker executes inet_server_addr() and inet_server_port() the
> results are always null because MyProcPort is not set in parallel
> workers.

Check.

> To fix this issue, I think there are two options:
> 1. Inherits the local address information from postmaster or the
> leader to parallel workers.
> 2. Have both inet_server_addr() and inet_server_port() parallel restricted.

> Since auxiliary processes including parallel workers don't communicate
> directly to the client, this information is basically not necessary.
> Also, the option #1 needs code change to pass the information, which
> brings complexity and is not good in terms of back-patching. So I'd
> prefer the option #2. I've attached the patch doing the option #2.

Um, but #2 is surely not back-patchable.  Changing initial catalog
contents in a back branch is a big PITA, so we generally wouldn't
even consider it except for critical security problems.  Which this
is not.

Having said that, I'm fine with the idea of just marking these
parallel-restricted in HEAD and ignoring the problem in the back
branches.

regards, tom lane




Re: Added schema level support for publication.

2021-01-12 Thread vignesh C
On Mon, Jan 11, 2021 at 11:45 AM Bharath Rupireddy wrote:
>
> On Sun, Jan 10, 2021 at 11:21 PM vignesh C  wrote:
> > On Sat, Jan 9, 2021 at 8:08 PM Bharath Rupireddy wrote:
> > > I think this feature can be useful, in case a user has a lot of tables
> > > to publish inside a schema. Having said that, I wonder if this feature
> > > mandates users to create the same schema with same
> > > permissions/authorizations manually on the subscriber, because logical
> > > replication doesn't propagate any ddl's so are the schema or schema
> > > changes? Or is it that the list of tables from the publisher can go
> > > into a different schema on the subscriber?
> > >
> >
> > DDL's will not be propagated to the subscriber. Users have to create
> > the schema & tables in the subscriber. No change in
> > Permissions/authorizations handling, it will be the same as the
> > existing behavior for relations.
>
> Looks like the existing behaviour already requires users to create the
> schema on the subscriber when publishing the tables from that schema.
> Otherwise, an error is thrown on the subscriber [1].
>
> [1] on publisher:
> CREATE SCHEMA myschema;
> CREATE TABLE myschema.t1(a1 int, b1 int);
> INSERT INTO myschema.t1_myschema SELECT i, i+10 FROM generate_series(1,10) i;
> CREATE PUBLICATION testpub FOR TABLE myschema.t1;
>
> on subscriber:
> postgres=# CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost
> dbname=postgres user=bharath port=5432' PUBLICATION testpub;
> ERROR:  schema "myschema" does not exist
> CREATE SCHEMA myschema;
> CREATE TABLE myschema.t1(a1 int, b1 int);
> postgres=# CREATE SUBSCRIPTION testsub CONNECTION 'host=localhost
> dbname=postgres user=bharath port=5432' PUBLICATION testpub;
> NOTICE:  created replication slot "testsub" on publisher
> CREATE SUBSCRIPTION
>

Yes, this feature will also have the same behavior; DDL creation should
be taken care of by the DBA, similar to how it is handled today, maybe
using pg_dump or SQL scripts/statements.

> > > Since the schema can have other objects such as data types, functions,
> > > operators, I'm sure with your feature, non-table objects will be
> > > skipped.
> > >
> >
> > Yes, only table data will be sent to subscribers, non-table objects
> > will be skipped.
>
> Looks like the existing CREATE PUBLICATION FOR ALL TABLES, which is
> for all the tables in the database, does this i.e. skips non-table
> objects and temporary tables, foreign tables and so on. So, your
> feature also can behave the same way, but within the scope of the
> given schema/s.
>

Yes, it will support only normal tables. Non table objects, foreign
tables & temporary tables will not be supported.

> > > As Amit pointed out earlier, the behaviour when schema dropped, I
> > > think we should also consider when schema is altered, say altered to a
> > > different name, maybe we should change that in the publication too.
> > >
> >
> > I agree that when schema is altered the renamed schema should be
> > reflected in the publication.
>
> I think, it's not only making sure that the publisher side has the new
> altered schema, but also the subscriber needs those alters. Having
> said that, since these alters come under DDL changes and in logical
> replication we don't publish the scheme changes to the subscriber, we
> may not need to anything extra for informing the schema alters to the
> subscriber from the publisher, the users might have to do the same
> schema alter on the subscriber and then a ALTER SUBSCRIPTION testsub
> REFRESH PUBLICATION;  should work for them? If this understanding is
> correct, then we should document this.
>

Yes, alter schema changes will be reflected in the publication, the
corresponding change needs to be done by the user on the subscriber
side. Once a user does ALTER SUBSCRIPTION testsub REFRESH PUBLICATION,
the new altered schema changes will be reflected in the subscriber. I
will update the documentation to note that the user needs to take care of
the subscription refresh.

> > > In general, what happens if we have some temporary tables or foreign
> > > tables inside the schema, will they be allowed to send the data to
> > > subscribers?
> > >
> >
> > Temporary tables & foreign tables will not be added to the publications.
>
> Yes the existing logical replication framework doesn't allow
> replication of temporary, unlogged, foreign tables and other non-table
> relations such as materialized views, indexes etc [1]. The CREATE
> PUBLICATION statement either fails in check_publication_add_relation
> or before that.
>
> CREATE PUBLICATION testpub FOR TABLE tab1, throwing the error if the
> single table tab1 is any of the above restricted tables, seems fine.
> But, if there's a list of tables with CREATE PUBLICATION testpub FOR
> TABLE normal_tab1, temp_tab2, normal_tab3, foreign_tab4,
> unlogged_tab5, normal_tab6, normal_tab7 ..; This query fails on
> first encounter of the restricted table, say at temp_tab2. Whereas,
> CREATE PUBLICATION testpub FOR ALL TABLES; 

Re: libpq compression

2021-01-12 Thread Konstantin Knizhnik



On 12.01.2021 4:20, Justin Pryzby wrote:

On Mon, Jan 11, 2021 at 04:53:51PM +0300, Konstantin Knizhnik wrote:

On 09.01.2021 23:31, Justin Pryzby wrote:

I suggest that there should be an enum of algorithms, which is constant across
all servers.  They would be unconditionally included and not #ifdef depending
on compilation options.

I do not think that it is possible (even right now, it is possible to build
Postgres without zlib support).
Also if new compression algorithms  are added, then in any case we have to
somehow handle situation when
old client is connected to new server and visa versa.

I mean an enum of all compression supported in the master branch, starting with
ZLIB = 1.  I think this applies to libpq compression (which requires client and
server to handle a common compression algorithm), and pg_dump (output of which
needs to be read by pg_restore), but maybe not TOAST, the patch for which
supports extensions, with dynamically allocated OIDs.


Sorry, I do not understand the goal of introducing this enum.
Algorithms are in any case specified by name. And internally there is no 
need for such an enum, at least in libpq compression.



In config.sgml, it says that libpq_compression defaults to on (on the server
side), but in libpq.sgml it says that it defaults to off (on the client side).
Is that what's intended ?  I would've thought the defaults would match, or that
the server would enforce a default more conservative than the client's (the DBA
should probably have to explicitly enable compression, and need to "opt-in"
rather than "opt-out").

Yes, it is intended behavior: the libpq_compression GUC allows prohibiting
compression requests from clients if, for some reason (security, CPU
consumption), that is not desired.
But by default server should support compression if it is requested by
client.

It's not clear to me if that's true..   It may be what's convenient for you,
especially during development, but that doesn't mean it's safe or efficient or
what's generally desirable to everyone.


Definitely it is only my point of view; maybe DBAs will have different 
opinions.
I think that compression is not a dangerous or resource-consuming 
feature which should be disabled by default.
At least there is no GUC prohibiting SSL connections, and they are even 
more expensive.
In any case, I will be glad to collect votes on whether compression 
requests should by default be rejected by the server or not.



But client should not request compression by default: it makes sense only for
queries returning large result sets or transferring a lot of data (liek
COPY).

I think you're making assumptions about everyone's use of the tools, and it's
better if the DBA makes that determination.  The clients aren't generally under
the admin's control, and if they don't request compression, then it's not used.
If they request compression, then the DBA still has control over whether it's
allowed.

We agree that it should be disabled by default, but I suggest that it's most
flexible if client's makes the request and allow the server to decide.  By
default the server should deny/ignore the request, with the client gracefully
falling back to no compression.


I think that compression will be most efficient for internal connections 
(i.e. replication, bulk data loading, ...)

which are mostly controlled by DBA.

Compression would have little effect on most queries, especially at default
level=1.
I mostly agree with you here, except that compression level 1 gives 
little effect.
All my experiments, both with page level compression and libpq 
compression, show that the default compression level provides the 
optimal balance between compression ratio and compression speed.

In Daniil's benchmark the difference in compression ratio between levels 
1 and 19 in zstd is only 30%.


Right now "compression" parameter accepts not only boolean values but 
also

list of suggested algorithms with optional compression level, like
"zstd:1,zlib"

You're talking about the client compression param.  I'm suggesting that the
server should allow fine-grained control of what compression algs are permitted
at *runtime*.  This would allow distributions to compile with all compression
libraries enabled at configure time, and still allow an DBA to disable one
without recompiling.


Frankly speaking I do not see much sense in it.
My point of view is the following: there are several well-known and 
widely used compression algorithms now: zlib, zstd, lz4. There are a few 
others, but the results of my various benchmarks (mostly for page level 
compression) show that zstd provides the best compression ratio/speed 
balance and lz4 the best speed.


zlib is available almost everywhere and postgres by default is 
configured with zlib.
So it can be considered as default compression algorithm available 
everywhere.
If Postgres was built with zstd or lz4 (not currently supported for 
libpq compression) then it should be used instead of zlib

because it is faster and provides better 

Re: Cirrus CI (Windows help wanted)

2021-01-12 Thread Andrew Dunstan


On 1/5/21 11:19 PM, Thomas Munro wrote:
>
> It seems we can make our own, either on-the-fly with caching, or
> hosted somewhere, like this:
>
> https://cirrus-ci.org/guide/docker-builder-vm/#dockerfile-as-a-ci-environment
>
>


OK, I got this working.


There is some weirdness that I had to work around in the way they do
docker. So here's what works for me, with the docker image nicely cached
even across repos:


.cirrus.yml:

task:
  name: Windows
  windows_container:
    dockerfile: ci/Dockerfile
  build_script:
  - cd "c:/Program Files (x86)/Microsoft Visual 
Studio/2019/BuildTools/VC/Auxiliary/Build"
  - vcvarsall x64
  - echo on
  - cd 
C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build
  - set PATH=C:\strawberry\perl\bin;%PATH%
  - perl src/tools/msvc/mkvcbuild.pl
  - set IgnoreWarnIntDirInTempDetected=true
  - msbuild pgsql.sln
  test_script:
  - set PATH=C:\strawberry\perl\bin;%PATH%
  - perl src/tools/msvc/vcregress.pl check


ci/Dockerfile:

FROM cirrusci/windowsservercore:2019

SHELL ["powershell", "-NoLogo", "-NoProfile", "-Command"]


RUN \
    pwd ; \
    choco feature disable -n=usePackageExitCodes ; \
    choco install -y --no-progress --version=16.8.3.0 visualstudio2019buildtools


# cirrus does something odd with this command, so it's stuck in a bat file
# and copied to the docker container and then executed
COPY ci/inst-tools.bat .
RUN \
    cmd /c .\inst-tools.bat

RUN \
    choco install -y --no-progress strawberryperl ; \
    choco install -y --no-progress winflexbison diffutils ; \
    Rename-Item -Path c:\ProgramData\chocolatey\bin\win_flex.exe flex.exe ; \
    Rename-Item -Path c:\ProgramData\chocolatey\bin\win_bison.exe bison.exe ; \
    Remove-Item C:\ProgramData\chocolatey\logs\*.* -Force -Recurse ; \
    Remove-Item C:\Users\ContainerAdministrator\AppData\Local\Temp\*.* -Force -Recurse

ci/inst-tools.bat:

choco install -y --no-progress --version=1.0.0 visualstudio2019-workload-vctools --install-args="--add Microsoft.VisualStudio.Component.VC.CLI.Support"

cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Yet another fast GiST build

2021-01-12 Thread Heikki Linnakangas

On 10/12/2020 12:16, Andrey Borodin wrote:

On 9 Dec 2020, at 14:47, Andrey Borodin wrote:

On 7 Dec 2020, at 23:56, Peter Geoghegan wrote:

On Mon, Dec 7, 2020 at 2:05 AM Andrey Borodin  wrote:

Here's a version with tests and docs. I still have no idea how to print some 
useful information about tuple keys.


I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside 
gist_page_items(bytea) function... probably, I'll add gist_page_items(relname, 
blockno) overload to fetch keys.


PFA patch with implementation.


I did a bit of cleanup on the function signature. The .sql script 
claimed that gist_page_items() took bytea as argument, but in reality it 
was a relation name, as text. I changed it so that it takes a page image 
as argument, instead of reading the block straight from the index. 
Mainly to make it consistent with brin_page_items(), if it wasn't for 
that precedence I might've gone either way on it.


Fixed the docs accordingly, and ran pgindent. New patch version attached.

- Heikki
>From 791e45bcc86b205b372ea0679c8c036fff941f56 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Tue, 12 Jan 2021 15:43:09 +0200
Subject: [PATCH v4 1/1] Add functions to 'pageinspect' to inspect GiST
 indexes.

Author: Andrey Borodin and me
Discussion: https://www.postgresql.org/message-id/3E4F9093-A1B5-4DF8-A292-0B48692E3954%40yandex-team.ru
---
 contrib/pageinspect/Makefile  |   6 +-
 contrib/pageinspect/expected/gist.out | 152 ++
 contrib/pageinspect/gistfuncs.c   | 287 ++
 contrib/pageinspect/pageinspect--1.8--1.9.sql |  41 +++
 contrib/pageinspect/pageinspect.control   |   2 +-
 contrib/pageinspect/sql/gist.sql  |  23 ++
 doc/src/sgml/pageinspect.sgml |  89 ++
 7 files changed, 597 insertions(+), 3 deletions(-)
 create mode 100644 contrib/pageinspect/expected/gist.out
 create mode 100644 contrib/pageinspect/gistfuncs.c
 create mode 100644 contrib/pageinspect/pageinspect--1.8--1.9.sql
 create mode 100644 contrib/pageinspect/sql/gist.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index d9d8177116..4539f0aef7 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -7,19 +7,21 @@ OBJS = \
 	btreefuncs.o \
 	fsmfuncs.o \
 	ginfuncs.o \
+	gistfuncs.o \
 	hashfuncs.o \
 	heapfuncs.o \
 	rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.8--1.9.sql \
+	pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
 	pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
 	pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
 	pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
 	pageinspect--1.0--1.1.sql
 PGFILEDESC = "pageinspect - functions to inspect contents of database pages"
 
-REGRESS = page btree brin gin hash checksum
+REGRESS = page btree brin gin gist hash checksum
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pageinspect/expected/gist.out b/contrib/pageinspect/expected/gist.out
new file mode 100644
index 00..3c7e08b710
--- /dev/null
+++ b/contrib/pageinspect/expected/gist.out
@@ -0,0 +1,152 @@
+CREATE TABLE test_gist AS SELECT point(i,i) p, i::text t FROM
+generate_series(1,1000) i;
+CREATE INDEX test_gist_idx ON test_gist USING gist (p);
+-- Page 0 is the root, the rest are leaf pages
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 0));
+ lsn | nsn | rightlink  | flags 
+-+-++---
+ 0/1 | 0/0 | 4294967295 | {}
+(1 row)
+
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 1));
+ lsn | nsn | rightlink  | flags  
+-+-++
+ 0/1 | 0/0 | 4294967295 | {leaf}
+(1 row)
+
+SELECT * FROM gist_page_opaque_info(get_raw_page('test_gist_idx', 2));
+ lsn | nsn | rightlink | flags  
+-+-+---+
+ 0/1 | 0/0 | 1 | {leaf}
+(1 row)
+
+SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 0), 'test_gist_idx');
+ itemoffset |   ctid| itemlen |   keys
++---+-+---
+  1 | (1,65535) |  40 | (p)=((166,166))
+  2 | (2,65535) |  40 | (p)=((332,332))
+  3 | (3,65535) |  40 | (p)=((498,498))
+  4 | (4,65535) |  40 | (p)=((664,664))
+  5 | (5,65535) |  40 | (p)=((830,830))
+  6 | (6,65535) |  40 | (p)=((996,996))
+  7 | (7,65535) |  40 | (p)=((1000,1000))
+(7 rows)
+
+SELECT * FROM gist_page_items(get_raw_page('test_gist_idx', 1), 'test_gist_idx') LIMIT 5;
+ itemoffset | ctid  | itemlen |keys 
++---+-+-
+  1 | (0,1) |  40 | (p)=((1,1))
+  2 | (0,2) |  40 | (p)=((2,2))
+  3 | (0,3) |  40 | (p)=((3,3))
+

Re: Track replica origin progress for Rollback Prepared

2021-01-12 Thread Amit Kapila
On Tue, Jan 12, 2021 at 3:18 PM Ajin Cherian  wrote:
>
> On Wed, Jan 6, 2021 at 11:56 PM Amit Kapila  wrote:
>
>
> > Now, let us see how the tests mentioned by me cover this code. In the
> > first test (check that 2PC gets replicated to subscriber then ROLLBACK
> > PREPARED), we do below on publisher and wait for it to be applied on
> > the subscriber.
> > BEGIN;
> > INSERT INTO tab_full VALUES (12);
> > PREPARE TRANSACTION 'test_prepared_tab_full';
> > ROLLBACK PREPARED 'test_prepared_tab_full';
>
..
>
> Also restarted server2 and confirmed that the rollback prepared was
> not re-sent again.
>

Thanks for doing these tests. I think you can put an elog in the below
code change as well to show that the recovery code path is also hit:

+xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid,
+ XLogRecPtr lsn, RepOriginId origin_id)
 {
...
+ if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
+ {
+ /* recover apply progress */
+ replorigin_advance(origin_id, parsed->origin_lsn, lsn,
+false /* backward */, false /* WAL */);
+ }


-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2021-01-12 Thread Peter Smith
On Mon, Jan 11, 2021 at 3:32 PM Amit Kapila  wrote:
>
> On Fri, Jan 8, 2021 at 8:20 AM Amit Kapila  wrote:
> >
> > On Fri, Jan 8, 2021 at 7:14 AM Peter Smith  wrote:
> > >
> > > FYI, I was able to reproduce this case in debugger. PSA logs showing 
> > > details.
> > >
> >
> > Thanks for reproducing as I was worried about exactly this case. I
> > have one question related to logs:
> >
> > ##
> > ## ALTER SUBSCRIPTION to REFRESH the publication
> >
> > ## This blocks on some latch until the tablesync worker dies, then it 
> > continues
> > ##
> >
> > Did you check which exact latch or lock blocks this?
> >
>
> I have checked this myself and the command is waiting on the drop of
> origin till the tablesync worker is finished because replorigin_drop()
> requires state->acquired_by to be 0 which will only be true once the
> tablesync worker exits. I think this is the reason you might have
> noticed that the command can't be finished until the tablesync worker
> died. So this can't be an interlock between ALTER SUBSCRIPTION ..
> REFRESH command and tablesync worker and to that end it seems you have
> below Fixme's in the patch:

I have also seen this same blocking reason before in replorigin_drop().
However, back when I first tested/reproduced the refresh issue
[test-refresh], AlterSubscription_refresh was still the *original*
unchanged code, so at that time it did not have any replorigin_drop() in
it at all. In any case, in the latest code [v14] AlterSubscription
immediately stops the workers, so this question may be moot.

>
> + * FIXME - Usually this cleanup would be OK, but will not
> + * always be OK because the logicalrep_worker_stop_at_commit
> + * only "flags" the worker to be stopped in the near future
> + * but meanwhile it may still be running. In this case there
> + * could be a race between the tablesync worker and this code
> + * to see who will succeed with the tablesync drop (and the
> + * loser will ERROR).
> + *
> + * FIXME - Also, checking the state is also not guaranteed
> + * correct because state might be TCOPYDONE when we checked
> + * but has since progressed to SYNDONE
> + */
> +
> + if (state == SUBREL_STATE_TCOPYDONE)
> + {
>
> I feel this was okay for an earlier code but now we need to stop the
> tablesync workers before trying to drop the slot as we do in
> DropSubscription. Now, if we do that then that would fix the race
> conditions mentioned in Fixme but still, there are few more things I
> am worried about: (a) What if the launcher again starts the tablesync
> worker? One idea could be to acquire AccessExclusiveLock on
> SubscriptionRelationId as we do in DropSubscription which is not a
> very good idea but I can't think of any other good way. (b) the patch
> is just checking SUBREL_STATE_TCOPYDONE before dropping the
> replication slot but the slot could be created even before that (in
> SUBREL_STATE_DATASYNC state). One idea could be we can try to drop the
> slot and if we are not able to drop then we can simply continue
> assuming it didn't exist.

The code was modified in the latest patch [v14] along the lines suggested.

The workers for removed tables are now immediately stopped (like
DropSubscription does). Although I did include the AccessExclusiveLock
as (a) suggested, AFAIK this was actually ineffective at preventing
the workers relaunching. Instead, I am using
logicalrep_worker_stop_at_commit to do this - testing shows it as
working ok. Please see the code and latest test logs [v14] for
details.

Also, now the Drop/AlterSubscription will only give WARNING if unable
to drop slots, as per suggestion (b). This is also tested [v14].

>
> One minor comment:
> 1.
> + SpinLockAcquire(&MyLogicalRepWorker->relmutex);
>   MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE;
>   MyLogicalRepWorker->relstate_lsn = current_lsn;
> -
>
> Spurious line removal.

Fixed in latest patch [v14]


[v14] = 
https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com
[test-refresh] 
https://www.postgresql.org/message-id/CAHut%2BPv7YW7AyO_Q_nf9kzogcJcDFQNe8FBP6yXdzowMz3dY_Q%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Key management with tests

2021-01-12 Thread Masahiko Sawada
On Tue, Jan 12, 2021 at 11:09 AM Bruce Momjian  wrote:
>
> On Tue, Jan 12, 2021 at 09:32:54AM +0900, Masahiko Sawada wrote:
> > On Tue, Jan 12, 2021 at 3:23 AM Stephen Frost  wrote:
> > > Right, or ensure that the actual IV used is distinct (such as by using
> > > another bit in the IV to distinguish logged-vs-unlogged), but it seems
> > > saner to just use a different key, ultimately.
> >
> > Agreed.
> >
> > I think we also need to consider how to make sure nonce is unique when
> > making a page dirty by updating hint bits. Hint bit update changes the
> > page contents but doesn't change the page lsn if we already write a
> > full page write. In the PoC patch, I logged a dummy WAL record
> > (XLOG_NOOP) just to move the page lsn forward, but since this is
> > required even when changing the page is not the first time since the
> > last checkpoint we might end up logging too many dummy WAL records.
>
> This says:
>
> 
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
>
> wal_log_hints will be enabled automatically in encryption mode.
>
> Does that help?

IIUC it helps but is not enough. When wal_log_hints is enabled, we write
a full-page image when updating hint bits if it's the first change to
the page since the last checkpoint. But I'm concerned about what
happens if we change hint bits again after the page is flushed. We
would mark the page as dirtied but not write any WAL, leaving the page
LSN as it is.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Executing inet_server_addr/port() in parallel workers

2021-01-12 Thread Masahiko Sawada
Hi,

While investigating a customer issue, it turned out that if a
parallel worker executes inet_server_addr() or inet_server_port(), the
results are always null because MyProcPort is not set in parallel
workers. We can reproduce it in all supported versions higher than
9.6. Here is an example:

postgres=# select inet_server_addr(), inet_server_port();
 inet_server_addr | inet_server_port
--+--
 172.254.99.88| 5432
(1 row)

postgres=# select inet_client_addr(), inet_client_port();
 inet_client_addr | inet_client_port
--+--
 172.254.99.109   |31383
(1 row)

postgres=# set force_parallel_mode to on;
SET
postgres=# select inet_server_addr(), inet_server_port();
 inet_server_addr | inet_server_port
--+--
  |
(1 row)

postgres=# select inet_client_addr(), inet_client_port();
 inet_client_addr | inet_client_port
--+--
 172.254.99.109   |31383
(1 row)

To fix this issue, I think there are two options:

1. Inherit the local address information from the postmaster or the
leader in parallel workers.
2. Mark both inet_server_addr() and inet_server_port() parallel restricted.

Since auxiliary processes, including parallel workers, don't communicate
directly with the client, this information is basically not necessary
for them. Also, option #1 needs a code change to pass the information,
which brings complexity and is not good in terms of back-patching. So
I'd prefer option #2. I've attached a patch doing option #2.
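
For anyone who wants to see the effect without applying the patch, the
equivalent can be illustrated from SQL (illustration only; the attached
patch presumably changes proparallel for these functions in pg_proc.dat
rather than using ALTER FUNCTION):

ALTER FUNCTION inet_server_addr() PARALLEL RESTRICTED;
ALTER FUNCTION inet_server_port() PARALLEL RESTRICTED;

SET force_parallel_mode TO on;
SELECT inet_server_addr(), inet_server_port();  -- now evaluated in the leader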

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


fix_inet_server_addr_port.patch
Description: Binary data


Re: Get memory contexts of an arbitrary backend process

2021-01-12 Thread torikoshia

Attached is v7, which fixes recent conflicts.

For simplicity, it also changes the behavior of the requestor when
another requestor is already working: the v6 patch made the requestor
wait, whereas the v7 patch makes it quit.


Regards,

--
Atsushi Torikoshi

From f20e48d99f2770bfec275805185aa5ce08661fce Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 12 Jan 2021 20:55:43 +0900
Subject: [PATCH v7] After commit 3e98c0bafb28de, we can display the usage of
 the memory contexts using pg_backend_memory_contexts system view. However,
 its target is limited to the process attached to the current session. This
 patch introduces pg_get_target_backend_memory_contexts() and makes it
 possible to collect memory contexts of the specified process.

---
 src/backend/access/transam/xlog.c|   7 +
 src/backend/catalog/system_views.sql |   3 +-
 src/backend/postmaster/pgstat.c  |   3 +
 src/backend/replication/basebackup.c |   3 +
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/mcxtfuncs.c| 731 ++-
 src/backend/utils/init/globals.c |   1 +
 src/bin/initdb/initdb.c  |   3 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   4 +-
 src/bin/pg_rewind/filemap.c  |   3 +
 src/include/catalog/pg_proc.dat  |  12 +-
 src/include/miscadmin.h  |   1 +
 src/include/pgstat.h |   3 +-
 src/include/storage/procsignal.h |   1 +
 src/include/utils/mcxtfuncs.h|  44 ++
 18 files changed, 810 insertions(+), 21 deletions(-)
 create mode 100644 src/include/utils/mcxtfuncs.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ede93ad7fd..4cab47a61d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -73,6 +73,7 @@
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relmapper.h"
@@ -6993,6 +6994,12 @@ StartupXLOG(void)
 		 */
 		pgstat_reset_all();
 
+		/*
+		 * Reset dump files in pg_memusage, because target processes do
+		 * not exist any more.
+		 */
+		RemoveMemcxtFile(0);
+
 		/*
 		 * If there was a backup label file, it's done its job and the info
 		 * has now been propagated into pg_control.  We must get rid of the
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5d89e77dbe..7419c496b2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -558,7 +558,8 @@ CREATE VIEW pg_backend_memory_contexts AS
 SELECT * FROM pg_get_backend_memory_contexts();
 
 REVOKE ALL ON pg_backend_memory_contexts FROM PUBLIC;
-REVOKE EXECUTE ON FUNCTION pg_get_backend_memory_contexts() FROM PUBLIC;
+REVOKE EXECUTE ON FUNCTION pg_get_backend_memory_contexts FROM PUBLIC;
+REVOKE EXECUTE ON FUNCTION pg_get_target_backend_memory_contexts FROM PUBLIC;
 
 -- Statistics views
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f24a33ef1..8eb2d062b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4045,6 +4045,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_XACT_GROUP_UPDATE:
 			event_name = "XactGroupUpdate";
 			break;
+		case WAIT_EVENT_DUMP_MEMORY_CONTEXT:
+			event_name = "DumpMemoryContext";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635550..c67e71d79b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -184,6 +184,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Skip memory context dump files. */
+	"pg_memusage",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..18a1dd5a74 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -45,6 +45,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/snapmgr.h"
 
 /* GUCs */
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	McxtDumpShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 583efaecff..106e125cc2 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -28,6 +28,7 @@
 #include "storage/shmem.h"
 #include 

Re: Single transaction in the tablesync worker?

2021-01-12 Thread Peter Smith
On Sat, Jan 9, 2021 at 5:44 PM Amit Kapila  wrote:
>
> On Fri, Jan 8, 2021 at 2:55 PM Peter Smith  wrote:
> >
> > On Fri, Jan 8, 2021 at 1:02 PM Hou, Zhijie  
> > wrote:
> > >
> >
> > > 3.
> > > +   /*
> > > +* To build a slot name for the sync work, we are limited to 
> > > NAMEDATALEN -
> > > +* 1 characters.
> > > +*
> > > +* The name is calculated as pg_%u_sync_%u (3 + 10 + 6 + 10 + 
> > > '\0'). (It's
> > > +* actually the NAMEDATALEN on the remote that matters, but this 
> > > scheme
> > > +* will also work reasonably if that is different.)
> > > +*/
> > > +   StaticAssertStmt(NAMEDATALEN >= 32, "NAMEDATALEN too small");   
> > > /* for sanity */
> > > +
> > > +   syncslotname = psprintf("pg_%u_sync_%u", suboid, relid);
> > >
> > > The comments says syncslotname is limit to NAMEDATALEN - 1 characters.
> > > But the actual size of it is (3 + 10 + 6 + 10 + '\0') = 30,which seems 
> > > not NAMEDATALEN - 1.
> > > Should we change the comment here?
> > >
> >
> > The comment wording is a remnant from older code which had a
> > differently formatted slot name.
> > I think the comment is still valid, albeit maybe unnecessary since in
> > the current code the tablesync slot
> > name length is fixed. But I left the older comment here as a safety reminder
> > in case some future change would want to modify the slot name. What do
> > you think?
> >
>
> I find it quite confusing. The comments should reflect the latest
> code. You can probably say in some form that the length of slotname
> shouldn't exceed NAMEDATALEN because of remote node constraints on
> slot name length. Also, probably the StaticAssert on NAMEDATALEN is
> not required.

Modified comment in latest patch [v14]
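
As a quick sanity check of the arithmetic mentioned in that comment, a
throwaway standalone program (not part of the patch) confirms the
worst-case length:

#include <stdio.h>

int
main(void)
{
    char         name[64];                /* NAMEDATALEN on a default build */
    unsigned int suboid = 4294967295U;    /* largest possible OID */
    unsigned int relid = 4294967295U;
    int          len;

    len = snprintf(name, sizeof(name), "pg_%u_sync_%u", suboid, relid);
    printf("%s -> %d characters\n", name, len); /* 29, so 30 bytes with '\0' */
    return 0;
}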

>
> 1.
> +   
> +Additional table synchronization slots are normally transient, created
> +internally and dropped automatically when they are no longer needed.
> +These table synchronization slots have generated names:
> +   pg_%u_sync_%u (parameters:
> Subscription oid, Table
> relid)
> +   
>
> The last line seems too long. I think we are not strict for 80 char
> limit in docs but it is good to be close to that, however, this
> appears quite long.

Fixed in latest patch [v14]

>
> 2.
> + /*
> + * Cleanup any remaining tablesync resources.
> + */
> + {
> + char originname[NAMEDATALEN];
> + RepOriginId originid;
> + char state;
> + XLogRecPtr statelsn;
>
> I have already mentioned previously that let's not use this new style
> of code (start using { to localize the scope of variables). I don't
> know about others but I find it difficult to read such a code. You
> might want to consider moving this whole block to a separate function.
>

Removed extra code block in latest patch [v14]

> 3.
> /*
> + * XXX - Should optimize this to avoid multiple
> + * connect/disconnect.
> + */
> + wrconn = walrcv_connect(sub->conninfo, true, sub->name, &err);
>
> I think it is better to avoid multiple connect/disconnect here. In
> this same function, we have connected to the publisher, we should be
> able to use the same connection.
>

Fixed in latest patch [v14]

> 4.
> process_syncing_tables_for_sync()
> {
> ..
> + /*
> + * Cleanup the tablesync slot.
> + */
> + syncslotname = ReplicationSlotNameForTablesync(
> +MySubscription->oid,
> +MyLogicalRepWorker->relid);
> + PG_TRY();
> + {
> + elog(DEBUG1, "process_syncing_tables_for_sync: dropping the
> tablesync slot \"%s\".", syncslotname);
> + ReplicationSlotDropAtPubNode(wrconn, syncslotname);
> + }
> + PG_FINALLY();
> + {
> + pfree(syncslotname);
> + }
> + PG_END_TRY();
> ..
> }
>
> Both here and in DropSubscription(), it seems we are using
> PG_TRY..PG_FINALLY just to free the memory even though
> ReplicationSlotDropAtPubNode already has try..finally. Can we arrange
> code to move allocation of syncslotname inside
> ReplicationSlotDropAtPubNode to avoid additional try..finaly? BTW, if
> the usage of try..finally here is only to free the memory, I am not
> sure if it is required because I think we will anyway Reset the memory
> context where this memory is allocated as part of error handling.
>

Eliminated need for TRY/FINALLY to free syncslotname in latest patch [v14]

> 5.
>  #define SUBREL_STATE_DATASYNC 'd' /* data is being synchronized (sublsn
>   * NULL) */
> +#define SUBREL_STATE_TCOPYDONE 't' /* tablesync copy phase is completed
> + * (sublsn NULL) */
>  #define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of
>   * apply (sublsn set) */
>
> I am not very happy with the new state name SUBREL_STATE_TCOPYDONE as
> it is quite different from other adjoining state names and somehow not
> going well with the code. How about SUBREL_STATE_ENDCOPY 'e' or
> SUBREL_STATE_FINISHEDCOPY 'f'?
>

Using SUBREL_STATE_FINISHEDCOPY in latest patch [v14]
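
For reference, the new state presumably sits alongside the existing ones
in pg_subscription_rel.h, roughly like this (a sketch only; the exact
letter and comment wording are whatever the patch uses):

#define SUBREL_STATE_DATASYNC 'd'     /* data is being synchronized (sublsn NULL) */
#define SUBREL_STATE_FINISHEDCOPY 'f' /* tablesync copy phase is completed (sublsn NULL) */
#define SUBREL_STATE_SYNCDONE 's'     /* synchronization finished in front of apply (sublsn set) */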

---
[v14] = 
https://www.postgresql.org/message-id/CAHut%2BPsPO2vOp%2BP7U2szROMy15PKKGanhUrCYQ0ffpy9zG1V1A%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu 

Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread japin


On Tue, 12 Jan 2021 at 19:32, Bharath Rupireddy wrote:
> On Tue, Jan 12, 2021 at 4:47 PM Li Japin  wrote:
>> IIUC logical replication should only replicate the tables in the
>> publication; I think tables that aren't in the publication should not
>> be replicated.
>>
>> Attached the patch that fixes it.  Thought?
>
> With that change, we don't get the behaviour that's stated in the
> document - "The ADD TABLE and DROP TABLE clauses will add and remove
> one or more tables from the publication. Note that adding tables to a
> publication that is already subscribed to will require a ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
> order to become effective" -
> https://www.postgresql.org/docs/devel/sql-alterpublication.html.
>

The documentation only emphasizes adding tables to a publication; it
does not cover dropping tables from a publication.

> The publisher stops sending the tuples whenever the relation gets
> dropped from the publication, not waiting until alter subscription ...
> refresh publication on the subscriber.
>

If we want to wait for the subscriber to execute ALTER SUBSCRIPTION ...
REFRESH PUBLICATION, maybe we should send some feedback to the
walsender.  How can we send this feedback to the walsender from a
non-walreceiver process?

> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com


-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Added schema level support for publication.

2021-01-12 Thread Rahila Syed
Hi Vignesh,

I had a look at the patch; please consider the following comments.

On Thu, Jan 7, 2021 at 10:03 PM vignesh C  wrote:

> Hi,
>
> This feature adds schema option while creating publication. Users will
> be able to specify one or more schemas while creating publication,
> when the user specifies schema option, then the data changes for the
> tables present in the schema specified by the user will be replicated
> to the subscriber. Few examples have been listed below:
>
> Create a publication that publishes all changes for all the tables
> present in production schema:
> CREATE PUBLICATION production_publication FOR ALL TABLES SCHEMA production;
>
Should it be FOR TABLES IN SCHEMA instead of FOR ALL TABLES SCHEMA?


> Create a publication that publishes all changes for all the tables
> present in marketing and sales schemas:
> CREATE PUBLICATION sales_publication FOR ALL TABLES SCHEMA marketing,
> sales;
>
> Add some schemas to the publication:
> ALTER PUBLICATION sales_publication ADD SCHEMA marketing_june, sales_june;
>
As per the current implementation, this command fails even if just one of
the schemas does not exist. I think this is counterintuitive; it should
throw a warning and continue adding the rest.


> Drop some schema from the publication:
> ALTER PUBLICATION production_quarterly_publication DROP SCHEMA
> production_july;
>
Same for DROP SCHEMA: if one of these schemas does not exist in the
publication, the entire DROP operation is aborted.

Thank you,
Rahila Syed


Re: Is it useful to record whether plans are generic or custom?

2021-01-12 Thread torikoshia

 wrote in



ISTM now that creating pg_stat_statements_xxx views
both for generic and custom plans is better than my PoC patch.


On second thought, that also makes pg_stat_statements too complicated
compared to what it makes possible.

I'm also worried that whether or not to record generic and custom plan
execution times would be controlled by a GUC variable, and that the
default would be not to record them. Not many people will change the
default.

Since the same queryid can cover various queries (different plans,
different parameters $n, etc.), I also started to feel that it is not
appropriate to track the execution times of generic and custom plans
separately.

I suppose it is normal practice to store past results of
pg_stat_statements for future comparisons. If that is the case, I think
that just adding the number of generic plan executions will give us a
hint for noticing performance degradation caused by a plan change
between generic and custom.

For example, if there is a clear difference in the number of times the
generic plan was executed before and after a performance degradation,
as below, it would be natural to check whether there is a problem with
the generic plan.

  [after performance degradation]
  =# SELECT query, calls, generic_calls FROM pg_stat_statements where 
query like '%t1%';

  query| calls | generic_calls
  -+---+---
   PREPARE p1 as select * from t1 where i = $1 |  1100 |50

  [before performance degradation]
  =# SELECT query, calls, generic_calls FROM pg_stat_statements where 
query like '%t1%';

  query| calls | generic_calls
  -+---+---
   PREPARE p1 as select * from t1 where i = $1 |  1000 | 0


Attached a patch that just adds a generic call counter to
pg_stat_statements.

Any thoughts?
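
In case it helps reviewers try this out, one way to drive up the new
counter (assuming the patched pg_stat_statements is installed and a table
t1 with a column i exists; plan_cache_mode is available since v12):

SET plan_cache_mode = force_generic_plan;
PREPARE p1 AS SELECT * FROM t1 WHERE i = $1;
EXECUTE p1(1);
EXECUTE p1(2);
SELECT query, calls, generic_calls FROM pg_stat_statements
 WHERE query LIKE '%t1%';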


Regards,

--
Atsushi Torikoshi

diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
index 0f63f08f7e..7fdef315ae 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql
@@ -44,7 +44,8 @@ CREATE FUNCTION pg_stat_statements(IN showtext boolean,
 OUT blk_write_time float8,
 OUT wal_records int8,
 OUT wal_fpi int8,
-OUT wal_bytes numeric
+OUT wal_bytes numeric,
+OUT generic_calls int8
 )
 RETURNS SETOF record
 AS 'MODULE_PATHNAME', 'pg_stat_statements_1_8'
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 72a117fc19..171c39f857 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -77,10 +77,12 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/spin.h"
+#include "tcop/pquery.h"
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/plancache.h"
 #include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
@@ -192,6 +194,7 @@ typedef struct Counters
 	int64		wal_records;	/* # of WAL records generated */
 	int64		wal_fpi;		/* # of WAL full page images generated */
 	uint64		wal_bytes;		/* total amount of WAL generated in bytes */
+	int64		generic_calls;	/* # of times generic plans executed */
 } Counters;
 
 /*
@@ -277,6 +280,9 @@ static int	exec_nested_level = 0;
 /* Current nesting depth of planner calls */
 static int	plan_nested_level = 0;
 
+/* Current plan type */
+static bool	is_plan_type_generic = false;
+
 /* Saved hook values in case of unload */
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
 static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;
@@ -1034,6 +1040,20 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 */
 	if (pgss_enabled(exec_nested_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
+		/*
+		 * Since ActivePortal is not available at ExecutorEnd, we preserve
+		 * the plan type here.
+		 */
+		Assert(ActivePortal);
+
+		if (ActivePortal->cplan)
+		{
+			if (ActivePortal->cplan->is_generic)
+is_plan_type_generic = true;
+			else
+is_plan_type_generic = false;
+		}
+
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
 		 * space is allocated in the per-query context so it will go away at
@@ -1427,6 +1447,8 @@ pgss_store(const char *query, uint64 queryId,
 			e->counters.max_time[kind] = total_time;
 			e->counters.mean_time[kind] = total_time;
 		}
+		else if (kind == PGSS_EXEC && is_plan_type_generic)
+			e->counters.generic_calls += 1;
 		else
 		{
 			/*
@@ -1510,8 +1532,8 @@ pg_stat_statements_reset(PG_FUNCTION_ARGS)
 #define PG_STAT_STATEMENTS_COLS_V1_1	18
 #define PG_STAT_STATEMENTS_COLS_V1_2	19
 #define PG_STAT_STATEMENTS_COLS_V1_3	23
-#define 

Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Bharath Rupireddy
On Tue, Jan 12, 2021 at 4:47 PM Li Japin  wrote:
> IIUC logical replication should only replicate the tables in the
> publication; I think tables that aren't in the publication should not be
> replicated.
>
> Attached the patch that fixes it.  Thought?

With that change, we don't get the behaviour that's stated in the
document - "The ADD TABLE and DROP TABLE clauses will add and remove
one or more tables from the publication. Note that adding tables to a
publication that is already subscribed to will require a ALTER
SUBSCRIPTION ... REFRESH PUBLICATION action on the subscribing side in
order to become effective" -
https://www.postgresql.org/docs/devel/sql-alterpublication.html.

The publisher stops sending the tuples whenever the relation gets
dropped from the publication, not waiting until alter subscription ...
refresh publication on the subscriber.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Bharath Rupireddy
On Tue, Jan 12, 2021 at 12:06 PM Amit Kapila 
wrote:
> > Here's my analysis:
> > 1) in the publisher, alter publication drop table successfully
> > removes(PublicationDropTables) the table from the catalogue
> > pg_publication_rel
> > 2) in the subscriber, alter subscription refresh publication
> > successfully removes the table from the catalogue pg_subscription_rel
> > (AlterSubscription_refresh->RemoveSubscriptionRel)
> > so far so good
> >
>
> Here, it should register the worker to stop on commit, and then on
> commit it should call AtEOXact_ApplyLauncher to stop the apply worker.
> Once the apply worker is stopped, the corresponding WALSender will
> also be stopped. Something here is not happening as per expected
> behavior.

On the subscriber, an entry to stop the worker is created in
AlterSubscription_refresh --> logicalrep_worker_stop_at_commit. At the end
of the txn, in AtEOXact_ApplyLauncher, we try to stop that worker, but it
cannot be stopped because logicalrep_worker_find returns null
(AtEOXact_ApplyLauncher --> logicalrep_worker_stop -->
logicalrep_worker_find). The worker entry for that subscription has relid
0 [1], so the following if condition is never hit, and the apply worker
for the subscription on which REFRESH PUBLICATION was run is not stopped.
It looks like relid 0 is valid because, per the comment in the
LogicalRepWorker structure, relid is only applicable during the table sync
phase.

Also, I think expecting the apply worker to be stopped this way doesn't
make sense, because the apply worker works on a per-subscription basis,
and the subscription can have other tables too.

/* Search for attached worker for a given subscription id. */
for (i = 0; i < max_logical_replication_workers; i++)
{
LogicalRepWorker *w = &LogicalRepCtx->workers[i];

if (w->in_use && w->subid == subid && w->relid == relid &&
(!only_running || w->proc))
{
res = w;
break;
}
}

[1]
(gdb) p subid
$5 = 16391
(gdb) p relid
$6 = 16388
(gdb) p *w
$4 = {launch_time = 663760343317760, in_use = true, generation = 1, proc =
0x7fdfd9a7cc90,
  dbid = 12872, userid = 10, *subid = 16391*,* relid = 0*, relstate = 0
'\000', relstate_lsn = 0,
  relmutex = 0 '\000', last_lsn = 22798424, last_send_time =
663760483945980,
  last_recv_time = 663760483946087, reply_lsn = 22798424, reply_time =
663760483945980}

postgres=# select * from pg_stat_get_subscription(16391);
 subid | relid |  pid   | received_lsn |last_msg_send_time|
 last_msg_receipt_time   | latest_end_lsn | latest_end_time

---+---++--+--+--++--
 16391 |   | 466779 | 0/15BE140| 2021-01-12 15:26:48.778813+05:30 |
2021-01-12 15:26:48.778878+05:30 | 0/15BE140  | 2021-01-12
15:26:48.778813+05:30
(1 row)

> > 3) after the insertion into the table in the publisher(remember that
> > it's dropped from the publication in (1)), the walsender process is
> > unable detect that the table has been dropped from the publication
> > i.e. it doesn't look at the pg_publication_rel catalogue or some
> > other, but it only does is_publishable_relation() check which returns
> > true in pgoutput_change(). Maybe the walsender should look at the
> > catalogue pg_publication_rel in is_publishable_relation()?
> >
>
> We must be somewhere checking pg_publication_rel before sending the
> decoded change because otherwise, we would have sent the changes for
> the table which are not even part of this publication. I think you can
> try to create a separate table that is not part of the publication
> under test and see how the changes for that are filtered.

As pointed out by japin in the next thread, the walsender process on the
publisher uses RelationSyncCache to hold the relations whose changes need
to be sent to the subscriber. The RelationSyncCache gets created during
startup of the walsender process (pgoutput_startup ->
init_rel_sync_cache), and the rel_sync_cache_publication_cb callback gets
registered. So, whenever the pg_publication_rel catalog table is altered,
rel_sync_cache_publication_cb is called and all the entries in the
RelationSyncCache are marked invalid, with the expectation that on the
next use of any cache entry in get_rel_sync_entry (pgoutput_change ->
get_rel_sync_entry) that entry is validated again.

In this use case, the invalidation happens as expected in
rel_sync_cache_publication_cb, but while revalidating the entry in
get_rel_sync_entry, since the given relation was attached to only one
publication, pubids is now empty and we never set
entry->pubactions.pubinsert/update/delete/truncate back to false, so the
publisher keeps publishing the inserts for the relation.

/* Validate the entry */
if (!entry->replicate_valid)
{
List   *pubids = 
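
For context, the kind of adjustment being discussed could look roughly
like the following inside get_rel_sync_entry() (a sketch only, not a
patch from this thread): reset the pubactions while revalidating, so a
relation that is no longer in any publication stops being published.

    if (!entry->replicate_valid)
    {
        List       *pubids = GetRelationPublications(relid);

        /*
         * Sketch: start from "publish nothing" and let the loop over the
         * publications turn the relevant actions back on; if the relation
         * was dropped from its only publication, pubids is NIL and
         * everything stays false.
         */
        entry->pubactions.pubinsert = false;
        entry->pubactions.pubupdate = false;
        entry->pubactions.pubdelete = false;
        entry->pubactions.pubtruncate = false;

        /* ... existing loop over the publications sets the flags ... */
    }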

Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-12 Thread Li Japin

On Jan 12, 2021, at 5:47 PM, japin  wrote:


On Tue, 12 Jan 2021 at 14:38, Amit Kapila wrote:
On Tue, Jan 12, 2021 at 11:39 AM Bharath Rupireddy  wrote:

On Tue, Jan 12, 2021 at 9:05 AM Amit Kapila  wrote:

On Mon, Jan 11, 2021 at 6:51 PM Bharath Rupireddy  wrote:

Hi,

While providing thoughts on the design in [1], I found a strange
behaviour with the $subject. The use case is shown below as a sequence
of steps that need to be run on publisher and subscriber to arrive at
the strange behaviour.  In step 5, the table is dropped from the
publication and in step 6, the refresh publication is run on the
subscriber, from here onwards, the expectation is that no further
inserts into the publisher table have to be replicated on to the
subscriber, but the opposite happens i.e. the inserts are still
replicated to the subscriber. ISTM as a bug. Let me know if I'm
missing anything.


Did you try to investigate what's going on? Can you please check what
is the behavior if, after step-5, you restart the subscriber and
separately try creating a new subscription (maybe on a different
server) for that publication after step-5 and see if that allows the
relation to be replicated? AFAIU, in AlterSubscription_refresh, we
remove such dropped rels and stop their corresponding apply workers
which should stop the further replication of such relations but that
doesn't seem to be happening in your case.

Here's my analysis:
1) in the publisher, alter publication drop table successfully
removes(PublicationDropTables) the table from the catalogue
pg_publication_rel
2) in the subscriber, alter subscription refresh publication
successfully removes the table from the catalogue pg_subscription_rel
(AlterSubscription_refresh->RemoveSubscriptionRel)
so far so good


Here, it should register the worker to stop on commit, and then on
commit it should call AtEOXact_ApplyLauncher to stop the apply worker.
Once the apply worker is stopped, the corresponding WALSender will
also be stopped. Something here is not happening as per expected
behavior.

3) after the insertion into the table in the publisher(remember that
it's dropped from the publication in (1)), the walsender process is
unable detect that the table has been dropped from the publication
i.e. it doesn't look at the pg_publication_rel catalogue or some
other, but it only does is_publishable_relation() check which returns
true in pgoutput_change(). Maybe the walsender should look at the
catalogue pg_publication_rel in is_publishable_relation()?


We must be somewhere checking pg_publication_rel before sending the
decoded change because otherwise, we would have sent the changes for
the table which are not even part of this publication. I think you can
try to create a separate table that is not part of the publication
under test and see how the changes for that are filtered.

I find that pgoutput_change() uses a hash table, RelationSyncCache, to
cache the publication info for tables.  When we drop tables from the
publication, the RelationSyncCache isn't updated, so it still replicates
the records.


IIUC logical replication should only replicate the tables in the
publication; I think tables that aren't in the publication should not be
replicated.

Attached the patch that fixes it.  Thought?

--
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.






alter-publication-drop-table.patch
Description: alter-publication-drop-table.patch


Re: list of extended statistics on psql

2021-01-12 Thread Tomas Vondra



On 1/12/21 2:57 AM, Tatsuro Yamada wrote:

Hi Tomas,

On 2021/01/09 9:01, Tomas Vondra wrote:

...>

While working on that, I realized that 'defined' might be a bit
ambiguous, I initially thought it means 'NOT NULL' (which it does not).
I propose to change it to 'requested' instead. Tatsuro, do you agree, or
do you think 'defined' is better?


Regarding the status of extended stats, I think the following:

  - "defined": it only shows that the extended stats are defined. We can't
    know whether they still need to be analyzed or not. I agree this name
    was ambiguous, so we should replace it with a more suitable name.
  - "requested": it shows that the extended stats still need something. Of
    course, we know it means ANALYZE because we created the patch.
    However, I feel there is a little ambiguity for a DBA. To solve this,
    it would be better to write an explanation of the status in the
    document. For example:

==
The column for each kind of extended stats (e.g. Ndistinct) shows its
status. "requested" means that it needs data to be gathered by ANALYZE.
"built" means ANALYZE was finished and the planner can use the statistics.
NULL means that the statistics kind doesn't exist.
==

What do you think? :-D



Yes, that seems reasonable to me. Will you provide an updated patch?
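
For what it's worth, one way such a per-kind status could be derived from
the catalogs is sketched below (a rough illustration only, assuming the
user can read pg_statistic_ext_data; the actual \dX query may well look
different):

SELECT es.stxname,
       CASE WHEN esd.stxdndistinct IS NOT NULL THEN 'built'
            WHEN 'd' = ANY (es.stxkind)        THEN 'requested'
       END AS "Ndistinct"
  FROM pg_statistic_ext es
  JOIN pg_statistic_ext_data esd ON esd.stxoid = es.oid;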


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Fix a typo in xlogreader.c

2021-01-12 Thread Masahiko Sawada
Hi,

The function comment of RestoreBlockImage() seems incorrect, since the
function returns a boolean, not the buffer number.

 /*
  * Restore a full-page image from a backup block attached to an XLOG record.
  *
  * Returns the buffer number containing the page.
  */
 bool
 RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)

Attached the patch that fixes it.
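
Presumably the fix just rewords that comment along these lines (my guess
at the wording, not the contents of the attached patch):

 /*
  * Restore a full-page image from a backup block attached to an XLOG record.
  *
  * Returns true if a full-page image is restored into *page.
  */
 bool
 RestoreBlockImage(XLogReaderState *record, uint8 block_id, char *page)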

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/


fix_typo_xlogreader.patch
Description: Binary data


Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-12 Thread Thomas Munro
On Tue, Jan 12, 2021 at 8:27 PM Michael Paquier  wrote:
> Yeah, this rings a bell.  I never went back to it even if the thing
> looks rather clean at quick glance (not tested), but I may be able
> to spend some cycles on that.  I don't think that's critical enough
> for a backpatch, so doing something only on HEAD is fine by me.
> What's your take?

I haven't heard any user complaints, and I'd personally be happy with
a fix on master only.




Re: Implementing Incremental View Maintenance

2021-01-12 Thread Yugo NAGATA
Hi,

Attached is the revised patch (v21) to add support for Incremental
Materialized View Maintenance (IVM).

In addition to fixing some typos from the previous version, this revision
fixes a check that prevents a view from containing an expression over
aggregates, such as sum(x)/sum(y).

Regards,
Yugo Nagata

On Tue, 22 Dec 2020 22:24:22 +0900
Yugo NAGATA  wrote:

> Hi hackers,
> 
> I heard the opinion that this patch is too big and hard to review.
> So, I wonder whether we should downsize the patch by eliminating some
> features and leaving the other basic features.
> 
> If there are more opinions that this makes it easier for reviewers to look
> at this patch, I would like to do it. If so, we plan to support only
> selection, projection, inner-join, and some aggregates in the first
> release and leave sub-queries, outer-join, and CTE supports to the
> next release.
> 
> Regards,
> Yugo Nagata
> 
> On Tue, 22 Dec 2020 21:51:36 +0900
> Yugo NAGATA  wrote:
> > Hi,
> > 
> > Attached is the revised patch (v20) to add support for Incremental
> > Materialized View Maintenance (IVM).
> > 
> > In accordance with Konstantin's suggestion, I made a few optimizations.
> > 
> > 1. Creating an index on the matview automatically
> > 
> > When creating incremental maintainable materialized view (IMMV)s,
> > a unique index on IMMV is created automatically if possible.
> > 
> > If the view definition query has a GROUP BY clause, the index is created
> > on the columns of GROUP BY expressions. Otherwise, if the view contains
> > all primary key attributes of its base tables in the target list, the index
> > is created on these attributes.  Also, if the view has DISTINCT,
> > a unique index is created on all columns in the target list.
> > In other cases, no index is created.
> > 
> > In all cases, a NOTICE message is output to inform users that an index is
> > created or that an appropriate index is necessary for efficient IVM.
> > 
> > 2. Use a weaker lock on the matview if possible
> > 
> > If the view has only one base table in its query, RowExclusiveLock is
> > held on the view instead of AccessExclusiveLock, because we don't
> > need to wait for other concurrent transactions' results in order to
> > maintain the view in this case. When the same row in the view is
> > affected by concurrent maintenance, a row-level lock will
> > protect it.
> > 
> > On Tue, 24 Nov 2020 12:46:57 +0300
> > Konstantin Knizhnik  wrote:
> > 
> > > The most obvious optimization is not to use an exclusive table lock if the view
> > > depends on just one table (contains no joins).
> > > It looks like there aren't any anomalies in this case, are there?
> > 
> > I confirmed the effect of this optimizations.
> > 
> > First, when I performed pgbench (SF=100) without any materialized views,
> > the results is :
> >  
> >  pgbench test4 -T 300 -c 8 -j 4
> >  latency average = 6.493 ms
> >  tps = 1232.146229 (including connections establishing)
> > 
> > Next, created a view as below, I performed the same pgbench.
> >  CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm2 AS
> > SELECT bid, count(abalance), sum(abalance), avg(abalance)
> > FROM pgbench_accounts GROUP BY bid;
> > 
> > The result is here:
> > 
> > [the previous version (v19 with exclusive table lock)]
> >  - latency average = 77.677 ms
> >  - tps = 102.990159 (including connections establishing)
> > 
> > [In the latest version (v20 with weaker lock)]
> >  - latency average = 17.576 ms
> >  - tps = 455.159644 (including connections establishing)
> > 
> > There is still substantial overhead, but we can see the effect
> > of the optimization.
> > 
> > Regards,
> > Yugo Nagata
> > 
> > -- 
> > Yugo NAGATA 
> 
> 
> -- 
> Yugo NAGATA 
> 
> 


-- 
Yugo NAGATA 


IVM_patches_v21.tar.gz
Description: application/gzip


  1   2   >