Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.

2020-09-23 Thread Amit Kapila
On Tue, Sep 22, 2020 at 5:15 PM Dilip Kumar  wrote:
>
> On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila  wrote:
> >
> >
> > I am not sure if this suggestion makes it better than what is proposed
> > by Dilip, but I think we can declare them in order of their version
> > numbers, like below:
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>
> Done this way.
>

- options.proto.logical.proto_version = LOGICALREP_PROTO_VERSION_NUM;
+ options.proto.logical.proto_version = MySubscription->stream ?
+ LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;

Here, instead of using MySubscription->stream, I think we should use the
server/walrcv version number, as we already do in one place in tablesync.c.
Say we decide to add something additional for decoding of prepared xacts in
PG14: this check would then need to be extended to cover both stream and
prepared, whereas a check on the server version number would not need to
change.
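
To illustrate, the check could be keyed off the remote server version as
reported by the walreceiver, something like the fragment below (just a
sketch, not necessarily the exact form; the connection variable name in
worker.c may differ):

	int			server_version = walrcv_server_version(wrconn);

	options.proto.logical.proto_version = server_version >= 140000 ?
		LOGICALREP_PROTO_STREAM_VERSION_NUM : LOGICALREP_PROTO_VERSION_NUM;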

-- 
With Regards,
Amit Kapila.




Re: Add information to rm_redo_error_callback()

2020-09-23 Thread Michael Paquier
On Mon, Aug 17, 2020 at 05:47:13PM +0200, Drouvot, Bertrand wrote:
> I think it's good to guarantee that we'll always see the same piece of
> information (should a new RM desc() be created in the future for example),
> even if it could lead to some information overlap in some cases.

> I am ok too, but I am also not sure that errcontext is the right place for
> that.

Hmm.  I still think that knowing at least about an FPW could be an
interesting piece of information even here.  Anyway, instead of
copying logic that already exists in xlog_outrec(), why not move
the block-information printing into a separate routine outside the
WAL_DEBUG section, and just reuse the same format for the context of
the redo error callback?  That would also be more consistent with what
we do in pg_waldump, where we don't show the fork name of a block when
it is on the MAIN_FORKNUM.  And this would avoid a third copy of the
same logic.  If we added the XID, previous LSN and the record length
to what is printed, we could just reuse the existing routine, though
that's perhaps too much information to display.
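
For instance, the shared helper could look roughly like this (a sketch with
an invented name, not actual committed code); xlog_outrec() and the redo
error callback would then both call it:

static void
xlog_block_info(StringInfo buf, XLogReaderState *record)
{
	int			block_id;

	/* Print per-block information in the same style as pg_waldump. */
	for (block_id = 0; block_id <= record->max_block_id; block_id++)
	{
		RelFileNode rnode;
		ForkNumber	forknum;
		BlockNumber blk;

		if (!XLogRecHasBlockRef(record, block_id))
			continue;
		XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blk);

		if (forknum != MAIN_FORKNUM)
			appendStringInfo(buf, "; blkref #%u: rel %u/%u/%u, fork %s, blk %u",
							 block_id, rnode.spcNode, rnode.dbNode,
							 rnode.relNode, forkNames[forknum], blk);
		else
			appendStringInfo(buf, "; blkref #%u: rel %u/%u/%u, blk %u",
							 block_id, rnode.spcNode, rnode.dbNode,
							 rnode.relNode, blk);
	}
}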
--
Michael




Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2020-09-23 Thread Thomas Munro
On Thu, Sep 24, 2020 at 2:39 AM Fujii Masao  wrote:
> Does this patch work fine with warm-standby case using pg_standby?
> IIUC the startup process doesn't call WaitLatch() in that case, so ISTM that,
> with the patch, it cannot detect the postmaster death immediately.

Right, RestoreArchivedFile() uses system(), so I guess it can hang
around for a long time after unexpected postmaster exit on every OS if
the command waits.  To respond to various kinds of important
interrupts, I suppose that'd ideally use something like
OpenPipeStream() and a typical WaitLatch() loop with CFI().  I'm not
sure what our policy is or should be for exiting while we have running
subprocesses.  I guess that is a separate issue.
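
For what it's worth, a very rough sketch of that direction (not part of the
attached patch; error and exit-status handling omitted, and a real version
would presumably wait on the pipe with a latch rather than block in fgets()):

	FILE	   *pipe;
	char		buf[1024];

	pipe = OpenPipeStream(xlogRestoreCmd, "r");
	if (pipe == NULL)
		ereport(ERROR,
				(errmsg("could not execute command \"%s\": %m", xlogRestoreCmd)));

	while (fgets(buf, sizeof(buf), pipe) != NULL)
	{
		/* react to postmaster death, SIGTERM, etc. between reads */
		CHECK_FOR_INTERRUPTS();
	}

	(void) ClosePipeStream(pipe);

(xlogRestoreCmd here stands for the command string RestoreArchivedFile()
already builds.)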

Here's a rebase, no code change.
From 2cd588ecc72729930a06b20da65537dfcb8b2f52 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 24 Sep 2020 17:37:54 +1200
Subject: [PATCH v4] Poll postmaster less frequently in recovery.

Since commits 9f095299 and f98b8476 we don't poll the postmaster
pipe at all during crash recovery on Linux and FreeBSD, but on other
operating systems we were still doing it for every WAL record.  Do it
less frequently on operating systems where system calls are required, at
the cost of delaying exit a bit after postmaster death, to avoid
expensive system calls reported to slow down CPU-bound recovery by
as much as 10-30%.

Replace a couple of pg_usleep()-based wait loops with WaitLatch() loops,
to make sure they respond promptly to postmaster death now that
HandleStartupProcInterrupts() doesn't check every time through on some
systems.

Reviewed-by: Heikki Linnakangas 
Reviewed-by: Fujii Masao 
Discussion: https://postgr.es/m/CA%2BhUKGK1607VmtrDUHQXrsooU%3Dap4g4R2yaoByWOOA3m8xevUQ%40mail.gmail.com
Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083...@iki.fi
---
 doc/src/sgml/monitoring.sgml   |  4 
 src/backend/access/transam/xlog.c  |  7 +++
 src/backend/postmaster/pgstat.c|  3 +++
 src/backend/postmaster/startup.c   | 16 ++--
 src/backend/replication/walreceiverfuncs.c |  6 --
 src/include/pgstat.h   |  1 +
 6 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e0193a967..65210ee064 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1734,6 +1734,10 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   Waiting for confirmation from a remote server during synchronous
replication.
  
+ 
+  WalrcvExit
+  Waiting for the walreceiver to exit.
+ 
  
   XactGroupUpdate
   Waiting for the group leader to update transaction status at
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..f9d9b38a8a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5999,7 +5999,7 @@ recoveryStopsAfter(XLogReaderState *record)
  * the paused state starts at the end of recovery because of
  * recovery_target_action=pause, and false otherwise.
  *
- * XXX Could also be done with shared latch, avoiding the pg_usleep loop.
+ * XXX Could also be done with shared latch, avoiding the WL_TIMEOUT loop.
  * Probably not worth the trouble though.  This state shouldn't be one that
  * anyone cares about server power consumption in.
  */
@@ -6028,9 +6028,8 @@ recoveryPausesHere(bool endOfRecovery)
 		HandleStartupProcInterrupts();
 		if (CheckForStandbyTrigger())
 			return;
-		pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
-		pg_usleep(100L);	/* 1000 ms */
-		pgstat_report_wait_end();
+		(void) WaitLatch(NULL, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, 1000,
+		 WAIT_EVENT_RECOVERY_PAUSE);
 	}
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index e6be2b7836..395d61f082 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3875,6 +3875,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_WALRCV_EXIT:
+			event_name = "WalrcvExit";
+			break;
 		case WAIT_EVENT_XACT_GROUP_UPDATE:
 			event_name = "XactGroupUpdate";
 			break;
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index 64af7b8707..a802041ca7 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -134,6 +134,10 @@ StartupRereadConfig(void)
 void
 HandleStartupProcInterrupts(void)
 {
+#ifndef USE_POSTMASTER_DEATH_SIGNAL
+	static int	count = 0;
+#endif
+
 	/*
 	 * Process any requests or signals received recently.
 	 */
@@ -151,9 +155,17 @@ HandleStartupProcInterrupts(void)
 
 	/*
 	 * Emergency bailout if postmaster has died.  This is to avoid the
-	 * necessity for manual cleanup of all postmaster children.
+	 * necessity for manual cleanup of all postmaster children.  Do this less
+	 *

Re: problem with RETURNING and update row movement

2020-09-23 Thread Amit Langote
On Thu, Sep 24, 2020 at 4:25 AM Etsuro Fujita  wrote:
> On Wed, Sep 23, 2020 at 10:12 PM Amit Langote  wrote:
> > On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita  
> > wrote:
> > > On Mon, Sep 14, 2020 at 10:45 PM Amit Langote  
> > > wrote:
> > > > Although, I am still not sure how to
> > > > "copy" system columns from one partition's slot to another's, that is,
> > > > without assuming what they are.
> > >
> > > I just thought we assume that partitions support all the existing
> > > system attributes until we have the fix for (a), i.e., the slot
> > > assigned for a partition must have the getsysattr callback routine
> > > from which we can fetch each existing system attribute of an underlying
> > > tuple in the slot, regardless of whether that system attribute is used
> > > for the partition or not.
> >
> > Hmm, to copy one slot's system attributes into another, we will also
> > need a way to "set" the system attributes in the destination slot.
> > But maybe I didn't fully understand what you said.
>
> Sorry, my thought was vague.  To store xmin/xmax/cmin/cmax into a
> given slot, we need to extend the TupleTableSlot struct to contain
> these attributes as well?  Or we need to introduce a new callback
> routine for that (say, setsysattr)?  These would not be
> back-patchable, though.

Yeah, I'd think so too.
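
For concreteness, the callback being discussed could simply mirror the
existing getsysattr member of TupleTableSlotOps, e.g. (purely hypothetical,
not from any posted patch):

	/* existing callback: fetch a system attribute of the slot's tuple */
	Datum		(*getsysattr) (TupleTableSlot *slot, int attnum, bool *isnull);

	/* proposed counterpart: store a system attribute into the slot's tuple */
	void		(*setsysattr) (TupleTableSlot *slot, int attnum,
							   Datum value, bool isnull);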

BTW, the discussion so far on the other thread is oblivious to the
issue being discussed here, where we need to find a way to transfer
system attributes between a pair of partitions that are possibly
incompatible with each other in terms of what set of system attributes
they support.  Although, if we prevent accessing system attributes
when performing the operation on partitioned tables, like what you
seem to propose below, then we wouldn't really have that problem.

> > BTW, do you think we should alter the test in PG 11 branch to test
> > that system attributes are returned correctly?
>
> Yeah, I think so.  I didn’t come up with any good test cases for that, though.
>
> > Once we settle the
> > other issue, we can adjust the HEAD's test to do likewise?
>
> Yeah, but for the other issue, I started thinking that we should just
> forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...

When the command is being performed on a partitioned table you mean?
That is, it'd be okay to reference them when the command is being
performed directly on a leaf partition, although it's another thing
whether the leaf partitions themselves have sensible values to provide
for them.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




A little enhancement for hashdisk testset

2020-09-23 Thread Hou, Zhijie
Hi,

In (src/test/regress/sql/aggregates.sql)

I found that some tables are not dropped at the end of the SQL script.
That does not cause any problem, but I think it's better to drop the tables
once they are no longer needed.

Please see the attachment for the patch.

Best regards,
Houzj





0001-enhance-test-for-hashdisk.patch
Description: 0001-enhance-test-for-hashdisk.patch


Re: The ultimate extension hook.

2020-09-23 Thread Daniel Wood


> On 09/23/2020 9:26 PM Tom Lane  wrote:
> ...
> > The hook I'd like to see would be in the PostgresMain() loop
> > for the API "firstchar" messages.
> 
> What, to invent your own protocol?  Where will you find client libraries
> buying into that?

No API/client changes are needed for:
1) API tracing/filtering; or
3) custom SQL-like commands, through a trivial modification to Simple Query
'Q'.  Purely optional, as you'll see at the end.

Yes, (2) API extension "case 'A'" could be used to roll one's own protocol.
When pondering API hooking in general, I thought of this also, but don't let
it be a distraction.

> I'm not really convinced that any of the specific use-cases you suggest
> are untenable to approach via the existing function fastpath mechanism,
> anyway.

Certainly (3) is just a command level way to execute a function instead of 
'select myfunc()'.  But it does go through the SQL machinery and SQL argument 
type lookup and processing.  I like fast and direct things.  And (3) is so 
trivial to implement.

However, even fastpath doesn't provide a protocol hook function where tracing 
could be done.  If I had that alone I could do my own 'Q' hook and do the 
"!cmd" processing in my extension even if I sold the idea just based on 
tracing/filtering.

We hook all kinds of things in PG.  Think big.  Why should the protocol 
processing not have a hook?  I'll bet some others will think of things I 
haven't even yet thought of that would leverage this.

- Dan Wood




Re: problem with RETURNING and update row movement

2020-09-23 Thread Amit Langote
On Thu, Sep 24, 2020 at 1:54 PM Michael Paquier  wrote:
>
> On Thu, Sep 24, 2020 at 04:25:24AM +0900, Etsuro Fujita wrote:
> > Sorry, my thought was vague.  To store xmin/xmax/cmin/cmax into a
> > given slot, we need to extend the TupleTableSlot struct to contain
> > these attributes as well?  Or we need to introduce a new callback
> > routine for that (say, setsysattr)?  These would not be
> > back-patchable, though.
>
> Please note that the latest patch fails to apply, so this needs a
> rebase.

I saw the CF-bot failure too yesterday, although it seems that it's
because the bot is trying to apply the patch version meant for v11
branch onto HEAD branch.  I just checked that the patches apply
cleanly to their respective branches.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Add features to pg_stat_statements

2020-09-23 Thread legrand legrand
Hi,
Both are probably not needed.
I would prefer it accessible in a view, like:

Event | date | victims 
Eviction...
Reset...
Part reset ...

As there are other needs to track reset times.
Regards
PAscal








Re: The ultimate extension hook.

2020-09-23 Thread David Rowley
On Thu, 24 Sep 2020 at 16:26, Tom Lane  wrote:
>
> Daniel Wood  writes:
> > Hooks exist all over PG for extensions to cover various specific usages.
> > The hook I'd like to see would be in the PostgresMain() loop
> > for the API "firstchar" messages.
>
> What, to invent your own protocol?  Where will you find client libraries
> buying into that?

Well, Dan did mention other use cases.  It's certainly questionable whether
people would want to use it to invent their own message types, as they'd
need client support.  However, when it comes to newly proposed hooks,
I thought we should be asking ourselves questions like: are there
legitimate use cases for this?  Is it safe to expose this?  It seems a
bit backwards to consider illegitimate uses of a hook unless they
relate to security.

> I'm not really convinced that any of the specific use-cases you suggest
> are untenable to approach via the existing function fastpath mechanism,
> anyway.

I wondered if there was much in the way of use cases like a traffic
filter, or statement replication.  I wasn't sure if it was a solution
looking for a problem or not, but it seems like it could be productive
to talk about possibilities here and make a judgement call based on
whether any alternatives exist today that would allow that problem to be
solved sufficiently in another way.

David




Re: Missing TOAST table for pg_class

2020-09-23 Thread Michael Paquier
On Wed, Sep 23, 2020 at 06:11:06PM -0300, Fabrízio de Royes Mello wrote:
> Adding a TOAST table can cause circular dependencies between those
> relations?  If you don't mind, can you explain more about it?

The difficult task here is to make sure that we don't have any corner
cases that begin to break because of those additions:
https://www.postgresql.org/message-id/20180719235006.5oqjjscmp3nxj...@alap3.anarazel.de
--
Michael




Re: Feature improvement for pg_stat_statements

2020-09-23 Thread btnakamichin

On 2020-09-22 04:58, Andres Freund wrote:

Hi,

On 2020-09-18 17:55:12 +0900, btnakamichin wrote:

I’m thinking of adding a function called
pg_stat_statements_reset_time() that returns the last timestamp at which
pg_stat_statements_reset() was executed. pg_stat_statements can reset
statistics per SQL statement. We could record a reset timestamp for each
SQL statement, but I don’t feel the need. This time, I’m thinking of
updating the reset timestamp only when all stats were reset.


What exactly do you mean by "can reset each SQL statement"? I don't
really see what alternative you're analyzing?

It does make sense to me to have a function returning the time of the
last reset.

Greetings,

Andres Freund


I’m sorry, I forgot to include pgsql-hackers in the cc, so I am resending
it and attaching the patch.

Thank you, I appreciate your comment.

I am unfamiliar with this function. I’m sorry if my interpretation is
mistaken.



What exactly do you mean by "can reset each SQL statement"? I don't
really see what alternative you're analyzing?


pg_stat_statements_reset() discards statistics gathered so far by
pg_stat_statements corresponding to the specified userid, dbid and
queryid.


If no parameter is specified, or all the specified parameters are
0 (invalid), it will discard all statistics.


I think it is important to record the timestamp at which all statistics
were discarded, so I’d like to add a function that covers only the case
where all stats were reset.


The following is an example of this feature.
postgres=# select * from pg_stat_statements_reset_time();
 pg_stat_statements_reset_time
---
 2020-09-24 13:36:44.87005+09
(1 row)
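
For reference, here is a minimal sketch of what the backing C function could
look like, assuming the reset_time field added to pgssSharedState in the
attached patch (only an illustration, not necessarily the code in the
attachment):

Datum
pg_stat_statements_reset_time(PG_FUNCTION_ARGS)
{
	TimestampTz reset_time;

	if (!pgss || !pgss_hash)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("pg_stat_statements must be loaded via shared_preload_libraries")));

	/* Read the shared timestamp under the same lock that protects resets. */
	LWLockAcquire(pgss->lock, LW_SHARED);
	reset_time = pgss->reset_time;
	LWLockRelease(pgss->lock);

	PG_RETURN_TIMESTAMPTZ(reset_time);
}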

Best regards,

Naoki Nakamichi
diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 081f997d70..3ec627b956 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -6,7 +6,7 @@ OBJS = \
 	pg_stat_statements.o
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql \
 	pg_stat_statements--1.7--1.8.sql pg_stat_statements--1.6--1.7.sql \
 	pg_stat_statements--1.5--1.6.sql pg_stat_statements--1.4--1.5.sql \
 	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
new file mode 100644
index 00..55cd5a3a63
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql
@@ -0,0 +1,12 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.8--1.9.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.9'" to load this file. \quit
+
+--- Define pg_stat_statements_reset_time
+
+CREATE FUNCTION pg_stat_statements_reset_time()
+RETURNS TIMESTAMP WITH TIME ZONE
+AS 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE STRICT;
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1eac9edaee..f2e7fbda77 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -81,12 +81,12 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/timestamp.h"
 
 PG_MODULE_MAGIC;
 
 /* Location of permanent stats file (valid when database is shut down) */
 #define PGSS_DUMP_FILE	PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
-
 /*
  * Location of external query text file.  We don't keep it in the core
  * system's stats_temp_directory.  The core system can safely use that GUC
@@ -222,6 +222,7 @@ typedef struct pgssSharedState
 	Size		extent;			/* current extent of query file */
 	int			n_writers;		/* number of active writers to query file */
 	int			gc_count;		/* query file garbage collection cycle count */
+	TimestampTz reset_time; /* timestamp with all stats reset */
 } pgssSharedState;
 
 /*
@@ -327,6 +328,7 @@ PG_FUNCTION_INFO_V1(pg_stat_statements_1_2);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_3);
 PG_FUNCTION_INFO_V1(pg_stat_statements_1_8);
 PG_FUNCTION_INFO_V1(pg_stat_statements);
+PG_FUNCTION_INFO_V1(pg_stat_statements_reset_time);
 
 static void pgss_shmem_startup(void);
 static void pgss_shmem_shutdown(int code, Datum arg);
@@ -554,6 +556,7 @@ pgss_shmem_startup(void)
 		pgss->extent = 0;
 		pgss->n_writers = 0;
 		pgss->gc_count = 0;
+		pgss->reset_time = GetCurrentTimestamp();
 	}
 
 	memset(&info, 0, sizeof(info));
@@ -673,6 +676,10 @@ pgss_shmem_startup(void)
 		entry->counters = temp.counters;
 	}
 
+	/* read timestamp when all stats were reset */
+	if (fread(&pgss->reset_time, sizeof(TimestampTz), 1, file) != 1)
+		goto read_error;
+
 	pfree(buffer);
 	FreeFile(file);
 	FreeFile(qfile);
@@ -794,6 +801,10 @@ pgss_shmem_shutdown(int code, Datum arg)
 		}
 	}
 
+	/* store timestamp when all stats were reset */
+	

Re: problem with RETURNING and update row movement

2020-09-23 Thread Michael Paquier
On Thu, Sep 24, 2020 at 04:25:24AM +0900, Etsuro Fujita wrote:
> Sorry, my thought was vague.  To store xmin/xmax/cmin/cmax into a
> given slot, we need to extend the TupleTableSlot struct to contain
> these attributes as well?  Or we need to introduce a new callback
> routine for that (say, setsysattr)?  These would not be
> back-patchable, though.

Please note that the latest patch fails to apply, so this needs a
rebase.
--
Michael




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-09-23 Thread Michael Paquier
On Tue, Aug 18, 2020 at 01:12:51PM +0300, Anna Akenteva wrote:
> I updated the most recent patch and removed the use of "master" from it,
> replacing it with "primary".

This is failing to apply lately, causing the CF bot to complain:
http://cfbot.cputube.org/patch_29_772.log
--
Michael




Re: Online checksums patch - once again

2020-09-23 Thread Michael Paquier
On Wed, Sep 23, 2020 at 02:34:36PM +0200, Daniel Gustafsson wrote:
> While in the patch I realized that the relationlist saved the relkind but the
> code wasn't actually using it, so I've gone ahead and removed that with a lot
> fewer palloc calls as a result. The attached v22 fixes that and the above.

Some of the TAP tests are blowing up here, as the CF bot is telling:
t/003_standby_checksum.pl .. 1/11 # Looks like you planned 11 tests but ran 4.
# Looks like your test exited with 29 just after 4.
t/003_standby_checksum.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
--
Michael




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-23 Thread tsunakawa.ta...@fujitsu.com
In v15:

(1)
+   for (cur_blk = firstDelBlock[j]; cur_blk < nblocks; cur_blk++)

The right side of "cur_blk <" should not be nblocks, because nblocks is no
longer the number of blocks in the relation fork.


(2)
+   BlockNumber nblocks;
+   nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

You should either:

* Combine the two lines into one: BlockNumber nblocks = ...;

or

* Put an empty line between the two lines to separate declarations and
execution statements.
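
Putting (1) and (2) together, the hunk could end up looking something like
this (just a sketch based on the quoted lines, not the actual patch):

	BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum[j]) - firstDelBlock[j];

	for (cur_blk = firstDelBlock[j];
		 cur_blk < firstDelBlock[j] + nblocks;
		 cur_blk++)
	{
		/* ... invalidate the buffer for cur_blk as before ... */
	}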


After correcting these, I think you can check the recovery performance.



Regards
Takayuki Tsunakawa






Re: The ultimate extension hook.

2020-09-23 Thread Tom Lane
Daniel Wood  writes:
> Hooks exist all over PG for extensions to cover various specific usages.
> The hook I'd like to see would be in the PostgresMain() loop
> for the API "firstchar" messages.

What, to invent your own protocol?  Where will you find client libraries
buying into that?

I'm not really convinced that any of the specific use-cases you suggest
are untenable to approach via the existing function fastpath mechanism,
anyway.

regards, tom lane




Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-09-23 Thread Michael Paquier
On Tue, Sep 08, 2020 at 12:18:23PM -0700, Andres Freund wrote:
> A test verifying that a non-transactional message in a later-aborted
> transaction is handled correctly would be good.

On top of that, the patch needs a rebase as it visibly fails to apply,
per the CF bot.
--
Michael




Re: PATCH: Attempt to make dbsize a bit more consistent

2020-09-23 Thread Michael Paquier
On Thu, Sep 10, 2020 at 11:51:30AM +0200, Daniel Gustafsson wrote:
> Some comments on the patch:

Extra comment for this patch: regression tests are failing.
--
Michael




The ultimate extension hook.

2020-09-23 Thread Daniel Wood
Hooks exist all over PG for extensions to cover various specific usages.

The hook I'd like to see would be in the PostgresMain() loop
for the API "firstchar" messages.

While I started out just wanting the hook for the absolute minimum overhead
to execute a function, even faster than fastpath, in brainstorming with David
Rowley other use cases became apparent.

API tracing within the engine.  I've heard of client tools for this.
API filtering.  Block/ignore manual checkpoints for instance.
API message altering.
Anything you want to hook into at the highest level well above ExecutorRun.

Originally I just wanted a lightweight mechanism to capture some system 
counters like /proc/stat without going through the SQL execution machinery.  
I'm picky about implementing stuff in the absolute fastest way.  :-)  But I 
think there are other practical things that I haven't even thought of yet.

There are a few implementation mechanisms which achieve slightly different 
possibilities:

1) The generic mechanism would let one or more API filters be installed to
directly call functions in an extension.  There would be no SQL arg processing
overhead based on the specific function.  You'd just pass it the
StringInfoData 'msg' itself.  Multiple extensions might use the hook, so you'd
need to rewind the StringInfo buffer.  Maybe it returns a boolean to indicate
either that no further processing of this message is needed, or that we fall
through to the normal "switch (firstchar)" processing.  (A rough sketch of
this follows near the end of this mail.)

2) switch (firstchar) { case 'A': // New trivial API message for extensions
which would call a single extension installed function to do whatever I wanted 
based on the message payload.  And, yes, I know this can be done just using 
SQL.  It is simply a variation.  But this would require client support and I 
prefer the below.

3) case 'Q':			/* simple query */
	if (pq_peekbyte() == '!' && APIHook != NULL)
	{
		(*APIHook)(msg);
		...
		continue;
	}

I've used this last technique to do things like:
	if (!strncmp(query_string, "DIEDIEDIE", 9)) {
		char *np = NULL;
		*np = 1;	/* deliberately crash the server */
	} else if (!strncmp(query_string, "PING", 4)) {
		static const char *pong = "PONG";
		pq_putmessage('C', pong, strlen(pong) + 1);
		send_ready_for_query = true;
		continue;
	} else if (...)

Then I can simply type PING into psql and get back a PONG.
Or during a stress test on a remote box I can execute the simple query
"DIEDIEDIE" and crash the server.  I did this inline for experimentation
before, but it would be nice to have a mechanism to use a "statement" to
invoke a hook function in an extension.  A single check for "!" in the 'Q'
processing would allow user-defined commands in extensions.  The dispatcher
would be in the extension.  I just need the "!" check.

Another example where ultimate performance might be a goal: if you are
familiar with why redis/memcached/etc. exist, then imagine loading SQL results
into a cache in an extension and executing, as a 'simple' query, something
like "!LOOKUP ..." and getting the value back faster than SQL could.
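
To make mechanism (1) above concrete, here is a rough sketch with invented
names (nothing like this hook exists in PostgreSQL today):

/* hypothetical hook, to be filled in by an extension's _PG_init() */
typedef bool (*ProtocolMessageHook) (int firstchar, StringInfo msg);
ProtocolMessageHook ProtocolMessage_hook = NULL;

/* ... inside the PostgresMain() command loop ... */
		firstchar = ReadCommand(&input_message);

		/* Give extensions first crack at the raw protocol message. */
		if (ProtocolMessage_hook != NULL &&
			(*ProtocolMessage_hook) (firstchar, &input_message))
		{
			/* the hook consumed the message; act like a completed command */
			send_ready_for_query = true;
			continue;
		}

		switch (firstchar)
		{
			/* ... existing message handling ... */
		}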

Before I prototype I want to get some feedback.  Why not have a hook at the API 
level?

Re: Get memory contexts of an arbitrary backend process

2020-09-23 Thread Michael Paquier
On Thu, Sep 10, 2020 at 09:11:21PM +0900, Kasahara Tatsuhito wrote:
> I think it's fine to have an interface to delete in an emergency, but
> I agree that
> users shouldn't be made aware of the existence or deletion of dump
> files, basically.

Per the CF bot, the number of tests needs to be tweaked, because we
test each entry filtered out with is_deeply(), meaning that the number
of tests needs to be updated to reflect that if the filtered list is
changed:
t/010_pg_basebackup.pl ... 104/109 # Looks like you planned 109 tests but ran 110.
t/010_pg_basebackup.pl ... Dubious, test returned 255 (wstat 65280, 0xff00)
All 109 subtests passed

Simple enough to fix.
--
Michael


signature.asc
Description: PGP signature


Re: proposal: schema variables

2020-09-23 Thread Pavel Stehule
On Thu, Sep 24, 2020 at 5:56 AM Michael Paquier wrote:

> On Sat, Jul 11, 2020 at 06:44:24AM +0200, Pavel Stehule wrote:
> > rebase
>
> Per the CF bot, this needs an extra rebase as it does not apply
> anymore.  This has not attracted much attention from committers
> either.
>

I'll fix it today

--
> Michael
>


Re: FETCH FIRST clause PERCENT option

2020-09-23 Thread Michael Paquier
On Mon, Aug 10, 2020 at 01:23:44PM +0300, Surafel Temesgen wrote:
> I also implemented the PERCENT WITH TIES option; a patch is attached.
> I didn't start a new thread because the patches share common code.

This fails to apply per the CF bot.  Could you send a rebase?
--
Michael




Re: proposal: schema variables

2020-09-23 Thread Michael Paquier
On Sat, Jul 11, 2020 at 06:44:24AM +0200, Pavel Stehule wrote:
> rebase

Per the CF bot, this needs an extra rebase as it does not apply
anymore.  This has not attracted much attention from committers
either.
--
Michael




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-09-23 Thread Amit Langote
Hi Alvaro,

Sorry, I totally failed to see the v2 you had posted, and a couple of
other emails where you mentioned the issues I brought up.

On Thu, Sep 24, 2020 at 12:23 AM Alvaro Herrera
 wrote:
> On 2020-Sep-23, Amit Langote wrote:
> > I suspect that we don't really need this defensive constraint.  I mean
> > even after committing the 1st transaction, the partition being
> > detached still has relispartition set to true and
> > RelationGetPartitionQual() still returns the partition constraint, so
> > it seems the constraint being added above is duplicative.
>
> Ah, thanks for thinking through that.  I had vague thoughts about this
> being unnecessary in the current mechanics, but hadn't fully
> materialized the thought.  (The patch was completely different a few
> unposted iterations ago).
>
> > Moreover, the constraint is not removed as part of any cleaning up
> > after the DETACH process, so repeated attach/detach of the same
> > partition results in the constraints piling up:
>
> Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
> didn't worry too much about it because I was thinking I'd rather get rid
> of the constraint addition in the first place.

Okay, gotcha.

> > Noticed a bug/typo in the patched RelationBuildPartitionDesc():
> >
> > -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> > +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> > +   include_detached);
> >
> > You're passing NoLock for include_detached which means you never
> > actually end up including detached partitions from here.
>
> I fixed this in the version I posted on Sept 10.  I think you were
> reading the version posted at the start of this thread.

I am trying the v2 now and I can confirm that those problems are now fixed.

However, I am a bit curious about including detached partitions in
some cases but not in others, which can result in (to me) surprising
behavior, as follows:

Session 1:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);

...attach GDB and set breakpoint so as to block right after finishing
the 1st transaction of DETACH PARTITION CONCURRENTLY...
alter table foo detach partition foo2 concurrently;


Session 2:

begin;
insert into foo values (2);  -- ok
select * from foo;
select * from foo;  -- ?!
 a | b
---+---
(0 rows)

Maybe, it's fine to just always exclude detached partitions, although
perhaps I am missing some corner cases that you have thought of?

Also, I noticed that looking up a parent's partitions via
RelationBuildPartitionDesc or directly will inspect inhdetached to
include or exclude partitions, but checking if a child table is a
partition of a given parent table via get_partition_parent doesn't.
Now if you fix get_partition_parent() to also take into account
inhdetached, for example, to return InvalidOid if true, then the
callers would need to not consider the child table a valid partition.
So, RelationGetPartitionQual() on a detached partition should actually
return NIL, making my earlier claim about not needing the defensive
CHECK constraint invalid.  But maybe that's fine if all places agree
on a detached partition not being a valid partition anymore?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: "cert" + clientcert=verify-ca in pg_hba.conf?

2020-09-23 Thread Michael Paquier
On Tue, Sep 01, 2020 at 10:27:03PM -0400, Bruce Momjian wrote:
> OK, good.  Let's wait a few days and I will then apply it for PG 14.

It has been a few days, and nothing has happened here.  I have not
looked at the patch in detail, so I cannot say whether it is fine or not,
but please note that the patch fails to apply per the CF bot.
--
Michael




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-09-23 Thread Michael Paquier
On Mon, Aug 31, 2020 at 03:13:06PM -0700, Melanie Plageman wrote:
> Attached is the current version of adaptive hash join with two
> significant changes as compared to v10:

The CF bot is complaining about a regression test failure:
@@ -2465,7 +2465,7 @@
  Gather (actual rows=469 loops=1)
Workers Planned: 1
Workers Launched: 1
-   ->  Parallel Hash Left Join (actual rows=234 loops=2)
+   ->  Parallel Hash Left Join (actual rows=235 loops=2)
--
Michael




Re: Display individual query in pg_stat_activity

2020-09-23 Thread Michael Paquier
On Thu, Sep 10, 2020 at 04:06:17PM +0200, Drouvot, Bertrand wrote:
> Attaching a new version as the previous one was not passing the Patch Tester
> anymore.

Ditto, the CF bot is complaining again.  Could you send a rebase?
--
Michael




Re: [PATCH] Automatic HASH and LIST partition creation

2020-09-23 Thread Michael Paquier
On Mon, Sep 14, 2020 at 02:38:56PM +0300, Anastasia Lubennikova wrote:
> Fixed. This was also caught by cfbot. This version should pass it clean.

Please note that regression tests are failing, because of 6b2c4e59.
--
Michael




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread Greg Nancarrow
On Thu, Sep 24, 2020 at 12:38 PM Amit Kapila  wrote:
>
> I have not checked the patch but I guess if we parallelise Inserts
> with Returning then isn't it better to have Gather node above Parallel
> Inserts?
>

This is indeed the case with the patch applied.

For example:

test=# explain insert into primary_tbl select * from third_tbl
returning index, height;
QUERY PLAN
---
 Gather  (cost=0.00..28.15 rows= width=12)
   Workers Planned: 3
   ->  Parallel Insert on primary_tbl  (cost=0.00..28.15 rows=1040 width=12)
 ->  Parallel Seq Scan on third_tbl  (cost=0.00..87.25
rows=3225 width=12)
(4 rows)

test=# insert into primary_tbl select * from third_tbl returning index, height;
 index | height
---+
 1 |1.2
 2 |1.2
 3 |1.2
 4 |1.2
 5 |1.2
 6 |1.2
 7 |1.2

...

  9435 |1.2
  9619 |1.2
  9620 |1.2
( rows)

INSERT 0 


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Prefer TG_TABLE_NAME over TG_RELNAME in tests

2020-09-23 Thread Michael Paquier
On Wed, Sep 23, 2020 at 12:07:14PM +0200, Daniel Gustafsson wrote:
> TG_RELNAME was marked deprecated in commit 3a9ae3d2068 some 14 years ago, but
> we still use it in the triggers test suite (test added in 59b4cef1eb74a a year
> before deprecation).  Seems about time to move over to TG_TABLE_NAME 
> ourselves,
> as TG_RELNAME is still covered by the test added in the deprecation commit.

No objections from here to remove that from the core tests.  It is
worth noting that Debian Code Search hints that this is used in some
extensions:
https://codesearch.debian.net/search?q=TG_RELNAME&literal=1

These are pgformatter, bucardo, sql-ledger, ledgersmb and davical.
--
Michael




Re: Parallel copy

2020-09-23 Thread Greg Nancarrow
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and
> before running with 1 worker? If not, then the index insertion time would
> increase [1] (for me it increases by 10 sec). This is obvious because the
> first time the index will be built in a bottom-up manner (from leaves to
> root), but the second time it has to search and insert at the proper leaf
> and inner B+Tree nodes.

Yes, it's being truncated before running each and every COPY.

> 2. If possible, can you also run with custom postgresql.conf settings[2] 
> along with default? Just to ensure that other bg processes such as 
> checkpointer, autovacuum, bgwriter etc. don't affect our testcase. For 
> instance, with default postgresql.conf file, it looks like checkpointing[3] 
> is happening frequently, could you please let us know if that happens at your 
> end?

Yes, I have run with both the default and your custom settings. With the
default settings, I can confirm that checkpointing is happening frequently
with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the 
> consistency of the issue.

Yes, I have run it 4 times. There seems to be a performance hit (whether
normal copy or parallel-1 copy) on the first COPY run on a freshly created
database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user 
> processes(except system processes) are running. Is it possible for you to do 
> the same?
>
> Please capture and share the timing logs with us.
>

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your
Parallel Copy patch and testing patch applied (HEAD at
733fa9aa51c526582f100aa0d375e0eb9a6bce8b).

Test results show that Parallel COPY with 1 worker is performing
better than normal COPY in the test scenarios run. There is a
performance hit (regardless of COPY type) on the very first COPY run
on a freshly-created database.

I ran the test case 4 times, and also in reverse order, with truncate
run before each COPY (output and logs named _0_1 run normal COPY
then parallel COPY, and those named _1_0 run parallel COPY and then
normal COPY).

Please refer to attached results.

Regards,
Greg


testing_patch_results.tar.gz
Description: application/gzip


scram-sha-256 broken with FIPS and OpenSSL 1.0.2

2020-09-23 Thread Michael Paquier
Hi all,

Enabling FIPS with OpenSSL 1.0.2 causes direct calls to the SHAXXX
routines to fail:
"Low level API call to digest SHA256 forbidden in fips mode"

This got discussed back in 2018, but I never got back to it:
https://www.postgresql.org/message-id/20180911030250.ga27...@paquier.xyz

One thing I did not like in the past patch was that we did not handle
failures if one of OpenSSL's calls failed, but this can easily be handled
by using a trick similar to jsonapi.c, failing hard if that happens.

It is worth noting that the low-level SHA routines have not been
recommended in OpenSSL for years, and that they have been officially
marked as deprecated in 3.0.0.  So, while the changes in sha2.h are not
back-patchable because of the ABI breakage they introduce, switching
sha2_openssl.c to use EVP is a better move in the long term, even if
that means that SCRAM+FIPS would not work with PG 10~13.  Hence the
attached is something for HEAD, even though it would be possible to do
this in older releases, as the routines used in the attached are
available in versions of OpenSSL older than 1.0.1.

Any thoughts?
--
Michael
diff --git a/src/include/common/sha2.h b/src/include/common/sha2.h
index 9c4abf777d..2c52838161 100644
--- a/src/include/common/sha2.h
+++ b/src/include/common/sha2.h
@@ -51,7 +51,7 @@
 #define _PG_SHA2_H_
 
 #ifdef USE_OPENSSL
-#include <openssl/sha.h>
+#include <openssl/evp.h>
 #endif
 
 /*** SHA224/256/384/512 Various Length Definitions ***/
@@ -70,10 +70,10 @@
 
 /* Context Structures for SHA224/256/384/512 */
 #ifdef USE_OPENSSL
-typedef SHA256_CTX pg_sha256_ctx;
-typedef SHA512_CTX pg_sha512_ctx;
-typedef SHA256_CTX pg_sha224_ctx;
-typedef SHA512_CTX pg_sha384_ctx;
+typedef EVP_MD_CTX *pg_sha256_ctx;
+typedef EVP_MD_CTX *pg_sha512_ctx;
+typedef EVP_MD_CTX *pg_sha224_ctx;
+typedef EVP_MD_CTX *pg_sha384_ctx;
 #else
 typedef struct pg_sha256_ctx
 {
diff --git a/src/common/sha2_openssl.c b/src/common/sha2_openssl.c
index 41673b3a88..1d0b254487 100644
--- a/src/common/sha2_openssl.c
+++ b/src/common/sha2_openssl.c
@@ -20,83 +20,116 @@
 #include "postgres_fe.h"
 #endif
 
-#include <openssl/sha.h>
-
 #include "common/sha2.h"
 
+#ifdef FRONTEND
+#include "common/logging.h"
+#else
+#include "miscadmin.h"
+#endif
+
+#ifdef FRONTEND
+#define sha2_log_and_abort(...) \
+	do { pg_log_fatal(__VA_ARGS__); exit(1); } while(0)
+#else
+#define sha2_log_and_abort(...) elog(ERROR, __VA_ARGS__)
+#endif
+
+static void
+digest_init(EVP_MD_CTX **ctx, const EVP_MD *type)
+{
+	*ctx = EVP_MD_CTX_create();
+	if (*ctx == NULL)
+		sha2_log_and_abort("could not create EVP digest context");
+	EVP_DigestInit_ex(*ctx, type, NULL);
+}
+
+static void
+digest_update(EVP_MD_CTX **ctx, const uint8 *data, size_t len)
+{
+	EVP_DigestUpdate(*ctx, data, len);
+}
+
+static void
+digest_final(EVP_MD_CTX **ctx, uint8 *dest)
+{
+	if (EVP_DigestFinal_ex(*ctx, dest, 0) <= 0)
+		sha2_log_and_abort("could not finalize EVP digest context");
+	EVP_MD_CTX_destroy(*ctx);
+}
 
 /* Interface routines for SHA-256 */
 void
 pg_sha256_init(pg_sha256_ctx *ctx)
 {
-	SHA256_Init((SHA256_CTX *) ctx);
+	digest_init(ctx, EVP_sha256());
 }
 
 void
 pg_sha256_update(pg_sha256_ctx *ctx, const uint8 *data, size_t len)
 {
-	SHA256_Update((SHA256_CTX *) ctx, data, len);
+	digest_update(ctx, data, len);
 }
 
 void
 pg_sha256_final(pg_sha256_ctx *ctx, uint8 *dest)
 {
-	SHA256_Final(dest, (SHA256_CTX *) ctx);
+	digest_final(ctx, dest);
 }
 
 /* Interface routines for SHA-512 */
 void
 pg_sha512_init(pg_sha512_ctx *ctx)
 {
-	SHA512_Init((SHA512_CTX *) ctx);
+	digest_init(ctx, EVP_sha512());
 }
 
 void
 pg_sha512_update(pg_sha512_ctx *ctx, const uint8 *data, size_t len)
 {
-	SHA512_Update((SHA512_CTX *) ctx, data, len);
+	digest_update(ctx, data, len);
 }
 
 void
 pg_sha512_final(pg_sha512_ctx *ctx, uint8 *dest)
 {
-	SHA512_Final(dest, (SHA512_CTX *) ctx);
+	digest_final(ctx, dest);
 }
 
 /* Interface routines for SHA-384 */
 void
 pg_sha384_init(pg_sha384_ctx *ctx)
 {
-	SHA384_Init((SHA512_CTX *) ctx);
+	digest_init(ctx, EVP_sha384());
 }
 
 void
 pg_sha384_update(pg_sha384_ctx *ctx, const uint8 *data, size_t len)
 {
-	SHA384_Update((SHA512_CTX *) ctx, data, len);
+	digest_update(ctx, data, len);
 }
 
 void
 pg_sha384_final(pg_sha384_ctx *ctx, uint8 *dest)
 {
-	SHA384_Final(dest, (SHA512_CTX *) ctx);
+	digest_final(ctx, dest);
 }
 
 /* Interface routines for SHA-224 */
 void
 pg_sha224_init(pg_sha224_ctx *ctx)
 {
-	SHA224_Init((SHA256_CTX *) ctx);
+	digest_init(ctx, EVP_sha224());
 }
 
 void
 pg_sha224_update(pg_sha224_ctx *ctx, const uint8 *data, size_t len)
 {
-	SHA224_Update((SHA256_CTX *) ctx, data, len);
+	digest_update(ctx, data, len);
 }
 
 void
 pg_sha224_final(pg_sha224_ctx *ctx, uint8 *dest)
 {
-	SHA224_Final(dest, (SHA256_CTX *) ctx);
+	digest_final(ctx, dest);
 }




Re: Parallel Inserts in CREATE TABLE AS

2020-09-23 Thread Andres Freund
Hi,

On 2020-09-23 17:20:20 +0530, Bharath Rupireddy wrote:
> The idea of this patch is to allow the leader and each worker insert the
> tuples in parallel if the SELECT part of the CTAS is parallelizable.

Cool!


> The design:

I think it'd be good if you could explain a bit more why you think this is
safe to do in the way you have done it.


E.g. from a quick scroll through the patch, there's not even a comment
explaining that the only reason there doesn't need to be code dealing
with xid assignment is that we already did the catalog changes to create
that? ...


> Pass the into clause, object id, command id from the leader to
> workers, so that each worker can create its own CTAS dest
> receiver. Leader inserts it's share of tuples if instructed to do, and
> so are workers. Each worker writes atomically it's number of inserted
> tuples into a shared memory variable, the leader combines this with
> it's own number of inserted tuples and shares to the client.
> 
> Below things are still pending. Thoughts are most welcome:
> 1. How better we can lift the "cannot insert tuples in a parallel worker"
> from heap_prepare_insert() for only CTAS cases or for that matter parallel
> copy? How about having a variable in any of the worker global contexts and
> use that? Of course, we can remove this restriction entirely in case we
> fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.

I have mentioned before that I think it'd be good if we changed the
insert APIs to have a more 'scan' like structure. I am thinking of
something like

TableInsertScan* table_begin_insert(Relation);
table_tuple_insert(TableInsertScan *is, other, args);
table_multi_insert(TableInsertScan *is, other, args);
table_end_insert(TableInsertScan *);

that'd then replace the BulkInsertStateData logic we have right now. But
more importantly it'd allow an AM to optimize operations across multiple
inserts, which is important for column stores.

And for the purpose of your question, we could then have a
table_insert_allow_parallel(TableInsertScan *);
or an additional arg to table_begin_insert().
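
Just to illustrate how that might be used by CTAS or COPY (argument lists are
placeholders and none of these functions exist today):

	TableInsertScan *iscan = table_begin_insert(rel);

	table_insert_allow_parallel(iscan);		/* opt in to parallel insertion */

	while ((slot = get_next_tuple()) != NULL)	/* placeholder tuple source */
		table_tuple_insert(iscan, slot /* , other args */);

	table_end_insert(iscan);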



> 3. Need to restrict parallel inserts, if CTAS tries to create temp/global
> tables as the workers will not have access to those tables. Need to analyze
> whether to allow parallelism if CTAS has prepared statements or with no
> data.

In which case does CTAS not create a table? You definitely need to
ensure that the table is created before your workers are started, and
it needs to be in a different CommandId.


Greetings,

Andres Freund




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread Amit Kapila
On Thu, Sep 24, 2020 at 7:57 AM Thomas Munro  wrote:
>
> On Tue, Sep 22, 2020 at 4:56 PM Greg Nancarrow  wrote:
> >  Gather  (cost=0.00..16.00 rows= width=12) (actual
> > time=69.870..70.310 rows=0 loops=1)
> >Workers Planned: 5
> >Workers Launched: 5
> >->  Parallel Insert on primary_tbl  (cost=0.00..16.00 rows=500
> > width=12) (actual time=59.948..59.949 rows=0 loops=6)
>
> Nice.  I took it for a quick spin.  I was initially surprised to see
> Gather.  I suppose I thought that Parallel {Insert|Update|Delete}
> might be a top level node itself, because in such a plan there is no
> need to gather tuples per se.  I understand exactly why you have it
> that way though: Gather is needed to control workers and handle their
> errors etc, and we don't want to have to terminate parallelism anyway
> (thinking of some kind of plan with multiple write subqueries).
>

I have not checked the patch but I guess if we parallelise Inserts
with Returning then isn't it better to have Gather node above Parallel
Inserts?

-- 
With Regards,
Amit Kapila.




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread Thomas Munro
On Tue, Sep 22, 2020 at 4:56 PM Greg Nancarrow  wrote:
>  Gather  (cost=0.00..16.00 rows= width=12) (actual
> time=69.870..70.310 rows=0 loops=1)
>Workers Planned: 5
>Workers Launched: 5
>->  Parallel Insert on primary_tbl  (cost=0.00..16.00 rows=500
> width=12) (actual time=59.948..59.949 rows=0 loops=6)

Nice.  I took it for a quick spin.  I was initially surprised to see
Gather.  I suppose I thought that Parallel {Insert|Update|Delete}
might be a top level node itself, because in such a plan there is no
need to gather tuples per se.  I understand exactly why you have it
that way though: Gather is needed to control workers and handle their
errors etc, and we don't want to have to terminate parallelism anyway
(thinking of some kind of plan with multiple write subqueries).




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread Andres Freund
Hi,

On 2020-09-22 14:55:21 +1000, Greg Nancarrow wrote:
> Following on from Dilip Kumar's POC patch for allowing parallelism of
> the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC
> patch for allowing parallelism of both the INSERT and SELECT parts,
> where it can be allowed.

Cool!

I think it'd be good if you outlined what your approach is to make this
safe.


> For cases where it can't be allowed (e.g. INSERT into a table with
> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
> ...") it at least allows parallelism of the SELECT part.

I think it'd be good to do this part separately and first, independent
of whether the insert part can be parallelized.


> Obviously I've had to update the planner and executor and
> parallel-worker code to make this happen, hopefully not breaking too
> many things along the way.

Hm, it looks like you've removed a fair number of checks; it's not clear to
me why that's safe in each instance.


> - Currently only for "INSERT INTO ... SELECT ...". To support "INSERT
> INTO ... VALUES ..." would need additional Table AM functions for
> dividing up the INSERT work amongst the workers (currently only exists
> for scans).

Hm, not entirely following. What precisely are you thinking of here?

I doubt it's really worth adding parallelism support for INSERT
... VALUES; the cost of spawning workers will almost always be higher than
the benefit.





> @@ -116,7 +117,7 @@ toast_save_datum(Relation rel, Datum value,
>   TupleDesc   toasttupDesc;
>   Datum   t_values[3];
>   boolt_isnull[3];
> - CommandId   mycid = GetCurrentCommandId(true);
> + CommandId   mycid = GetCurrentCommandId(!IsParallelWorker());
>   struct varlena *result;
>   struct varatt_external toast_pointer;
>   union

Hm? Why do we need this in the various places you have made this change?


> diff --git a/src/backend/access/heap/heapam.c 
> b/src/backend/access/heap/heapam.c
> index 1585861..94c8507 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2049,11 +2049,6 @@ heap_prepare_insert(Relation relation, HeapTuple tup, 
> TransactionId xid,
>* inserts in general except for the cases where inserts generate a new
>* CommandId (eg. inserts into a table having a foreign key column).
>*/
> - if (IsParallelWorker())
> - ereport(ERROR,
> - (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
> -  errmsg("cannot insert tuples in a parallel 
> worker")));
> -

I'm afraid that this weakens our checks more than I'd like. What if this
ends up being invoked from inside C code?


> @@ -822,19 +822,14 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> Relation NewHeap,
>   isdead = false;
>   break;
>   case HEAPTUPLE_INSERT_IN_PROGRESS:
> -
>   /*
>* Since we hold exclusive lock on the 
> relation, normally the
>* only way to see this is if it was inserted 
> earlier in our
>* own transaction.  However, it can happen in 
> system
>* catalogs, since we tend to release write 
> lock before commit
> -  * there.  Give a warning if neither case 
> applies; but in any
> -  * case we had better copy it.
> +  * there. In any case we had better copy it.
>*/
> - if (!is_system_catalog &&
> - 
> !TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple->t_data)))
> - elog(WARNING, "concurrent insert in 
> progress within table \"%s\"",
> -  
> RelationGetRelationName(OldHeap));
> +
>   /* treat as live */
>   isdead = false;
>   break;
> @@ -1434,16 +1429,11 @@ heapam_index_build_range_scan(Relation heapRelation,
>* the only way to see this is if it 
> was inserted earlier
>* in our own transaction.  However, it 
> can happen in
>* system catalogs, since we tend to 
> release write lock
> -  * before commit there.  Give a warning 
> if neither case
> -  * applies.
> +  * before commit there.
>*/
>   xwait = 
> HeapTupleHeaderGetXmin(heapTuple->t_data);
>   if 
> (!TransactionIdIsCurrentTransactionI

Re: Parallel Inserts in CREATE TABLE AS

2020-09-23 Thread vignesh C
>
> [1] - For table_multi_insert() in CTAS, I used an in-progress patch available 
> at 
> https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com
> [2] - Table with 2 integer columns, 100million tuples, with leader 
> participation,with default postgresql.conf file. All readings are of triplet 
> form - (workers, exec time in sec, improvement).
> case 1: no multi inserts - 
> (0,120,1X),(1,91,1.32X),(2,75,1.6X),(3,67,1.79X),(4,72,1.66X),(5,77,1.56),(6,83,1.44X)
> case 2: with multi inserts - 
> (0,59,1X),(1,32,1.84X),(2,28,2.1X),(3,25,2.36X),(4,23,2.56X),(5,22,2.68X),(6,22,2.68X)
> case 3: same table but unlogged with multi inserts - 
> (0,50,1X),(1,28,1.78X),(2,25,2X),(3,22,2.27X),(4,21,2.38X),(5,21,2.38X),(6,20,2.5X)
>

I feel this enhancement could give a good improvement; +1 for this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread vignesh C
On Tue, Sep 22, 2020 at 10:26 AM Greg Nancarrow  wrote:
>
> Hi Hackers,
>
> Following on from Dilip Kumar's POC patch for allowing parallelism of
> the SELECT part of "INSERT INTO ... SELECT ...", I have attached a POC
> patch for allowing parallelism of both the INSERT and SELECT parts,
> where it can be allowed.
> For cases where it can't be allowed (e.g. INSERT into a table with
> foreign keys, or INSERT INTO ... SELECT ... ON CONFLICT ... DO UPDATE
> ...") it at least allows parallelism of the SELECT part.
> Obviously I've had to update the planner and executor and
> parallel-worker code to make this happen, hopefully not breaking too
> many things along the way.

I feel this will be a very good performance improvement. +1 for this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: fixing old_snapshot_threshold's time->xid mapping

2020-09-23 Thread Thomas Munro
On Sat, Sep 19, 2020 at 12:19 PM Robert Haas  wrote:
> On Thu, Sep 17, 2020 at 10:40 AM Robert Haas  wrote:
> > Yeah, I plan to push forward with 0001 through 0003 soon, but 0001
> > needs to be revised with a PGDLLIMPORT marking, I think, and 0002
> > needs documentation.
>
> So here's an updated version of those three, with proposed commit
> messages, a PGDLLIMPORT for 0001, and docs for 0002.

LGTM.




Re: Lift line-length limit for pg_service.conf

2020-09-23 Thread Tom Lane
Daniel Gustafsson  writes:
> On 23 Sep 2020, at 21:33, Tom Lane  wrote:
>> 2. In the case where encoding conversion is performed, we still have
>> to pstrdup the result to have a safe copy for "curline".  But I
>> realized that we're making a poor choice of which copy to return to
>> the caller.  The output of encoding conversion is likely to be a good
>> bit bigger than necessary, cf. pg_do_encoding_conversion.  So if the
>> caller is one that saves the output string directly into a long-lived
>> dictionary structure, this wastes space.  We should return the pstrdup
>> result instead, and keep the conversion result as "curline", where
>> we'll free it next time through.

> I wonder if we have other callsites of pg_any_to_server which could benefit
> from lowering the returned allocation, a quick look didn't spot any but today
> has become yesterday here and tiredness might interfere.

After looking more closely, I've realized that actually none of the
existing core-code callers will save the returned string directly.
readstoplist() could do so, depending on what "wordop" is, but
all its existing callers pass lowerstr() which will always make a
new output string.  (Which itself could be a bit bloated :-()

So the concern I expressed above is really just hypothetical.
Still, the code is simpler this way and no slower, so I still
think it's an improvement.

(The bigger picture here is that the API for dictionary init
methods is pretty seriously misdesigned from a memory-consumption
standpoint.  Running the entire init process in the dictionary's
long-lived context goes against everything we've learned about
avoiding memory leaks.  We should run that code in a short-lived
command execution context, and explicitly copy just the data we
want into the long-lived context.  But changing that would be
a pretty big deal, breaking third-party dictionaries.  So I'm
not sure it's enough of a problem to justify the change.)

regards, tom lane




Re: Lift line-length limit for pg_service.conf

2020-09-23 Thread Daniel Gustafsson
> On 23 Sep 2020, at 21:33, Tom Lane  wrote:
> 
> I wrote:
>> So the attached adds a pstrdup/pfree to ensure that "curline"
>> has its own storage, putting us right back at two palloc/pfree
>> cycles per line.  I don't think there's a lot of choice though;
>> in fact, I'm leaning to the idea that we need to back-patch
>> that part of this.  The odds of trouble in a production build
>> probably aren't high, but still...
> 
> So I did that, but while looking at the main patch I realized that
> a few things were still being left on the table:

> 2. In the case where encoding conversion is performed, we still have
> to pstrdup the result to have a safe copy for "curline".  But I
> realized that we're making a poor choice of which copy to return to
> the caller.  The output of encoding conversion is likely to be a good
> bit bigger than necessary, cf. pg_do_encoding_conversion.  So if the
> caller is one that saves the output string directly into a long-lived
> dictionary structure, this wastes space.  We should return the pstrdup
> result instead, and keep the conversion result as "curline", where
> we'll free it next time through.

I wonder if we have other callsites of pg_any_to_server which could benefit
from lowering the returned allocation, a quick look didn't spot any but today
has become yesterday here and tiredness might interfere.

> So those considerations lead me to the attached.

Eyeing memory used by callers of tsearch_readline validates your observations;
I don't see any case where this patch isn't safe, and nothing sticks out.  +1 on
this, nice one!

cheers ./daniel



Re: Syncing pg_multixact directories

2020-09-23 Thread Thomas Munro
On Wed, Sep 23, 2020 at 4:11 PM Thomas Munro  wrote:
> On Wed, Sep 23, 2020 at 4:09 PM Michael Paquier  wrote:
> > On Tue, Sep 22, 2020 at 07:05:36PM -0700, Andres Freund wrote:
> > > Good catch! Probably that should probably be backpatched...
> >
> > +1.  Passing that down to the SLRU layer is a nice thing to do.  Were
> > you planning to send a second patch here?  The commit log generated
> > mentions patch 1/2.

While back-patching I realised that 9.5 and 9.6 had the same problem
for other SLRUs, so I updated the commit message accordingly and
pushed.  Thanks!




Re: Report error position in partition bound check

2020-09-23 Thread Tom Lane
I looked this over and pushed it with some minor adjustments.

However, while I was looking at it I couldn't help noticing that
transformPartitionBoundValue's handling of collation concerns seems
less than sane.  There are two things bugging me:

1. Why does it care about the expression's collation only when there's
a top-level CollateExpr?  For example, that means we get an error for

regression=# create table p (f1 text collate "C") partition by list(f1);
CREATE TABLE
regression=# create table c1 partition of p for values in ('a' collate "POSIX");
ERROR:  collation of partition bound value for column "f1" does not match 
partition key collation "C"

but not this:

regression=# create table c2 partition of p for values in ('a' || 'b' collate 
"POSIX");
CREATE TABLE

Given that we will override the expression's collation with the partition
column's collation anyway, I don't see why we have this check at all,
so my preference is to just rip out the entire stanza beginning with
"if (IsA(value, CollateExpr))".  If we keep it, though, I think it needs
to do something else that is more general.

2. Nothing is doing assign_expr_collations() on the partition expression.
This can trivially be shown to cause problems:

regression=# create table p (f1 bool) partition by list(f1);
CREATE TABLE
regression=# create table cf partition of p for values in ('a' < 'b');
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.


If we want to rip out the collation mismatch error altogether, then
fixing #2 would just require inserting assign_expr_collations() before
the expression_planner() call.  The other direction that would make
sense to me is to perform assign_expr_collations() after
coerce_to_target_type(), and then to complain if exprCollation()
isn't default and doesn't match the partition collation.  In any
case a specific test for a CollateExpr seems quite wrong.
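
For illustration, a minimal sketch of that second option as it might look
inside transformPartitionBoundValue(), after the coerce_to_target_type()
call (variable names follow the existing function's parameters; this is a
sketch of the idea, not a proposed patch):

    /* Resolve collations first, then complain only about a real mismatch. */
    assign_expr_collations(pstate, value);

    if (exprCollation(value) != DEFAULT_COLLATION_OID &&
        exprCollation(value) != partCollation)
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("collation of partition bound value for column \"%s\" "
                        "does not match partition key collation \"%s\"",
                        colName, get_collation_name(partCollation)),
                 parser_errposition(pstate, exprLocation(val))));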

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-23 Thread Masahiko Sawada
On Tue, 22 Sep 2020 at 10:17, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > Yes, but it still seems hard to me that we require for all FDW
> > implementations to commit/rollback prepared transactions without the
> > possibility of ERROR.
>
> Of course we can't eliminate the possibility of error, because remote servers 
> require network communication.  What I'm saying is to just require the FDW to 
> return error like xa_commit(), not throwing control away with ereport(ERROR). 
>  I don't think it's too strict.

So with your idea, I think we require FDW developers not to call
ereport(ERROR) as much as possible. If they need to use a function
such as palloc or lappend that could call ereport(ERROR), they need
to wrap it in PG_TRY() and PG_CATCH() and return control along with
the error message to the transaction manager rather than raising an
error. Then the transaction manager will emit the error message at an
error level lower than ERROR (e.g., WARNING), and call the commit/rollback
API again. But normally we do some cleanup on error, whereas in this
case the retried commit/rollback is performed without any cleanup. Is that
right? I'm not sure it's safe, though.
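
For illustration, a rough sketch of the calling convention being discussed,
where the FDW reports failure through a return value instead of letting
ereport(ERROR) escape. The routine name, the do_remote_commit_prepared()
helper, and the bool return convention are all hypothetical, not an existing
FDW API:

static bool
my_fdw_commit_prepared(ForeignServer *server, UserMapping *user,
                       const char *gid)
{
    MemoryContext oldcxt = CurrentMemoryContext;
    bool          ok = true;

    PG_TRY();
    {
        /* palloc, lappend, remote I/O etc. may all throw ERROR internally */
        do_remote_commit_prepared(server, user, gid);   /* hypothetical */
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* Downgrade the error: keep the message, but give control back */
        MemoryContextSwitchTo(oldcxt);
        edata = CopyErrorData();
        FlushErrorState();
        ereport(WARNING,
                (errmsg("could not commit prepared transaction \"%s\": %s",
                        gid, edata->message)));
        FreeErrorData(edata);
        ok = false;
    }
    PG_END_TRY();

    return ok;          /* caller (the transaction manager) may retry later */
}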

>
>
> > I think it's not necessarily that all FDW implementations need to be
> > able to support xa_complete(). We can support both synchronous and
> > asynchronous executions of prepare/commit/rollback.
>
> Yes, I think parallel prepare and commit can be an option for FDW.  But I 
> don't think it's an option for a serious scale-out DBMS.  If we want to use 
> FDW as part of PostgreSQL's scale-out infrastructure, we should design (if 
> not implemented in the first version) how the parallelism can be realized.  
> That design is also necessary because it could affect the FDW API.
>
>
> > If you're concerned that executing a UDF function by like 'SELECT
> > myfunc();' updates data on a foreign server, since the UDF should know
> > which foreign server it modifies data on it should be able to register
> > the foreign server and mark as modified. Or you’re concerned that a
> > UDF function in WHERE condition is pushed down and updates data (e.g.,
> >  ‘SELECT … FROM foreign_tbl WHERE id = myfunc()’)?
>
> What I had in mind is "SELECT myfunc(...) FROM mytable WHERE col = ...;"  
> Does the UDF call get pushed down to the foreign server in this case?  If not 
> now, could it be pushed down in the future?  If it could be, it's worth 
> considering how to detect the remote update now.

IIUC aggregation functions can be pushed down to the foreign server
but I have not idea the normal UDF in the select list is pushed down.
I wonder if it isn't.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Missing TOAST table for pg_class

2020-09-23 Thread Fabrízio de Royes Mello
On Tue, Sep 22, 2020 at 10:57 PM Michael Paquier 
wrote:
>
> On Tue, Sep 22, 2020 at 05:35:48PM -0400, Tom Lane wrote:
> > What exactly do you argue has changed since the previous decision
> > that should cause us to change it?  In particular, where is the
> > additional data to change our minds about the safety of such a thing?
>

From a technical perspective I really don't know how to solve it, but my
intention here is to raise a hand and demonstrate there are real scenarios
where Postgres breaks so easily.

And unfortunately, from the user's perspective it sounds a bit fragile. OK,
it's not a very common use case and the solution isn't easy; if it were easy,
I'm sure it would have been solved already.


> Not sure that's safe, as we also want to avoid circular dependencies
> similarly for pg_class, pg_index and pg_attribute.
>

Adding a TOAST table can cause circular dependencies between those relations?
If you don't mind, can you explain more about that?

Regards,

--
   Fabrízio de Royes Mello Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento


Re: Lift line-length limit for pg_service.conf

2020-09-23 Thread Tom Lane
I wrote:
> So the attached adds a pstrdup/pfree to ensure that "curline"
> has its own storage, putting us right back at two palloc/pfree
> cycles per line.  I don't think there's a lot of choice though;
> in fact, I'm leaning to the idea that we need to back-patch
> that part of this.  The odds of trouble in a production build
> probably aren't high, but still...

So I did that, but while looking at the main patch I realized that
a few things were still being left on the table:

1. We can save a palloc/pfree cycle in the case where no encoding
conversion need be performed, by allowing "curline" to point at
the StringInfo buffer instead of necessarily being a separate
palloc'd string.  (This seems safe since no other code should
manipulate the StringInfo before the next tsearch_readline call;
so we can't get confused about whether "curline" has its own storage.)

2. In the case where encoding conversion is performed, we still have
to pstrdup the result to have a safe copy for "curline".  But I
realized that we're making a poor choice of which copy to return to
the caller.  The output of encoding conversion is likely to be a good
bit bigger than necessary, cf. pg_do_encoding_conversion.  So if the
caller is one that saves the output string directly into a long-lived
dictionary structure, this wastes space.  We should return the pstrdup
result instead, and keep the conversion result as "curline", where
we'll free it next time through.

3. AFAICT, it's completely useless for tsearch_readline to call
pg_verify_mbstr, because pg_any_to_server will either do that exact
thing itself, or verify the input encoding as part of conversion.
Some quick testing says that you don't even get different error
messages.  So we should just drop that step.  (It's likely that the
separate call was actually needed when this code was written; I think
we tightened up the expectations for verification within encoding
conversion somewhere along the line.  But we demonstrably don't need
it now.)

So those considerations lead me to the attached.

regards, tom lane

diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index cb0835982d..64c979086d 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -286,11 +286,6 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 	(errcode(ERRCODE_CONFIG_FILE_ERROR),
 	 errmsg("unexpected end of line")));
 
-		/*
-		 * Note: currently, tsearch_readline can't return lines exceeding 4KB,
-		 * so overflow of the word counts is impossible.  But that may not
-		 * always be true, so let's check.
-		 */
 		if (nwrd != (uint16) nwrd || posinsubst != (uint16) posinsubst)
 			ereport(ERROR,
 	(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index 9b199d0ac1..deaafe57e6 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "catalog/pg_collation.h"
+#include "common/string.h"
 #include "storage/fd.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -128,6 +129,7 @@ tsearch_readline_begin(tsearch_readline_state *stp,
 		return false;
 	stp->filename = filename;
 	stp->lineno = 0;
+	initStringInfo(&stp->buf);
 	stp->curline = NULL;
 	/* Setup error traceback support for ereport() */
 	stp->cb.callback = tsearch_readline_callback;
@@ -145,7 +147,7 @@ tsearch_readline_begin(tsearch_readline_state *stp,
 char *
 tsearch_readline(tsearch_readline_state *stp)
 {
-	char	   *result;
+	char	   *recoded;
 
 	/* Advance line number to use in error reports */
 	stp->lineno++;
@@ -153,23 +155,31 @@ tsearch_readline(tsearch_readline_state *stp)
 	/* Clear curline, it's no longer relevant */
 	if (stp->curline)
 	{
-		pfree(stp->curline);
+		if (stp->curline != stp->buf.data)
+			pfree(stp->curline);
 		stp->curline = NULL;
 	}
 
 	/* Collect next line, if there is one */
-	result = t_readline(stp->fp);
-	if (!result)
+	if (!pg_get_line_buf(stp->fp, &stp->buf))
 		return NULL;
 
+	/* Validate the input as UTF-8, then convert to DB encoding if needed */
+	recoded = pg_any_to_server(stp->buf.data, stp->buf.len, PG_UTF8);
+
+	/* Save the correctly-encoded string for possible error reports */
+	stp->curline = recoded;		/* might be equal to buf.data */
+
 	/*
-	 * Save a copy of the line for possible use in error reports.  (We cannot
-	 * just save "result", since it's likely to get pfree'd at some point by
-	 * the caller; an error after that would try to access freed data.)
+	 * We always return a freshly pstrdup'd string.  This is clearly necessary
+	 * if pg_any_to_server() returned buf.data, and it is desirable even if
+	 * conversion occurred, because the palloc'd result of conversion is
+	 * likely to be much bigger than minimally necessary.  Since some callers
+	 * save the result string directly into a long-lived dictionary structure,
+	 * w

Re: Batching page logging during B-tree build

2020-09-23 Thread Andres Freund
Hi,

On 2020-09-23 12:02:42 -0700, Peter Geoghegan wrote:
> On Wed, Sep 23, 2020 at 11:29 AM Andres Freund  wrote:
> > Really should replace WAL compression with lz4 (or possibly zstd).
> 
> Yeah. WAL compression is generally a good idea, and we should probably
> find a way to enable it by default (in the absence of some better way
> of dealing with the FPI bottleneck, at least). I had no idea that
> compression could hurt this much with index builds until now, though.
> To be clear: the *entire* index build takes 3 times more wall clock
> time, start to finish -- if I drilled down to the portion of the index
> build that actually writes WAL then it would be an even greater
> bottleneck.

I have definitely seen this before. The faster the storage, the bigger
the issue (because there's little to be gained by reducing the volume of
WAL).  In my experience the problem tends to be bigger when there's
*less* concurrency, because without the concurrency it's much less
likely for WAL writes to be a bottleneck.


> My guess is that the compression algorithm matters a lot less with
> pure OLTP workloads.

I don't think that's quite right. A quick benchmark shows a ~1.4x
slowdown for a single client pgbench on a 1TB 970Pro during the phase
FPIs are emitted (pgbench -i -s 100, then a -T10 -c 1 run). With ~45% of
the time spent in pglz.

My experience in the past has been that wal_compression is a loser
until the first storage limits are reached, then it's a winner until CPU
time becomes the primary bottleneck, at which time the difference gets
smaller and smaller, sometimes reaching the point where wal_compression
loses again.


> I know that we've tested different compression methods in the past,
> but perhaps index build performance was overlooked.

I am pretty sure we have known that pglz for this was much much slower
than alternatives. I seem to recall somebody posting convincing numbers,
but can't find them just now.

Greetings,

Andres Freund




Re: problem with RETURNING and update row movement

2020-09-23 Thread Etsuro Fujita
On Wed, Sep 23, 2020 at 10:12 PM Amit Langote  wrote:
> On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita  
> wrote:
> > On Mon, Sep 14, 2020 at 10:45 PM Amit Langote  
> > wrote:
> > > On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita  
> > > wrote:
> > > > IIUC, I think two issues are discussed in the thread [1]: (a) there is
> > > > currently no way to define the set of meaningful system columns for a
> > > > partitioned table that contains pluggable storages other than standard
> > > > heap, and (b) even in the case where the set is defined as before,
> > > > like partitioned tables that don’t contain any pluggable storages,
> > > > system column values are not obtained from a tuple inserted into a
> > > > partitioned table in cases as reported in [1], because virtual slots
> > > > are assigned for partitioned tables [2][3].  (I think the latter is
> > > > the original issue in the thread, though.)

> > > > I think we could fix this update-tuple-routing-vs-RETURNING issue
> > > > without the fix for (a).  But to transfer system column values in a
> > > > cleaner way, I think we need to fix (b) first so that we can always
> > > > obtain/copy them from the new tuple moved to another partition by
> > > > INSERT following DELETE.
> > >
> > > Yes, I think you are right.  Although, I am still not sure how to
> > > "copy" system columns from one partition's slot to another's, that is,
> > > without assuming what they are.
> >
> > I just thought we assume that partitions support all the existing
> > system attributes until we have the fix for (a), i.e., the slot
> > assigned for a partition must have the getsysattr callback routine
> > from which we can fetch each existing system attribute of a underlying
> > tuple in the slot, regardless of whether that system attribute is used
> > for the partition or not.
>
> Hmm, to copy one slot's system attributes into another, we will also
> need a way to "set" the system attributes in the destination slot.
> But maybe I didn't fully understand what you said.

Sorry, my thought was vague.  To store xmin/xmax/cmin/cmax into a
given slot, we need to extend the TupleTableSlot struct to contain
these attributes as well?  Or we need to introduce a new callback
routine for that (say, setsysattr)?  These would not be
back-patchable, though.
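
For illustration only, a helper along the lines of that second idea could
look like the sketch below. The setsysattr callback does not exist in
TupleTableSlotOps today (only getsysattr does), so this is purely
hypothetical:

/* Hypothetical: copy the standard system attributes from one slot to another. */
static void
copy_system_attributes(TupleTableSlot *src, TupleTableSlot *dst)
{
    static const int sysattrs[] = {
        MinTransactionIdAttributeNumber,    /* xmin */
        MaxTransactionIdAttributeNumber,    /* xmax */
        MinCommandIdAttributeNumber,        /* cmin */
        MaxCommandIdAttributeNumber         /* cmax */
    };

    for (int i = 0; i < lengthof(sysattrs); i++)
    {
        bool    isnull;
        Datum   value;

        value = src->tts_ops->getsysattr(src, sysattrs[i], &isnull);
        /* setsysattr would be the new, not-yet-existing callback */
        dst->tts_ops->setsysattr(dst, sysattrs[i], value, isnull);
    }
}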

> > I also created a patch for PG11, which I am attaching
> > as well.
>
> In the patch for PG 11:
>
> +   new_tuple->t_self = new_tuple->t_data->t_ctid =
> +   old_tuple->t_self;
> ...
>
> Should we add a one-line comment above this block of code to transfer
> system attributes?  Maybe: /* Also transfer the system attributes. */?

Will add that comment.  Thanks for reviewing!

> BTW, do you think we should alter the test in PG 11 branch to test
> that system attributes are returned correctly?

Yeah, I think so.  I didn’t come up with any good test cases for that, though.

> Once we settle the
> other issue, we can adjust the HEAD's test to do likewise?

Yeah, but for the other issue, I started thinking that we should just
forbid referencing xmin/xmax/cmin/cmax in 12, 13, and HEAD...

Best regards,
Etsuro Fujita




Re: Batching page logging during B-tree build

2020-09-23 Thread Peter Geoghegan
On Wed, Sep 23, 2020 at 11:29 AM Andres Freund  wrote:
> I wonder what the effect of logging WAL records as huge as this (~256kb)
> is on concurrent sessions. I think it's possible that logging 32 pages
> at once would cause latency increases for concurrent OLTP-ish
> writes. And that a smaller batch size would reduce that, while still
> providing most of the speedup.

Something to consider, but I cannot see any speedup, and so have no
way of evaluating the idea right now.

> > It doesn't seem to make any difference on my machine, which has an
> > NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional.
>
> Yea, they are surprisingly slow at syncing, somewhat disappointing for
> the upper end of the consumer oriented devices.

I hesitate to call anything about this SSD disappointing, since
overall it performs extremely well -- especially when you consider the
price tag.

Detachable storage is a trend that's here to stay, so the storage
latency is still probably a lot lower than what you'll see on many
serious production systems. For better or worse.

> Really should replace WAL compression with lz4 (or possibly zstd).

Yeah. WAL compression is generally a good idea, and we should probably
find a way to enable it by default (in the absence of some better way
of dealing with the FPI bottleneck, at least). I had no idea that
compression could hurt this much with index builds until now, though.
To be clear: the *entire* index build takes 3 times more wall clock
time, start to finish -- if I drilled down to the portion of the index
build that actually writes WAL then it would be an even greater
bottleneck.

I know that we've tested different compression methods in the past,
but perhaps index build performance was overlooked. My guess is that
the compression algorithm matters a lot less with pure OLTP workloads.
Also, parallel CREATE INDEX may be a bit of an outlier here. Even
still, it's a very important outlier.

-- 
Peter Geoghegan




Re: Batching page logging during B-tree build

2020-09-23 Thread Andres Freund
Hi,

On 2020-09-23 11:19:18 -0700, Peter Geoghegan wrote:
> On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin  
> wrote:
> > Here is PoC with porting that same routine to B-tree. It allows to build 
> > B-trees ~10% faster on my machine.

I wonder what the effect of logging WAL records as huge as this (~256kb)
is on concurrent sessions. I think it's possible that logging 32 pages
at once would cause latency increases for concurrent OLTP-ish
writes. And that a smaller batch size would reduce that, while still
providing most of the speedup.


> It doesn't seem to make any difference on my machine, which has an
> NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> sync time isn't exceptional.

Yea, they are surprisingly slow at syncing, somewhat disappointing for
the upper end of the consumer oriented devices.


> BTW, I noticed that the index build is absurdly bottlenecked on
> compressing WAL with wal_compression=on. It's almost 3x slower with
> compression turned on!

Really should replace WAL compression with lz4 (or possibly zstd).

Greetings,

Andres Freund




Re: Batching page logging during B-tree build

2020-09-23 Thread Peter Geoghegan
On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin  wrote:
> Here is PoC with porting that same routine to B-tree. It allows to build 
> B-trees ~10% faster on my machine.

It doesn't seem to make any difference on my machine, which has an
NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
sync time isn't exceptional. My test case is "reindex index
pgbench_accounts_pkey", with pgbench scale 500. I thought that this
would be a sympathetic case, since it's bottlenecked on writing the
index, with relatively little time spent scanning and sorting in
parallel workers.

Can you provide a test case that is sympathetic towards the patch?

BTW, I noticed that the index build is absurdly bottlenecked on
compressing WAL with wal_compression=on. It's almost 3x slower with
compression turned on!

-- 
Peter Geoghegan




Re: [PoC] Non-volatile WAL buffer

2020-09-23 Thread Takashi Menjo
Hello Gang,

Thank you for your report. I have not looked into record size deeply yet,
so your report is very interesting. I will also run a test like yours and then
post the results here.

Regards,
Takashi


2020年9月21日(月) 14:14 Deng, Gang :

> Hi Takashi,
>
>
>
> Thank you for the patch and work on accelerating PG performance with NVM.
> I applied the patch and made some performance test based on the patch v4. I
> stored database data files on NVMe SSD and stored WAL file on Intel PMem
> (NVM). I used two methods to store WAL file(s):
>
> 1.  Leverage your patch to access PMem with libpmem (NVWAL patch).
>
> 2.  Access PMem with legacy filesystem interface, that means use PMem
> as ordinary block device, no PG patch is required to access PMem (Storage
> over App Direct).
>
>
>
> I tried two insert scenarios:
>
> A. Insert small record (length of record to be inserted is 24 bytes),
> I think it is similar as your test
>
> B.  Insert large record (length of record to be inserted is 328 bytes)
>
>
>
> My original purpose is to see higher performance gain in scenario B as it
> is more write intensive on WAL. But I observed that NVWAL patch method had
> ~5% performance improvement compared with Storage over App Direct method in
> scenario A, while had ~20% performance degradation in scenario B.
>
>
>
> I made further investigation on the test. I found that NVWAL patch can
> improve performance of XlogFlush function, but it may impact performance of
> CopyXlogRecordToWAL function. It may be related to the higher latency of
> memcpy to Intel PMem comparing with DRAM. Here are key data in my test:
>
>
>
> Scenario A (length of record to be inserted: 24 bytes per record):
> ==================================================================
>
>                                        NVWAL      SoAD
>                                        -------    -------
> Throughput (10^3 TPS)                  310.5      296.0
> CPU Time % of CopyXlogRecordToWAL      0.4        0.2
> CPU Time % of XLogInsertRecord         1.5        0.8
> CPU Time % of XLogFlush                2.1        9.6
>
>
> Scenario B (length of record to be inserted: 328 bytes per record):
> ==================================================================
>
>                                        NVWAL      SoAD
>                                        -------    -------
> Throughput (10^3 TPS)                  13.0       16.9
> CPU Time % of CopyXlogRecordToWAL      3.0        1.6
> CPU Time % of XLogInsertRecord         23.0       16.4
> CPU Time % of XLogFlush                2.3        5.9
>
>
>
> Best Regards,
>
> Gang
>
>
>
> *From:* Takashi Menjo 
> *Sent:* Thursday, September 10, 2020 4:01 PM
> *To:* Takashi Menjo 
> *Cc:* pgsql-hack...@postgresql.org
> *Subject:* Re: [PoC] Non-volatile WAL buffer
>
>
>
> Rebased.
>
>
>
>
>
> 2020年6月24日(水) 16:44 Takashi Menjo :
>
> Dear hackers,
>
> I update my non-volatile WAL buffer's patchset to v3.  Now we can use it
> in streaming replication mode.
>
> Updates from v2:
>
> - walreceiver supports non-volatile WAL buffer
> Now walreceiver stores received records directly to non-volatile WAL
> buffer if applicable.
>
> - pg_basebackup supports non-volatile WAL buffer
> Now pg_basebackup copies received WAL segments onto non-volatile WAL
> buffer if you run it with "nvwal" mode (-Fn).
> You should specify a new NVWAL path with --nvwal-path option.  The path
> will be written to postgresql.auto.conf or recovery.conf.  The size of the
> new NVWAL is same as the master's one.
>
>
> Best regards,
> Takashi
>
> --
> Takashi Menjo 
> NTT Software Innovation Center
>
> > -Original Message-
> > From: Takashi Menjo 
> > Sent: Wednesday, March 18, 2020 5:59 PM
> > To: 'PostgreSQL-development' 
> > Cc: 'Robert Haas' ; 'Heikki Linnakangas' <
> hlinn...@iki.fi>; 'Amit Langote'
> > 
> > Subject: RE: [PoC] Non-volatile WAL buffer
> >
> > Dear hackers,
> >
> > I rebased my non-volatile WAL buffer's patchset onto master.  A new v2
> patchset is attached to this mail.
> >
> > I also measured performance before and after patchset, varying
> -c/--client and -j/--jobs options of pgbench, for
> > each scaling factor s = 50 or 1000.  The results are presented in the
> following tables and the attached charts.
> > Conditions, steps, and other details will be shown later.
> >
> >
> > Results (s=50)
> > ==
> >  Throughput [10^3 TPS]  Average latency [ms]
> > ( c, j)  before  after  before  after
> > ---  -  -
> > ( 8, 8)  35.737.1 (+3.9%)   0.224   0.216 (-3.6%)
> > (18,18)  70.974.7 (+5.3%)   0.254   0.241 (-5.1%)
> > (36,18)  76.080.8 (+6.3%)   0.473   0.446 (-5.7%)
> > (54,18)  75.581.8 (+8.3%)   0.715   0.660 (-7.7%)
> >
> >
> > Results (s=1000)
> > 

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-09-23 Thread Alvaro Herrera
On 2020-Sep-23, Amit Langote wrote:

Hi Amit, thanks for reviewing this patch!

> On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera  
> wrote:

> I suspect that we don't really need this defensive constraint.  I mean
> even after committing the 1st transaction, the partition being
> detached still has relispartition set to true and
> RelationGetPartitionQual() still returns the partition constraint, so
> it seems the constraint being added above is duplicative.

Ah, thanks for thinking through that.  I had vague thoughts about this
being unnecessary in the current mechanics, but hadn't fully
materialized the thought.  (The patch was completely different a few
unposted iterations ago).

> Moreover, the constraint is not removed as part of any cleaning up
> after the DETACH process, so repeated attach/detach of the same
> partition results in the constraints piling up:

Yeah, I knew about this issue (mentioned in my self-reply on Aug 4) and
didn't worry too much about it because I was thinking I'd rather get rid
of the constraint addition in the first place.

> Noticed a bug/typo in the patched RelationBuildPartitionDesc():
> 
> -   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
> +   inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
> +   include_detached);
> 
> You're passing NoLock for include_detached which means you never
> actually end up including detached partitions from here.

I fixed this in the version I posted on Sept 10.  I think you were
reading the version posted at the start of this thread.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Probable documentation errors or improvements

2020-09-23 Thread Justin Pryzby
On Sat, Sep 19, 2020 at 12:55:58PM -0500, Justin Pryzby wrote:
> On Thu, Sep 10, 2020 at 12:19:55PM -0700, Yaroslav wrote:
> > Disclaimer: I'm not a native speaker, so not sure those are actually
> > incorrect, and can't offer non-trivial suggestions.
> 
> https://www.postgresql.org/message-id/flat/CAKFQuwZh3k_CX2-%2BNcZ%3DFZss4bX6ASxDFEXJTY6u4wTH%2BG8%2BKA%40mail.gmail.com#9f9eba0cbbf9b57455503537575f5339
> 
> Most of these appear to be reasonable corrections or improvements.
> 
> Would you want to submit a patch against the master branch ?
> It'll be easier for people to read when it's in a consistent format.
> And less work to read it than to write it.
> 
> I suggest to first handle the 10+ changes which are most important and in need
> of fixing.  After a couple rounds, then see if what's left is worth patching.

A couple of these were already fixed (querytree and "be appear").

I provide a patch for a few others.

Copying Teodor et al regarding tsquery.

wdiff to follow.

commit 6c5a4e6fde2841f6a2ad00f3cc4530e7fcf9a9f3
Author: Justin Pryzby 
Date:   Tue Sep 22 23:34:54 2020 -0500

fix tsquery example

broken since bb140506df605fab58f48926ee1db1f80bdafb59

diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 8af6ac01d3..7daf292468 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1623,9 +1623,9 @@ occurrences to display in the result.',


SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
  ?column?
[-]{++}
 'fat' <-> {+(+} 'cat' |[-'fat' <->-] 'rat' {+)+}

  
 

commit 096a9097d58de8e2d21e42fab528b27a5213a699
Author: Justin Pryzby 
Date:   Tue Sep 22 23:13:00 2020 -0500

HAVING conditions cannot be repeated

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index b93e4ca208..94889b66b4 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -38,7 +38,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionfrom_item [, ...] ]
[ WHERE condition ]
[ GROUP BY grouping_element [, 
...] ]
[ HAVING condition[-[, ...]-] ]
[ WINDOW window_name AS ( 
window_definition ) [, ...] ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | 
DESC | USING operator ] [ NULLS { 
FIRST | LAST } ] [, ...] ]
diff --git a/doc/src/sgml/ref/select_into.sgml 
b/doc/src/sgml/ref/select_into.sgml
index b1af52a4da..e82e416d60 100644
--- a/doc/src/sgml/ref/select_into.sgml
+++ b/doc/src/sgml/ref/select_into.sgml
@@ -28,7 +28,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionfrom_item [, ...] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition[-[, ...]-] ]
[ WINDOW window_name AS ( 
window_definition ) [, ...] ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | 
DESC | USING operator ] [ NULLS { 
FIRST | LAST } ] [, ...] ]

commit 1e16102f906d9ead33ebdfc0b6f5d862e851131b
Author: Justin Pryzby 
Date:   Tue Sep 22 22:24:14 2020 -0500

style and consistency

diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 390c49eb6a..649020b7da 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -612,7 +612,7 @@ amgettuple (IndexScanDesc scan,
   will pass the caller's snapshot test.  On success, 
amgettuple
   must also set scan->xs_recheck to true or false.
   False means it is certain that the index entry matches the scan keys.
   [-true-]{+True+} means this is not certain, and the conditions represented 
by the
   scan keys must be rechecked against the heap tuple after fetching it.
   This provision supports lossy index operators.
   Note that rechecking will extend only to the scan conditions; a partial

commit 44c7efab0499644060c19868d4c431007e8cccaa
Author: Justin Pryzby 
Date:   Tue Sep 22 22:55:59 2020 -0500

distro agnostic

diff --git a/doc/src/sgml/sepgsql.sgml b/doc/src/sgml/sepgsql.sgml
index 9961569afc..15417bf19d 100644
--- a/doc/src/sgml/sepgsql.sgml
+++ b/doc/src/sgml/sepgsql.sgml
@@ -88,7 +88,7 @@ Policy from config file:targeted
  
   To build this module, include the option --with-selinux in
   your PostgreSQL configure command.  Be sure that the
   libselinux-devel [-RPM-]{+package+} is installed at 
build time.
  

  
@@ -177,7 +177,7 @@ $ for DBNAME in template0 template1 postgres; do
   Makefile on your system; the path shown below is only an example.
   (This Makefile is usually supplied by the
   selinux-policy-devel or
   selinux-policy [-RPM.)-]{+package.)+}
   Once built, install this policy package using the
   semodule command, which loads supplied policy packages
   into the kernel.  If the package is correctly installed,

commit 1584c0dd6a5183b915811c89af08f135479509ab
Author: Justin Pryzby 
Date:   Tue Sep 22 22:55:48 2020 -0500

grammar

diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index 5c8d4d5275..67754f52

Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-09-23 Thread Fujii Masao




On 2020/09/21 12:44, Bharath Rupireddy wrote:

On Thu, Sep 17, 2020 at 10:20 PM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
 >
 > In 1st way, you may also need to call ReleaseExternalFD() when new 
connection fails
 > to be made, as connect_pg_server() does. Also we need to check that
 > non-superuser has used password to make new connection,
 > as connect_pg_server() does? I'm concerned about the case where
 > pg_hba.conf is changed accidentally so that no password is necessary
 > at the remote server and the existing connection is terminated. In this case,
 > if we connect to the local server as non-superuser, connection to
 > the remote server should fail because the remote server doesn't
 > require password. But with your patch, we can successfully reconnect
 > to the remote server.
 >
 > Therefore I like 2nd option. Also maybe disconnect_ps_server() needs to
 > be called before that. I'm not sure how much useful 1st option is.
 >

Thanks. The above points look sensible. +1 for the 2nd option, i.e., instead of
PQreset(entry->conn), let's try disconnect_pg_server() followed by
connect_pg_server().

 >
 > What if 2nd attempt happens with have_prep_stmt=true? I'm not sure
 > if this case really happens, though. But if that can, it's strange to start
 > new connection with have_prep_stmt=true even when the caller of
 > GetConnection() doesn't intend to create any prepared statements.
 >
 > I think it's safer to do 2nd attempt in the same way as 1st one. Maybe
 > we can simplify the code by making them into common code block
 > or function.
 >

I don't think have_prep_stmt will be set by the time we make the 2nd attempt, because
entry->have_prep_stmt |= will_prep_stmt; is reached only after we have succeeded in
either the 1st or the 2nd attempt. I think we don't need to reset all transient state.
No other transient state except changing_xact_state changes between the 1st attempt and
the 2nd attempt (see below), so let's set only entry->changing_xact_state to false before
the 2nd attempt.

1st attempt:
(gdb) p *entry
$3 = {key = 16389, conn = 0x55a89610, xact_depth = 0, have_prep_stmt = 
false,
   have_error = false, changing_xact_state = false, invalidated = false,
   server_hashvalue = 3905865521, mapping_hashvalue = 2617776010}

2nd attempt i.e. in retry block:
(gdb) p *entry
$4 = {key = 16389, conn = 0x55a89610, xact_depth = 0, have_prep_stmt = 
false,
   have_error = false, changing_xact_state = true, invalidated = false,
   server_hashvalue = 3905865521, mapping_hashvalue = 2617776010}

 >>
 > > If an error occurs in the first attempt, we return from
 > > pgfdw_get_result()'s  if (!PQconsumeInput(conn)) to the catch block we
 > > added and pgfdw_report_error() will never get called. And the below
 > > part of the code is reached only in scenarios as mentioned in the
 > > comments. Removing this might have problems if we receive errors other
 > > than CONNECTION_BAD or for subtxns. We could clear if any result and
 > > just rethrow the error upstream. I think no problem having this code
 > > here.
 >
 > But the orignal code works without this?
 > Or you mean that the original code has the bug?
 >

There's no bug in the original code. Sorry, I was wrong in saying
pgfdw_report_error() will never get called with this patch. Indeed it gets
called even when the 1st connection attempt fails. Since we added an extra
try-catch block, we will not throw the error to the user; instead we make
a 2nd attempt from the catch block.
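
To make that shape concrete, a rough sketch of the agreed direction inside
GetConnection() in contrib/postgres_fdw/connection.c might look like the
following (not the final patch; error-class checks and other cleanup details
are omitted):

    PG_TRY();
    {
        /* 1st attempt on the cached connection */
        begin_remote_xact(entry);
    }
    PG_CATCH();
    {
        /* a real patch would verify this is a connection failure first */
        FlushErrorState();

        /* drop the broken connection and reconnect from scratch */
        disconnect_pg_server(entry);
        entry->changing_xact_state = false; /* only transient state that changed */
        entry->conn = connect_pg_server(server, user);

        /* 2nd attempt; if this fails too, the error propagates as usual */
        begin_remote_xact(entry);
    }
    PG_END_TRY();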

I'm okay to remove below part of the code

 > >> +                       PGresult *res = NULL;
 > >> +                       res = PQgetResult(entry->conn);
 > >> +                       PQclear(res);
 > >> Are these really necessary? I was just thinking that's not because
 > >> pgfdw_get_result() and pgfdw_report_error() seem to do that
 > >> already in do_sql_command().

Please let me know if okay with the above agreed points, I will work on the new 
patch.


Yes, please work on the patch! Thanks! I may revisit the above points later, 
though ;)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)

2020-09-23 Thread Fujii Masao




On 2020/09/23 12:47, Thomas Munro wrote:

On Wed, Sep 23, 2020 at 2:27 PM David Rowley  wrote:

I've gone as far as running the recovery tests on the v3-0001 patch
using a Windows machine. They pass:


Thanks!  I pushed that one, because it was effectively a bug fix
(WaitLatch() without a latch was supposed to work).


Great!




I'll wait longer for feedback on the main patch; perhaps someone has a
better idea, or wants to take issue with the magic number 1024 (ie
limit on how many records we'll replay before we notice the postmaster
has exited), or my plan to harmonise those wait loops?  It has a CF
entry for the next CF.


Does this patch work fine with warm-standby case using pg_standby?
IIUC the startup process doesn't call WaitLatch() in that case, so ISTM that,
with the patch, it cannot detect the postmaster death immediately.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Improper use about DatumGetInt32

2020-09-23 Thread Ashutosh Bapat
On Wed, Sep 23, 2020 at 1:41 AM Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Mon, Sep 21, 2020 at 3:53 PM Andres Freund  wrote:
> >> I think we mostly use it for the few places where we currently expose
> >> data as a signed integer on the SQL level, but internally actually treat
> >> it as a unsigned data.
>
> > So why is the right solution to that not DatumGetInt32() + a cast to uint32?
>
> You're ignoring the xid use-case, for which DatumGetUInt32 actually is
> the right thing.

There is DatumGetTransactionId() which should be used instead.
That made me search if there's PG_GETARG_TRANSACTIONID() and yes it's
there but only defined in xid.c. So pg_xact_commit_timestamp(),
pg_xact_commit_timestamp_origin() and pg_get_multixact_members() use
PG_GETARG_UINT32. IMO those should be changed to use
PG_GETARG_TRANSACTIONID. That would require moving
PG_GETARG_TRANSACTIONID somewhere outside xid.c; may be fmgr.h where
other PG_GETARG_* are.
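
For reference, the macro in question is essentially a one-liner in
src/backend/utils/adt/xid.c, so hoisting it into fmgr.h next to the other
PG_GETARG_* macros would be straightforward (reproduced from memory, so
double-check against the source):

#define PG_GETARG_TRANSACTIONID(n)  DatumGetTransactionId(PG_GETARG_DATUM(n))

/* with that visible from fmgr.h, callers could then write e.g. */
TransactionId   xid = PG_GETARG_TRANSACTIONID(0);   /* instead of PG_GETARG_UINT32(0) */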

> I tend to agree though that if the SQL argument is
> of a signed type, the least API-abusing answer is a signed DatumGetXXX
> macro followed by whatever cast you need.
>

I looked for some uses of PG_GETARG_UINT32(), which is the counterpart
of DatumGetUInt32(). Found some buggy usages apart from the ones which
can be converted to PG_GETARG_TRANSACTIONID listed above.
normal_rand() for example returns a huge number of rows and takes
forever if we pass a negative first argument to it. Someone could
misuse that for a DOS attack or it could be just an accident that they
pass a negative value to that function and the query takes forever.
explain analyze select count(*) from normal_rand(-100, 1.0, 1.0);
   QUERY
PLAN
-
 Aggregate  (cost=12.50..12.51 rows=1 width=8) (actual
time=2077574.718..2077574.719 rows=1 loops=1)
   ->  Function Scan on normal_rand  (cost=0.00..10.00 rows=1000
width=0) (actual time=1005176.149..1729994.366 rows=4293967296
loops=1)
 Planning Time: 0.346 ms
 Execution Time: 2079034.835 ms

get_raw_page() also does a similar thing, but the effect is not as dangerous:
SELECT octet_length(get_raw_page('test1', 'main', -1)) AS main_1;
  ERROR:  block number 4294967295 is out of range for relation "test1"
Similarly for bt_page_stats() and bt_page_items()

PFA patches to correct those.
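
Since the tablefunc.c hunk of the attached patch is not quoted below, here is
an approximate sketch of the kind of guard described for normal_rand() (read
the argument as signed and return no rows for a negative request); the actual
patch may differ in details:

    /* in normal_rand()'s first-call setup, instead of PG_GETARG_UINT32(0) */
    int32       num_tuples = PG_GETARG_INT32(0);

    if (num_tuples < 0)
        num_tuples = 0;         /* negative request => return no rows */

    funcctx->max_calls = num_tuples;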

There's Oracle compatible chr() which also uses PG_GETARG_UINT32() but
it's (accidentally?) reporting the negative inputs correctly because
it filters out very large values and reports those using %d. It's
arguable whether we should change that, so I have left it untouched.
But I think we should change that as well and get rid of
PG_GETARG_UINT32 altogether. This will prevent any future misuse.


--
Best Wishes,
Ashutosh Bapat
From e13eeafc452a977a1e5b9e6c51b043f00dd2d5c7 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 22 Sep 2020 19:00:56 +0530
Subject: [PATCH 1/2] Handle negative number of tuples passed to normal_rand()

The function converts the first argument i.e. the number of tuples to
return into an unsigned integer which turns out to be huge number when a
negative value is passed. This causes the function to take much longer
time to execute. Instead return no rows in that case.

While at it, improve SQL test to test the number of tuples returned by
this function.

Ashutosh Bapat
---
 contrib/tablefunc/expected/tablefunc.out | 15 +++
 contrib/tablefunc/sql/tablefunc.sql  |  4 +++-
 contrib/tablefunc/tablefunc.c|  4 +++-
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index fffadc6e1b..1c9ecef52c 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -3,10 +3,17 @@ CREATE EXTENSION tablefunc;
 -- normal_rand()
 -- no easy way to do this for regression testing
 --
-SELECT avg(normal_rand)::int FROM normal_rand(100, 250, 0.2);
- avg 
--
- 250
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
+ avg | count 
+-+---
+ 250 |   100
+(1 row)
+
+-- negative number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
+ avg | count 
+-+---
+ | 0
 (1 row)
 
 --
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index ec375b05c6..02e8a98c73 100644
--- a/contrib/tablefunc/sql/tablefunc.sql
+++ b/contrib/tablefunc/sql/tablefunc.sql
@@ -4,7 +4,9 @@ CREATE EXTENSION tablefunc;
 -- normal_rand()
 -- no easy way to do this for regression testing
 --
-SELECT avg(normal_rand)::int FROM normal_rand(100, 250, 0.2);
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
+-- negative number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 
 --
 -- crosstab

Re: new heapcheck contrib module

2020-09-23 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Tue, Sep 22, 2020 at 12:41 PM Robert Haas  wrote:
> > But now I see that there's no secondary permission check in the
> > verify_nbtree.c code. Is that intentional? Peter, what's the
> > justification for that?
> 
> As noted by comments in contrib/amcheck/sql/check_btree.sql (the
> verify_nbtree.c tests), this is intentional. Note that we explicitly
> test that a non-superuser role can perform verification following
> GRANT EXECUTE ON FUNCTION ... .

> As I mentioned earlier, this is supported (or at least it is supported
> in my interpretation of things). It just isn't documented anywhere
> outside the test itself.

Would certainly be good to document this but I tend to agree with the
comments that ideally-

a) it'd be nice for a relatively low-privileged user/process could run
   the tests in an ongoing manner
b) we don't want to add more is-superuser checks
c) users shouldn't really be given the ability to see rows they're not
   supposed to have access to

In other places in the code, when an error is generated and the user
doesn't have access to the underlying table or doesn't have BYPASSRLS,
we don't include the details or the actual data in the error.  Perhaps
that approach would make sense here (or perhaps not, but it doesn't seem
entirely crazy to me, anyway).  In other words:

a) keep the ability for someone who has EXECUTE on the function to be
   able to run the function against any relation
b) when we detect an issue, perform a permissions check to see if the
   user calling the function has rights to read the rows of the table
   and, if RLS is enabled on the table, if they have BYPASSRLS
c) if the user has appropriate privileges, log the detailed error, if
   not, return a generic error with a HINT that details weren't
   available due to lack of privileges on the relation
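
As a sketch of how (b)/(c) might look at the point where a corruption report
is about to be emitted (pg_class_aclcheck() and check_enable_rls() exist in
the backend already; how exactly they would be combined here is only an
illustration):

static bool
user_may_see_row_data(Relation rel)
{
    Oid     relid = RelationGetRelid(rel);

    /* table-level read privilege */
    if (pg_class_aclcheck(relid, GetUserId(), ACL_SELECT) != ACLCHECK_OK)
        return false;

    /* if RLS applies to this user on this table, don't leak row contents */
    if (check_enable_rls(relid, InvalidOid, true) == RLS_ENABLED)
        return false;

    return true;
}

/* ...and include tuple details in the report only when this returns true;
 * otherwise emit the generic message plus the HINT described above. */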

I can appreciate the concerns regarding dead rows ending up being
visible to someone who wouldn't normally be able to see them but I'd
argue we could simply document that fact rather than try to build
something to address it, for this particular case.  If there's push back
on that then I'd suggest we have a "can read dead rows" or some such
capability that can be GRANT'd (in the form of a default role, I would
think) which a user would also have to have in order to get detailed
error reports from this function.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Report error position in partition bound check

2020-09-23 Thread Amit Langote
On Wed, Sep 23, 2020 at 10:22 PM Ashutosh Bapat
 wrote:
> On Wed, 23 Sep 2020 at 14:41, Amit Langote  wrote:
> Setting this CF entry as "RFC". Thanks.

Great, thanks for your time on this.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-09-23 Thread Ashutosh Bapat
On Wed, Sep 23, 2020 at 2:13 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Ashutosh Bapat 
> > parallelism here has both pros and cons. If one of the servers errors
> > out while preparing for a transaction, there is no point in preparing
> > the transaction on other servers. In parallel execution we will
> > prepare on multiple servers before realising that one of them has
> > failed to do so. On the other hand preparing on multiple servers in
> > parallel provides a speed up.
>
> And pros are dominant in practice.  If many transactions are erroring out 
> (during prepare), the system is not functioning for the user.  Such an 
> application should be corrected before they are put into production.
>
>
> > But this can be an improvement on version 1. The current approach
> > doesn't render such an improvement impossible. So if that's something
> > hard to do, we should do that in the next version rather than
> > complicating this patch.
>
> Could you share your idea on how the current approach could enable 
> parallelism?  This is an important point, because (1) the FDW may not lead us 
> to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or 
> implementation could be considered for non-parallel interaction if we have 
> the realization of parallelism in mind.  I think that kind of consideration 
> is the design (for the future).
>

The way I am looking at it is to put the parallelism in the resolution
worker and not in the FDW. If we use multiple resolution workers, they
can fire commit/abort on multiple foreign servers at a time.

But if we want parallelism within a single resolution worker, we will
need separate FDW APIs for firing an asynchronous commit/abort of a prepared
transaction and for fetching its result. But given the variety of FDWs,
not all of them will support an asynchronous API, so we have to support
a synchronous API anyway, which is what can be targeted in the first
version.

Thinking more about it, the core may support an API which accepts a
list of prepared transactions, their foreign servers and user mappings
and let FDW resolve all those either in parallel or one by one. So
parallelism is the responsibility of the FDW and not the core. But then we
lose parallelism across FDWs, which may not be a common case.

Given the complications around this, I think we should go ahead
supporting synchronous API first and in second version introduce
optional asynchronous API.
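
Purely as an illustration of that split (none of these callbacks exist today,
and the names and signatures are made up), the synchronous-plus-optional-
asynchronous API could be shaped roughly like:

/* Mandatory: synchronously resolve (commit or abort) one prepared transaction. */
typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
                                                            Oid userid,
                                                            const char *gid,
                                                            bool commit);

/* Optional pair: fire the request without waiting ... */
typedef void (*IssuePreparedForeignTransaction_function) (Oid serverid,
                                                          Oid userid,
                                                          const char *gid,
                                                          bool commit);

/* ... and later collect its result. */
typedef bool (*AwaitPreparedForeignTransaction_function) (Oid serverid,
                                                          Oid userid,
                                                          const char *gid);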

-- 
Best Wishes,
Ashutosh Bapat




Re: Report error position in partition bound check

2020-09-23 Thread Ashutosh Bapat
On Wed, 23 Sep 2020 at 14:41, Amit Langote  wrote:

> Thanks Ashutosh.
>
> On Fri, Sep 18, 2020 at 7:33 PM Ashutosh Bapat
>  wrote:
> > Thanks Amit for addressing comments.
> >
> > @@ -4256,5 +4256,8 @@ transformPartitionBoundValue(ParseState *pstate,
> Node *val,
> >   if (!IsA(value, Const))
> >   elog(ERROR, "could not evaluate partition bound expression");
> >
> > + /* Preserve parser location information. */
> > + ((Const *) value)->location = exprLocation(val);
> > +
> >   return (Const *) value;
> >  }
> >
> > This caught my attention and I was wondering whether transformExpr()
> itself should transfer the location from input expression to the output
> expression. Some minions of transformExprRecurse() seem to be doing that.
> The change here may be an indication that some of them are not doing this.
> In that case may be it's better to find those and fix rather than a
> white-wash fix here. In what case did we find that location was not set by
> transformExpr? Sorry for not catching this earlier.
>
> AFAICS, transformExpr() is fine.  What loses the location value is the
> unconditional evaluate_expr() call which generates a fresh Const node,
> possibly after evaluating a non-Const expression that is passed to it.
> I don't find it very desirable to change evaluate_expr() to accept a
> location value, because other callers of it don't seem to care.
> Instead, in the updated patch, I have made calling evaluate_expr()
> conditional on the expression at hand being a non-Const node and
> assign location by hand on return.  If the expression is already
> Const, we don't need to update the location field as it should already
> be correct.  Though, I did notice that the evaluate_expr() call has an
> additional responsibility which is to pass the partition key specified
> collation to the bound expression, so we should not fail to update an
> already-Const node's collation likewise.
>

Thanks for the detailed explanation. I am not sure whether it is better to skip
the evaluate_expr() call for a constant or to just reassign the location, but
this looks better than the last patch.


> > /* New lower bound is certainly >= bound at offet. */
> > offet/offset? But this comment is implied by the comment just two lines
> above. So I am not sure it's really needed.
>
> Given that cmpval is set all the way in partition_range_bsearch(), I
> thought it would help to clarify why this code can assume it must be
> >= 0.  It is because a valid offset returned by
> partition_range_bsearch() must correspond to a bound that it found to
> be <= the probe bound passed to it.
>

> > /* Fetch the problem bound from lower datums list. */
> > This is fetching problematic key value rather than the whole problematic
> bound. I think the comment would be useful if it explains why cmpval -1 th
> key is problematic but then that's evident from the prologue of
> partition_rbound_cmp() so I am not sure if this comment is really required.
> For example, we aren't adding a comment here
> > + overlap_location = ((PartitionRangeDatum *)
> > + list_nth(spec->upperdatums, -cmpval - 1))->location;
>
> In the attached updated patch, I have tried to make the code and
> comments for different cases consistent.  Please have a look.
>
>

The comments look okay to me. I don't see a way to keep them short and yet
avoid reading the prologue of partition_range_bsearch(). And there is no
point in repeating a portion of that prologue at multiple places. So I am
fine with these set of comments.

Setting this CF entry as "RFC". Thanks.

-- 
Best Wishes,
Ashutosh


Re: problem with RETURNING and update row movement

2020-09-23 Thread Amit Langote
Fujita-san,

On Sun, Sep 20, 2020 at 11:41 PM Etsuro Fujita  wrote:
> On Mon, Sep 14, 2020 at 10:45 PM Amit Langote  wrote:
> > On Mon, Sep 14, 2020 at 5:56 PM Etsuro Fujita  
> > wrote:
> > > IIUC, I think two issues are discussed in the thread [1]: (a) there is
> > > currently no way to define the set of meaningful system columns for a
> > > partitioned table that contains pluggable storages other than standard
> > > heap, and (b) even in the case where the set is defined as before,
> > > like partitioned tables that don’t contain any pluggable storages,
> > > system column values are not obtained from a tuple inserted into a
> > > partitioned table in cases as reported in [1], because virtual slots
> > > are assigned for partitioned tables [2][3].  (I think the latter is
> > > the original issue in the thread, though.)
> >
> > Right, (b) can be solved by using a leaf partition's tuple slot as
> > proposed.
>
> You mean what is proposed in [3]?

Yes.  Although, I am for assigning a dedicated slot to partitions
*unconditionally*, whereas the PoC patch Andres shared makes it
conditional on either needing tuple conversion between the root and
the partition or having a RETURNING projection present in the query.

> > > I think we could fix this update-tuple-routing-vs-RETURNING issue
> > > without the fix for (a).  But to transfer system column values in a
> > > cleaner way, I think we need to fix (b) first so that we can always
> > > obtain/copy them from the new tuple moved to another partition by
> > > INSERT following DELETE.
> >
> > Yes, I think you are right.  Although, I am still not sure how to
> > "copy" system columns from one partition's slot to another's, that is,
> > without assuming what they are.
>
> I just thought we assume that partitions support all the existing
> system attributes until we have the fix for (a), i.e., the slot
> assigned for a partition must have the getsysattr callback routine
> from which we can fetch each existing system attribute of a underlying
> tuple in the slot, regardless of whether that system attribute is used
> for the partition or not.

Hmm, to copy one slot's system attributes into another, we will also
need a way to "set" the system attributes in the destination slot.
But maybe I didn't fully understand what you said.

> > > (I think we could address this issue for v11 independently of [1], off 
> > > course.)
> >
> > Yeah.
>
> I noticed that I modified your patch incorrectly.  Sorry for that.
> Attached is an updated patch fixing that.

Ah, no problem.  Thanks for updating.

>  I also added a bit of code
> to copy CTID from the new tuple moved to another partition, not just
> table OID, and did some code/comment adjustments, mainly to match
> other places.  I also created a patch for PG11, which I am attaching
> as well.

In the patch for PG 11:

+   new_tuple->t_self = new_tuple->t_data->t_ctid =
+   old_tuple->t_self;
...

Should we add a one-line comment above this block of code to transfer
system attributes?  Maybe: /* Also transfer the system attributes. */?

BTW, do you think we should alter the test in PG 11 branch to test
that system attributes are returned correctly?  Once we settle the
other issue, we can adjust the HEAD's test to do likewise?

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-09-23 Thread Amit Kapila
On Wed, Sep 23, 2020 at 12:00 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > The idea is that we can't use this optimization if the value is not
> > cached because we can't rely on lseek behavior. See all the discussion
> > between Horiguchi-San and me in the thread above. So, how would you
> > ensure that if we don't use Kirk-San's proposal?
>
> Hmm, buggy Linux kernel...  (Until when should we be worried about the bug?)
>
> According to the following Horiguchi-san's suggestion, it's during normal 
> operation, not during recovery, when we should be careful, right?
>

No, we need to be careful during recovery as well. We need to ensure that
we use the cached value during recovery and that the cached value is always
up-to-date. We can't rely on lseek; I have provided a scenario upthread [1]
where such behavior can cause a problem, and Tom Lane's response in that
thread explains why the same can be true for recovery as well.

The basic approach we are trying to pursue here is to rely on the cached
value of 'number of blocks': that always gives the correct value, and even
if there is a problem it will be our bug, so we don't need to rely on the
OS for the correct value, and it is better w.r.t. performance as well.
Caching is currently only possible during recovery, so we are using it in
the recovery path; later, once Thomas's patch to cache it for non-recovery
cases is also done, we can use it for non-recovery cases as well.
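
To make the intent concrete, here is a rough sketch of the decision (the
helper is hypothetical; the threshold name follows Kirk-San's patch elsewhere
in this thread):

static bool
can_use_targeted_buffer_drop(SMgrRelation reln, ForkNumber forknum,
                             BlockNumber *nblocks)
{
    /* Outside recovery we can't trust lseek-based sizes, so do the full scan. */
    if (!InRecovery)
        return false;

    /* During recovery smgrnblocks() is expected to return the cached value. */
    *nblocks = smgrnblocks(reln, forknum);

    /* Only a small number of to-be-dropped blocks justifies per-block lookups. */
    return *nblocks < BUF_DROP_FULLSCAN_THRESHOLD;
}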

[1] - 
https://www.postgresql.org/message-id/CAA4eK1LqaJvT%3DbFOpc4i5Haq4oaVQ6wPbAcg64-Kt1qzp_MZYA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Parallel Inserts in CREATE TABLE AS

2020-09-23 Thread Bharath Rupireddy
Hi,

The idea of this patch is to allow the leader and each worker to insert the
tuples in parallel if the SELECT part of the CTAS is parallelizable. Along
with the parallel inserts, if the CTAS code path is allowed to do
table_multi_insert() [1], then the gain we achieve is as follows:

For a table with 2 integer columns and 100 million tuples (more test results
are at [2]), the exec time on HEAD is *120sec*, whereas with the parallelism
patch proposed here and the multi insert patch [1], with 3 workers and leader
participation, the exec time is *22sec (5.45X)*. With the current CTAS code,
which does single-tuple inserts (see intorel_receive()), the performance gain
is limited to ~1.7X with parallelism. This is due to the fact that the
workers contend more for locks on buffer pages while extending the table.
So, the maximum benefit we can get for CTAS is with both parallelism and
multi-tuple inserts.

The design:
Let the planner know that the SELECT is from CTAS in createas.c so that it
can set the number of tuples transferred from the workers to the Gather node
to 0. With this change, there is a chance that the planner chooses a parallel
plan. After planning, check in createas.c whether the upper plan node is a
Gather and mark a parallelism flag in the CTAS dest receiver. Pass the into
clause, object id, and command id from the leader to the workers, so that
each worker can create its own CTAS dest receiver. The leader inserts its
share of tuples if instructed to do so, and so do the workers. Each worker
atomically writes its number of inserted tuples into a shared memory
variable; the leader combines this with its own number of inserted tuples
and reports the total to the client.
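
A rough sketch of what the CTAS dest receiver side could look like (names and
layout here are only illustrative, not the exact patch):

typedef struct DR_intorel_parallel
{
    DestReceiver pub;               /* standard dest receiver header */
    IntoClause *into;               /* INTO clause passed from the leader */
    Oid         object_id;          /* OID of the table being created */
    CommandId   cid;                /* command id shared with the workers */
    bool        is_parallel;        /* true when the upper plan node is a Gather */
    pg_atomic_uint64 *nprocessed;   /* shared count of inserted tuples */
} DR_intorel_parallel;

/* called by the leader and each worker once its own inserts are done */
static void
report_inserted_tuples(DR_intorel_parallel *myState, uint64 ntuples)
{
    pg_atomic_add_fetch_u64(myState->nprocessed, ntuples);
}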

Below things are still pending. Thoughts are most welcome:
1. How best can we lift the "cannot insert tuples in a parallel worker"
restriction in heap_prepare_insert() for only the CTAS case, or for that
matter parallel copy? How about having a variable in one of the worker global
contexts and using that? Of course, we can remove this restriction entirely
if we fully allow parallelism for INSERT INTO SELECT, CTAS, and COPY.
2. How to represent the parallel insert for CTAS in explain plans? The
explain CTAS shows the plan for only the SELECT part. How about having some
textual info along with the Gather node?

 -
 Gather  (cost=1000.00..108738.90 rows=0 width=8)
 Workers Planned: 2
->  Parallel Seq Scan on t_test  (cost=0.00..106748.00 rows=4954
width=8)
 Filter: (many < 1)

 -
3. Need to restrict parallel inserts if CTAS tries to create temp/global
tables, as the workers will not have access to those tables. Need to analyze
whether to allow parallelism if CTAS has prepared statements or WITH NO DATA.
4. Need to stop unnecessary parallel shared state, such as the tuple queue,
from being created and shared with workers.
5. Addition of new test cases. Testing with more scenarios and different
data sets, sizes, tablespaces, select into. Analysis on the 2 mismatches in
write_parallel.sql regression test.

Thoughts?

Credits:
1. Thanks to Dilip Kumar for the main design idea and the discussions.
Thanks to Vignesh for the discussions.
2. Patch development, testing is by me.
3. Thanks to the authors of table_multi_insert() in CTAS patch [1].

[1] - For table_multi_insert() in CTAS, I used an in-progress patch
available at
https://www.postgresql.org/message-id/CAEET0ZG31mD5SWjTYsAt0JTLReOejPvusJorZ3kGZ1%3DN1AC-Fw%40mail.gmail.com
[2] - Table with 2 integer columns, 100million tuples, with leader
participation,with default postgresql.conf file. All readings are of
triplet form - (workers, exec time in sec, improvement).
case 1: no multi inserts -
(0,120,1X),(1,91,1.32X),(2,75,1.6X),(3,67,1.79X),(4,72,1.66X),(5,77,1.56X),(6,83,1.44X)
case 2: with multi inserts -
(0,59,1X),(1,32,1.84X),(2,28,2.1X),(3,25,2.36X),(4,23,2.56X),(5,22,2.68X),(6,22,2.68X)
case 3: same table but unlogged with multi inserts -
(0,50,1X),(1,28,1.78X),(2,25,2X),(3,22,2.27X),(4,21,2.38X),(5,21,2.38X),(6,20,2.5X)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 9e45426a6d4d6f030ba24ed58eb0e2ff5912a972 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Sun, 20 Sep 2020 09:23:06 +0530
Subject: [PATCH v1] Parallel Inserts in CREATE TABLE AS

The idea of this patch is to allow the leader and each worker
to insert the tuples in parallel if the SELECT part of the CTAS
is parallelizable.

The design:
Let the planner know that the SELECT is from CTAS in createas.c
so that it can set the number of tuples transferred from the
workers to Gather node to 0. With this change, there are chances
that the planner may choose the parallel plan. After the planning,
check if the upper plan node is Gather in createas.c and mark a
parallelism flag in the CTAS dest receiver. Pass the into clause,
object id, command id from the leader to

Prefer TG_TABLE_NAME over TG_RELNAME in tests

2020-09-23 Thread Daniel Gustafsson
TG_RELNAME was marked deprecated in commit 3a9ae3d2068 some 14 years ago, but
we still use it in the triggers test suite (test added in 59b4cef1eb74a a year
before deprecation).  Seems about time to move over to TG_TABLE_NAME ourselves,
as TG_RELNAME is still covered by the test added in the deprecation commit.

cheers ./daniel



tg_relname.patch
Description: Binary data


Re: Report error position in partition bound check

2020-09-23 Thread Amit Langote
Thanks Ashutosh.

On Fri, Sep 18, 2020 at 7:33 PM Ashutosh Bapat
 wrote:
> Thanks Amit for addressing comments.
>
> @@ -4256,5 +4256,8 @@ transformPartitionBoundValue(ParseState *pstate, Node 
> *val,
>   if (!IsA(value, Const))
>   elog(ERROR, "could not evaluate partition bound expression");
>
> + /* Preserve parser location information. */
> + ((Const *) value)->location = exprLocation(val);
> +
>   return (Const *) value;
>  }
>
> This caught my attention and I was wondering whether transformExpr() itself 
> should transfer the location from input expression to the output expression. 
> Some minions of transformExprRecurse() seem to be doing that. The change here 
> may be an indication that some of them are not doing this. In that case may 
> be it's better to find those and fix rather than a white-wash fix here. In 
> what case did we find that location was not set by transformExpr? Sorry for 
> not catching this earlier.

AFAICS, transformExpr() is fine.  What loses the location value is the
unconditional evaluate_expr() call which generates a fresh Const node,
possibly after evaluating a non-Const expression that is passed to it.
I don't find it very desirable to change evaluate_expr() to accept a
location value, because other callers of it don't seem to care.
Instead, in the updated patch, I have made calling evaluate_expr()
conditional on the expression at hand being a non-Const node and assigned
the location by hand on return.  If the expression is already a Const, we
don't need to update the location field, as it should already be correct.
Though, I did notice that the evaluate_expr() call has an additional
responsibility, which is to apply the partition key's specified collation
to the bound expression, so we should likewise not fail to update an
already-Const node's collation.
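
Concretely, the updated code is shaped roughly like this (a sketch rather than
the exact patch text; variable names follow transformPartitionBoundValue()):

    if (!IsA(value, Const))
    {
        value = (Node *) evaluate_expr((Expr *) value, exprType(value),
                                       exprTypmod(value), partCollation);
        if (!IsA(value, Const))
            elog(ERROR, "could not evaluate partition bound expression");

        /* evaluate_expr() built a fresh Const, so restore the parser location. */
        ((Const *) value)->location = exprLocation(val);
    }
    else
    {
        /* Already a Const; make sure it carries the partition key collation. */
        ((Const *) value)->constcollid = partCollation;
    }

    return (Const *) value;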

> /* New lower bound is certainly >= bound at offet. */
> offet/offset? But this comment is implied by the comment just two lines 
> above. So I am not sure it's really needed.

Given that cmpval is set all the way in partition_range_bsearch(), I
thought it would help to clarify why this code can assume it must be
>= 0.  It is because a valid offset returned by
partition_range_bsearch() must correspond to a bound that it found to
be <= the probe bound passed to it.

> /* Fetch the problem bound from lower datums list. */
> This is fetching problematic key value rather than the whole problematic 
> bound. I think the comment would be useful if it explains why cmpval -1 th 
> key is problematic but then that's evident from the prologue of 
> partition_rbound_cmp() so I am not sure if this comment is really required. 
> For example, we aren't adding a comment here
> + overlap_location = ((PartitionRangeDatum *)
> + list_nth(spec->upperdatums, -cmpval - 1))->location;

In the attached updated patch, I have tried to make the code and
comments for different cases consistent.  Please have a look.


--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v5-0001-Improve-check-new-partition-bound-error-position-.patch
Description: Binary data


Re: Parallel INSERT (INTO ... SELECT ...)

2020-09-23 Thread Greg Nancarrow
> - When INSERTs are made parallel, currently the reported row-count in
> the "INSERT 0 " status only reflects the rows that the
> leader has processed (not the workers) - so it is obviously less than
> the actual number of rows inserted.

Attached an updated patch which fixes this issue (for parallel
INSERTs, each worker's processed tuple count is communicated in shared
memory back to the leader, where it is added to the global
"es_processed" count).


0002-ParallelInsertSelect.patch
Description: Binary data


Re: OpenSSL 3.0.0 compatibility

2020-09-23 Thread Daniel Gustafsson
> On 23 Sep 2020, at 10:19, Michael Paquier  wrote:
> 
> On Tue, Sep 22, 2020 at 02:01:18PM +0200, Daniel Gustafsson wrote:
>> Another option would be to follow OpenSSL's deprecations and mark these
>> ciphers as deprecated such that we can remove them in case they indeed get
>> removed from libcrypto.  That would give users a heads-up that they should
>> have a migration plan for if that time comes.
> 
> Does that mean a deprecation note in the docs as well as a WARNING
> when attempting to use those ciphers in pgcrypto with the version of
> OpenSSL marking such ciphers as deprecated?  I would assume that we
> should do both, rather than only one of them to bring more visibility
> to the user.

We generally don't issue WARNINGs for deprecated functionality, do we?  The only
one I can see is for GLOBAL TEMPORARY in temp table creation.  The relevant
errcode ERRCODE_WARNING_DEPRECATED_FEATURE is also not used anywhere.

I'd expect it to just be a note in the documentation, with a prominent
placement in the release notes, if we decide to do something like this.

cheers ./daniel



Re: OpenSSL 3.0.0 compatibility

2020-09-23 Thread Michael Paquier
On Tue, Sep 22, 2020 at 02:01:18PM +0200, Daniel Gustafsson wrote:
> Another option would be to follow OpenSSL's deprecations and mark these
> ciphers as deprecated such that we can remove them in case they indeed get
> removed from libcrypto.  That would give users a heads-up that they should
> have a migration plan for if that time comes.

Does that mean a deprecation note in the docs as well as a WARNING
when attempting to use those ciphers in pgcrypto with the version of
OpenSSL marking such ciphers as deprecated?  I would assume that we
should do both, rather than only one of them to bring more visibility
to the user.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Add features to pg_stat_statements

2020-09-23 Thread Katsuragi Yuta

On 2020-09-23 16:01, Julien Rouhaud wrote:

Not only providing a view but also logging evictions
along with the number of evicted entries might be a choice.
This idea is from legrand legrand [1].


+1.  I'm wondering if logging each evicted entry, with its queryid,
would help to estimate the actual size of the normalised queries set.


I agree that the estimation of the actual size of the
query set is important.
But it looks difficult to estimate the actual size of the
query set from the queryid, because the queryid is a 64-bit hash value.

Regards,
Katsuragi Yuta




Re: OpenSSL 3.0.0 compatibility

2020-09-23 Thread Michael Paquier
On Mon, Sep 21, 2020 at 08:18:42PM +0200, Daniel Gustafsson wrote:
> I'm far from fluent in OpenSSL, but AFAICT it has to do with the new provider
> API.  The default value for padding is unchanged, but it needs to be
> propagated into the provider to be set in the context where the old code
> picked it up automatically.  The relevant OpenSSL commit (83f68df32f0067ee7b0),
> which changes the docs to say the function should be called, doesn't contain
> enough information to explain why.

Hmm.  Perhaps we should consider making this part conditional on
3.0.0?  But I don't see an actual reason why we cannot do it
unconditionally either.  This needs careful testing for sure.
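
For reference, the unconditional variant would be along these lines (a hedged
sketch only, assuming the function in question is EVP_CIPHER_CTX_set_padding();
the helper and error handling are illustrative, not pgcrypto's actual code):

#include <openssl/evp.h>

static bool
init_cipher_with_padding(EVP_CIPHER_CTX *ctx, const EVP_CIPHER *cipher,
                         const unsigned char *key, const unsigned char *iv,
                         int enc, int padding)
{
    if (EVP_CipherInit_ex(ctx, cipher, NULL, key, iv, enc) != 1)
        return false;

    /* Harmless on older OpenSSL; needed for the 3.0 provider to see the value. */
    return EVP_CIPHER_CTX_set_padding(ctx, padding) == 1;
}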

> Turns out it was coming from when I tested against OpenSSL git HEAD, so it may
> come in alpha7 (or whatever the next version will be).  Let's disregard this
> for now until dust settles, I've dropped the patch from the series.

Yeah.  I have just tried that on the latest HEAD at e771249 and I
could reproduce what you saw.  It smells to me like a regression
introduced by upstream, as per 9a30f40c and c2150f7.  I'd rather wait
for 3.0.0 to be GA before concluding here.  If it proves to be
reproducible with their golden version as a bug (or even not as a
bug), we will need to have a workaround in any case.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Add features to pg_stat_statements

2020-09-23 Thread Katsuragi Yuta

On 2020-09-19 16:35, legrand legrand wrote:

+1 !

Another way is to log evictions; it provides information about time and
amount:

for (i = 0; i < nvictims; i++)
{
    hash_search(pgssp_hash, &entries[i]->key, HASH_REMOVE, NULL);
}

pfree(entries);

/*
 * Trace when evicting entries; if this is happening too often it can slow
 * queries ... increasing the pg_stat_sql_plans.max value could help.
 */
ereport(LOG,
        (errmsg("pg_stat_sql_plans evicting %d entries", nvictims),
         errhidecontext(true), errhidestmt(true)));



--
Sent from: 
https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Thank you for your comments!
In addition to providing a view that is like a summary of history,
logging might be a good choice.

Regards,
Katsuragi Yuta




Re: Memory allocation abstraction in pgcrypto

2020-09-23 Thread Michael Paquier
On Fri, Sep 18, 2020 at 10:00:04PM +0200, Daniel Gustafsson wrote:
The attached removes it in favor of using palloc() et al. directly.  Is there
> any reason to keep it around still?

I doubt that anybody has been compiling with PX_OWN_ALLOC in the last
years, so let's remove this abstraction.  And +1 for your patch.
--
Michael


signature.asc
Description: PGP signature


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-23 Thread k.jami...@fujitsu.com
On Wednesday, September 23, 2020 2:37 PM, Tsunakawa, Takayuki wrote:
> > I revised the patch based from my understanding of Horiguchi-san's
> > comment, but I could be wrong.
> > Quoting:
> >
> > "
> > +   /* Get the number of blocks for the supplied
> relation's
> > fork */
> > +   nblocks = smgrnblocks(smgr_reln,
> > forkNum[fork_num]);
> > +   Assert(BlockNumberIsValid(nblocks));
> > +
> > +   if (nblocks <
> BUF_DROP_FULLSCAN_THRESHOLD)
> >
> > As mentioned upthread, the criteria whether we do full-scan or
> > lookup-drop is how large portion of NBUFFERS this relation-drop can be
> > going to invalidate.  So the nblocks above should be the sum of number
> > of blocks to be truncated (not just the total number of blocks) of all
> > designated forks.  Then once we decided to do lookup-drop method, we
> > do that for all forks."
> 
> One takeaway from Horiguchi-san's comment is to use the number of blocks
> to invalidate for comparison, instead of all blocks in the fork.  That is, use
> 
> nblocks = smgrnblocks(fork) - firstDelBlock[fork];
> 
> Does this make sense?

Hmm. OK, I think I got it into my head the wrong way and misunderstood what it
meant. I'll debug again using ereport just to check that the values and
behavior are correct. Your comment about the V14 patch made me realize that it
reverted to the previous, slower version where we scan NBuffers for each fork.
Thank you for explaining it.

> What do you think is the reason for summing up all forks?  I didn't
> understand why.  Typically, FSM and VM forks are very small.  If the main
> fork is larger than NBuffers / 500, then v14 scans the entire shared buffers 
> for
> the FSM and VM forks as well as the main fork, resulting in three scans in
> total.
> 
> Also, if you want to judge the criteria based on the total blocks of all 
> forks, the
> following if should be placed outside the for loop, right?  Because this if
> condition doesn't change inside the for loop.
> 
> + if ((nblocks / (uint32)NBuffers) <
> BUF_DROP_FULLSCAN_THRESHOLD &&
> + BlockNumberIsValid(nblocks))
> + {
> 
> 
> 
> > > (2)
> > > + if ((nblocks / (uint32)NBuffers) <
> > > BUF_DROP_FULLSCAN_THRESHOLD &&
> > > + BlockNumberIsValid(nblocks))
> > > + {
> > >
> > > The division by NBuffers is not necessary, because both sides of =
> > > are number of blocks.
> >
> > Again I based it from my understanding of the comment above, so
> > nblocks is the sum of all blocks to be truncated for all forks.
> 
> But the left expression of "<" is a percentage, while the right one is a block
> count.  Two different units are compared.
> 

Right. Makes sense. Fixed.
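
Just to spell out the corrected comparison (illustrative only, using the
per-fork count of blocks to be truncated as you suggested), both sides are now
block counts:

    /* Number of blocks this truncation will invalidate in this fork. */
    nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]) - firstDelBlock[fork_num];

    if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
    {
        /* look up and invalidate only the buffers of the truncated blocks */
    }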

> > > Why is BlockNumberIsValid(nblocks)) call needed?
> >
> > I thought we need to ensure that nblocks is not invalid, so I also
> > added
> 
> When is it invalid?  smgrnblocks() seems to always return a valid block
> number.  Am I seeing a different source code (I saw HEAD)?

It's based on the discussion upthread about guaranteeing that the cache is
valid during recovery, and that we don't want to proceed with the optimization
in case nblocks is invalid. It may not be needed, so I have already removed it,
because the correct direction is to ensure that smgrnblocks returns the precise
value. Considering the test case that Horiguchi-san suggested (attached as a
separate patch), maybe there's no need to check it in the loop condition.
For now, I haven't modified the design of smgrnblocks (or created a new
function); I just updated the patches based on the recent comments.

Thank you very much again for the reviews.

Best regards,
Kirk Jamison


v15-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v15-Speedup-dropping-of-relation-buffers-during-recovery.patch


v1-Prevent-smgrextend-from-invalidating-blocks-during-recovery.patch
Description:  v1-Prevent-smgrextend-from-invalidating-blocks-during-recovery.patch


Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-09-23 Thread Keisuke Kuroda
Hi hackers,

I found a problem in logical replication.
It seems to have the same cause as the following problems.

  Creating many tables gets logical replication stuck
  
https://www.postgresql.org/message-id/flat/20f3de7675f83176253f607b5e199b228406c21c.camel%40cybertec.at

  Logical decoding CPU-bound w/ large number of tables
  
https://www.postgresql.org/message-id/flat/CAHoiPjzea6N0zuCi%3D%2Bf9v_j94nfsy6y8SU7-%3Dbp4%3D7qw6_i%3DRg%40mail.gmail.com

# problem

* logical replication enabled
* walsender process has RelfilenodeMap cache(2000 relations in this case)
* TRUNCATE or DROP or CREATE many tables in same transaction

At this time, the walsender process continuously uses 100% of one CPU core.

# environment

PostgreSQL 12.4(rpm)
CentOS 7.6.1810

# test case

sh logical_replication_truncate.sh

# perf report walsender process

Samples: 25K of event 'cpu-clock:uhH', Event count (approx.): 631525
Overhead  Command   Shared Object   Symbol
  87.34%  postgres  postgres[.] hash_seq_search
   2.80%  postgres  postgres[.] hash_search_with_hash_value
   1.57%  postgres  postgres[.] LocalExecuteInvalidationMessage
   1.50%  postgres  postgres[.] hash_search
   1.31%  postgres  postgres[.] RelfilenodeMapInvalidateCallback
   0.83%  postgres  postgres[.] uint32_hash
   0.79%  postgres  pgoutput.so [.] rel_sync_cache_relation_cb
   0.63%  postgres  postgres[.] hash_uint32
   0.59%  postgres  postgres[.] PlanCacheRelCallback
   0.48%  postgres  postgres[.] CatCacheInvalidate
   0.43%  postgres  postgres[.] ReorderBufferCommit
   0.36%  postgres  libc-2.17.so[.] __memcmp_sse4_1
   0.34%  postgres  postgres[.] deregister_seq_scan
   0.27%  postgres  postgres[.] hash_seq_init
   0.27%  postgres  postgres[.] CallSyscacheCallbacks
   0.23%  postgres  postgres[.] SysCacheInvalidate
   0.10%  postgres  postgres[.] memcmp@plt
   0.08%  postgres  postgres[.] RelationCacheInvalidateEntry
   0.05%  postgres  postgres[.] InvalidateCatalogSnapshot
   0.03%  postgres  postgres[.] GetCurrentTransactionNestLevel
   0.01%  postgres  pgoutput.so [.] hash_search@plt
   0.00%  postgres  libpthread-2.17.so  [.] __pread_nocancel

# backtrace walsender process

0x00889cb4 in hash_seq_search
(status=status@entry=0x7ffd5ae38310) at dynahash.c:1442
#0  0x00889cb4 in hash_seq_search
(status=status@entry=0x7ffd5ae38310) at dynahash.c:1442
#1  0x00877df8 in RelfilenodeMapInvalidateCallback
(arg=, relid=17284) at relfilenodemap.c:64
#2  0x0086999a in LocalExecuteInvalidationMessage
(msg=0x2f2ef18) at inval.c:595
#3  0x0070b81e in ReorderBufferExecuteInvalidations
(rb=, txn=, txn=) at
reorderbuffer.c:2149
#4  ReorderBufferCommit (rb=0x2cbda90, xid=xid@entry=490,
commit_lsn=94490944, end_lsn=,
commit_time=commit_time@entry=653705313747147,
origin_id=origin_id@entry=0, origin_lsn=origin_lsn@entry=0) at
reorderbuffer.c:1770
#5  0x00701248 in DecodeCommit (xid=490,
parsed=0x7ffd5ae38600, buf=0x7ffd5ae387c0, ctx=0x2c2f1e0) at
decode.c:640
#6  DecodeXactOp (ctx=0x2c2f1e0, buf=buf@entry=0x7ffd5ae387c0) at decode.c:248
#7  0x007015fa in LogicalDecodingProcessRecord (ctx=0x2c2f1e0,
record=0x2c2f458) at decode.c:117
#8  0x00712cd1 in XLogSendLogical () at walsender.c:2848
#9  0x00714e92 in WalSndLoop
(send_data=send_data@entry=0x712c70 ) at
walsender.c:2199
#10 0x00715b51 in StartLogicalReplication (cmd=0x2c8c5c0) at
walsender.c:1129
#11 exec_replication_command (
cmd_string=cmd_string@entry=0x2c0af60 "START_REPLICATION SLOT
\"subdb_test\" LOGICAL 0/0 (proto_version '1', publication_names
'\"pubdb_test\"')") at walsender.c:1545
#12 0x00760ff1 in PostgresMain (argc=,
argv=argv@entry=0x2c353d0, dbname=0x2c35288 "postgres",
username=)
at postgres.c:4243
#13 0x00484172 in BackendRun (port=,
port=) at postmaster.c:4448
#14 BackendStartup (port=0x2c2d200) at postmaster.c:4139
#15 ServerLoop () at postmaster.c:1704
#16 0x006f10b3 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x2c05b20) at postmaster.c:1377
#17 0x00485073 in main (argc=3, argv=0x2c05b20) at main.c:228

# probable cause

RelfilenodeMapInvalidateCallback is called many times.

Every time the CommandId is incremented, all invalidations are executed.
ReorderBufferExecuteInvalidations repeats the invalidation of all objects
TRUNCATEd in the same transaction once per object.

hash_seq_search in RelfilenodeMapInvalidateCallback is heavy,
but I have no idea how to solve this problem...
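
For reference, a simplified sketch of that callback (from memory, mirroring
src/backend/utils/cache/relfilenodemap.c, so it may not match the source
exactly) shows why every single invalidation message walks the whole hash:

static void
RelfilenodeMapInvalidateCallback(Datum arg, Oid relid)
{
    HASH_SEQ_STATUS status;
    RelfilenodeMapEntry *entry;

    hash_seq_init(&status, RelfilenodeMapHash);
    while ((entry = hash_seq_search(&status)) != NULL)
    {
        /* Remove on a full reset, for negative entries, or on a matching relid. */
        if (relid == InvalidOid ||
            entry->relid == InvalidOid ||
            entry->relid == relid)
            hash_search(RelfilenodeMapHash, &entry->key, HASH_REMOVE, NULL);
    }
}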

./src/backend/replication/logical/reorderbuffer.c
1746 case REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID:
1747   Assert(change->data.command_id != InvalidCommandId);
1748
1749   if (command_id < change->data.command_id)
1750  

Re: [PATCH] Add features to pg_stat_statements

2020-09-23 Thread Julien Rouhaud
On Wed, Sep 23, 2020 at 2:48 PM Katsuragi Yuta
 wrote:
>
> On 2020-09-18 18:49, Julien Rouhaud wrote:
> > Did you consider also adding the cumulated number of
> > evicted entries?  This could be useful to know how to configure
> > pg_stat_statements.max.
>
> Thank you for your comments!
> I overlooked the cumulated number of evicted entries.
> This statistic looks important. But, I am not sure
> if I should add this statistic to a view.
> This is because I am not sure how to utilize the cumulated
> number of evicted entries for configuring pg_stat_statements.max.

You're right, as the number of evicted entries doesn't tell you the number
of entries that would have been needed to contain the entire workload.

> Not only providing a view but also logging evictions
> along with the number of evicted entries might be a choice.
> This idea is from legrand legrand [1].

+1.  I'm wondering if logging each evicted entry, with its queryid,
would help to estimate the actual size of the normalised queries set.