Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-23 Thread Andrey Lepikhov




On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey-san,

From: Tomas Vondra 

I needed to look at this patch while working on something related, and I found 
it
got broken by 6973533650c a couple days ago. So here's a fixed version, to keep
cfbot happy. I haven't done any serious review yet.


Could I or my colleague continue this patch in a few days?  It looks like it
has been stalled for over a month.


I haven't found any problems with this patch that need to be corrected.
It is waiting for action from the committers' side, I think.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Bogus documentation for bogus geometric operators

2020-11-23 Thread Pavel Borisov
>
> >> undocumented.  Maybe instead of removing, change the text to be
> >> "Deprecated, use the equivalent XXX operator instead."  Or we could
> >> add a footnote similar to what was there for a previous renaming:
>
> > The problem is that this new <<| is equivalent to <^ only for points.
> > (To recap: the source of the problem is that the <^ operator has the
> > same name for points and boxes, with different meanings for these types.)
>
> I don't think it's that hard to be clear; see proposed wording below.
>
> The other loose end is that I don't think we can take away the opclass
> entries for the old spellings, unless we're willing to visibly break
> people's queries by removing those operator names altogether.  That
> doesn't seem like it'll fly when we haven't even deprecated the old
> names yet.  So for now, we have to support both names in the opclasses.
> I extended the patch to do that.
>
> This version seems committable to me --- any thoughts?
>
The wording seems fine to me. I looked into the patch and the changes also
seem sensible, but I cannot apply this patch because of many rejects. Which
commit should I apply it onto?

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-23 Thread Bharath Rupireddy
Attaching v2 patch, rebased on the latest master 17958972.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v2-0001-Multi-Inserts-in-Create-Table-As.patch
Description: Binary data


Re: Split copy.c

2020-11-23 Thread Heikki Linnakangas

On 18/11/2020 08:21, Soumyadeep Chakraborty wrote:

On Tue, Nov 17, 2020 at 2:38 AM Heikki Linnakangas  wrote:

Fixed all the other things you listed, fixed a bug in setting
'file_encoding', and trimmed down the #includes.


Thanks! LGTM! Marking as Ready for Committer.


Pushed, thanks for the review!

- Heikki




Re: Bogus documentation for bogus geometric operators

2020-11-23 Thread Pavel Borisov
> The wording seems fine to me. I looked into the patch and the changes
> also seem sensible, but I cannot apply this patch because of many
> rejects. Which commit should I apply it onto?

Sorry, the rejects were due to my git configuration. I will apply and make
the final checks soon.


-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-23 Thread Heikki Linnakangas

On 23/11/2020 11:15, Bharath Rupireddy wrote:

Attaching v2 patch, rebased on the latest master 17958972.


I just broke this again with commit c532d15ddd to split up copy.c. 
Here's another rebased version.


- Heikki
>From dca55175c590914f6a802ec3d36e20db55a3e3c7 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 23 Nov 2020 14:03:05 +0530
Subject: [PATCH v3 1/1] Multi Inserts in Create Table As.

This could improve performance and also benefit CREATE MATERIALIZED
VIEW, since it reuses the CREATE TABLE AS code.
---
 src/backend/commands/copyfrom.c   |  31 +++--
 src/backend/commands/createas.c   | 100 --
 src/backend/executor/execTuples.c |  66 
 src/include/access/tableam.h  |  15 +
 src/include/executor/tuptable.h   |   2 +-
 5 files changed, 185 insertions(+), 29 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1b14e9a6eb0..a721600ade4 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -44,34 +44,17 @@
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
-/*
- * Flush buffers if there are >= this many bytes, as counted by the input
- * size, of tuples stored.
- */
-#define MAX_BUFFERED_BYTES		65535
-
 /* Trim the list of buffers back down to this number after flushing */
 #define MAX_PARTITION_BUFFERS	32
 
 /* Stores multi-insert data related to a single relation in CopyFrom. */
 typedef struct CopyMultiInsertBuffer
 {
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
+	TupleTableSlot *slots[MAX_MULTI_INSERT_TUPLES]; /* Array to store tuples */
 	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
 	BulkInsertState bistate;	/* BulkInsertState for this rel */
 	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
+	uint64		linenos[MAX_MULTI_INSERT_TUPLES];	/* Line # of tuple in copy
  * stream */
 } CopyMultiInsertBuffer;
 
@@ -214,7 +197,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
 	CopyMultiInsertBuffer *buffer;
 
 	buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
-	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
 	buffer->resultRelInfo = rri;
 	buffer->bistate = GetBulkInsertState();
 	buffer->nused = 0;
@@ -273,8 +256,8 @@ CopyMultiInsertInfoInit(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 static inline bool
 CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
 {
-	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
-		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
+	if (miinfo->bufferedTuples >= MAX_MULTI_INSERT_TUPLES ||
+		miinfo->bufferedBytes >= MAX_MULTI_INSERT_BUFFERED_BYTES)
 		return true;
 	return false;
 }
@@ -392,7 +375,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
 	FreeBulkInsertState(buffer->bistate);
 
 	/* Since we only create slots on demand, just drop the non-null ones. */
-	for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && buffer->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
 	table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
@@ -484,7 +467,7 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
 	int			nused = buffer->nused;
 
 	Assert(buffer != NULL);
-	Assert(nused < MAX_BUFFERED_TUPLES);
+	Assert(nused < MAX_MULTI_INSERT_TUPLES);
 
 	if (buffer->slots[nused] == NULL)
 		buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6bf6c5a3106..06d67ba7e4b 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -61,6 +61,13 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			ti_options;		/* table_tuple_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	MemoryContext	mi_context;	/* A temporary memory context for multi insert */
+	/* Buffered slots for a multi insert batch. */
+	TupleTableSlot *mi_slots[MAX_MULTI_INSERT_TUPLES];
+	/* Number of current buffered slots for a multi insert batch. */
+	int			mi_slots_num;
+	/* Total tuple size for a multi insert batch. */
+	int			mi_slots_size;
 } DR_intorel;
 
 /* utility functions for CTAS definition creation */
@@ -530,15 +537,33 @@ intorel_startup(DestReceiver *self, int oper

prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619

2020-11-23 Thread Heikki Linnakangas

Hi,

After my commit c532d15ddd to split up copy.c, buildfarm animal "prion" 
failed in pg_upgrade 
(https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16):



Upgrade Complete

Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:

/home/ec2-user/bf/root/HEAD/pgsql.build/tmp_install/home/ec2-user/bf/root/HEAD/inst/bin/vacuumdb
 --all --analyze-in-stages

Running this script will delete the old cluster's data files:
./delete_old_cluster.sh
waiting for server to start done
server started
pg_dump: error: query failed: ERROR:  missing chunk number 0 for toast value 
14334 in pg_toast_2619
pg_dump: error: query was: SELECT
a.attnum,
a.attname,
a.atttypmod,
a.attstattarget,
a.attstorage,
t.typstorage,
a.attnotnull,
a.atthasdef,
a.attisdropped,
a.attlen,
a.attalign,
a.attislocal,
pg_catalog.format_type(t.oid, a.atttypmod) AS atttypname,
array_to_string(a.attoptions, ', ') AS attoptions,
CASE WHEN a.attcollation <> t.typcollation THEN a.attcollation ELSE 0 END AS 
attcollation,
pg_catalog.array_to_string(ARRAY(SELECT pg_catalog.quote_ident(option_name) || 
' ' || pg_catalog.quote_literal(option_value) FROM 
pg_catalog.pg_options_to_table(attfdwoptions) ORDER BY option_name), E',
') AS attfdwoptions,
a.attidentity,
CASE WHEN a.atthasmissing AND NOT a.attisdropped THEN a.attmissingval ELSE null 
END AS attmissingval,
a.attgenerated
FROM pg_catalog.pg_attribute a LEFT JOIN pg_catalog.pg_type t ON a.atttypid = 
t.oid
WHERE a.attrelid = '35221'::pg_catalog.oid AND a.attnum > 0::pg_catalog.int2
ORDER BY a.attnum
pg_dumpall: error: pg_dump failed on database "regression", exiting
waiting for server to shut down done
server stopped
pg_dumpall of post-upgrade database cluster failed
make: *** [check] Error 1


I don't think this is related to commit c532d15ddd, as there is no COPY 
involved here. I have no clue what might be going on here, though. 
Any ideas?


Googling around, I found one report that looks somewhat similar: 
https://www.postgresql.org/message-id/20190830142655.GA14011%40telsasoft.com. 
That was with CLUSTER/VACUUM FULL, while pg_upgrade performs ANALYZE.


- Heikki




Re: [HACKERS] logical decoding of two-phase transactions

2020-11-23 Thread Ajin Cherian
On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila  wrote:

> I am planning to continue review of these patches but I thought it is
> better to check about the above changes before proceeding further. Let
> me know what you think?
>

I've had a look at the changes and done a few tests, and have no
comments. However, I did see that the test 002_twophase_streaming.pl
failed once. I've run it at least 30 times after that but haven't seen
it fail again.
Unfortunately my ulimit was not set up to create core dumps, so I don't
have a dump from when the test case failed. I will continue testing and
reviewing the changes.

regards,
Ajin Cherian
Fujitsu Australia




Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-23 Thread Bharath Rupireddy
On Mon, Nov 23, 2020 at 3:26 PM Heikki Linnakangas  wrote:
>
> On 23/11/2020 11:15, Bharath Rupireddy wrote:
> > Attaching v2 patch, rebased on the latest master 17958972.
>
> I just broke this again with commit c532d15ddd to split up copy.c.
> Here's another rebased version.
>

Thanks! I noticed that and was about to post a new patch. Anyway,
thanks for the rebased v3 patch. Attaching v3 here again for
visibility.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 7795a83a2485440e01dfb432c23c362c37a2a32b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 23 Nov 2020 15:36:36 +0530
Subject: [PATCH v3] Multi Inserts in Create Table As.

This could improve performance and also benefit CREATE MATERIALIZED
VIEW, since it reuses the CREATE TABLE AS code.
---
 src/backend/commands/copyfrom.c   |  31 +++--
 src/backend/commands/createas.c   | 100 --
 src/backend/executor/execTuples.c |  66 
 src/include/access/tableam.h  |  15 +
 src/include/executor/tuptable.h   |   2 +-
 5 files changed, 185 insertions(+), 29 deletions(-)

diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 1b14e9a6eb..a721600ade 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -44,34 +44,17 @@
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
 
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
-/*
- * Flush buffers if there are >= this many bytes, as counted by the input
- * size, of tuples stored.
- */
-#define MAX_BUFFERED_BYTES		65535
-
 /* Trim the list of buffers back down to this number after flushing */
 #define MAX_PARTITION_BUFFERS	32
 
 /* Stores multi-insert data related to a single relation in CopyFrom. */
 typedef struct CopyMultiInsertBuffer
 {
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
+	TupleTableSlot *slots[MAX_MULTI_INSERT_TUPLES]; /* Array to store tuples */
 	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
 	BulkInsertState bistate;	/* BulkInsertState for this rel */
 	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
+	uint64		linenos[MAX_MULTI_INSERT_TUPLES];	/* Line # of tuple in copy
  * stream */
 } CopyMultiInsertBuffer;
 
@@ -214,7 +197,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
 	CopyMultiInsertBuffer *buffer;
 
 	buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
-	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
 	buffer->resultRelInfo = rri;
 	buffer->bistate = GetBulkInsertState();
 	buffer->nused = 0;
@@ -273,8 +256,8 @@ CopyMultiInsertInfoInit(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 static inline bool
 CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
 {
-	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
-		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
+	if (miinfo->bufferedTuples >= MAX_MULTI_INSERT_TUPLES ||
+		miinfo->bufferedBytes >= MAX_MULTI_INSERT_BUFFERED_BYTES)
 		return true;
 	return false;
 }
@@ -392,7 +375,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
 	FreeBulkInsertState(buffer->bistate);
 
 	/* Since we only create slots on demand, just drop the non-null ones. */
-	for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && buffer->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
 	table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
@@ -484,7 +467,7 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
 	int			nused = buffer->nused;
 
 	Assert(buffer != NULL);
-	Assert(nused < MAX_BUFFERED_TUPLES);
+	Assert(nused < MAX_MULTI_INSERT_TUPLES);
 
 	if (buffer->slots[nused] == NULL)
 		buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6bf6c5a310..06d67ba7e4 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -61,6 +61,13 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			ti_options;		/* table_tuple_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	MemoryContext	mi_context;	/* A temporary memory context for multi insert */
+	/* Buffered slots for a multi insert batch. */
+	TupleTableSlot *mi_slots[MAX_MULTI_INSERT_TUPLES];
+	/* Number of current buffered slot

Re: Use of "long" in incremental sort code

2020-11-23 Thread Anastasia Lubennikova

On 05.11.2020 02:53, James Coleman wrote:

On Tue, Nov 3, 2020 at 4:42 PM Tomas Vondra
 wrote:

On Tue, Nov 03, 2020 at 03:53:53AM +0100, Tomas Vondra wrote:

Hi,

I took another look at this, and 99% of the patch (the fixes to sort
debug messages) seems fine to me.  Attached is the part I plan to get
committed, including commit message etc.


I've pushed this part. Thanks for the patch, Haiying Tang.


The one change I decided to remove is this change in tuplesort_free:

-   longspaceUsed;
+   int64   spaceUsed;

The reason why I think this variable should be 'long' is that we're
using it for this:

spaceUsed = LogicalTapeSetBlocks(state->tapeset);

and LogicalTapeSetBlocks is defined like this:

extern long LogicalTapeSetBlocks(LogicalTapeSet *lts);

FWIW the "long" is not introduced by incremental sort - it used to be in
tuplesort_end, the incremental sort patch just moved it to a different
function. It's a bit confusing that tuplesort_updatemax has this:

int64   spaceUsed;

But I'd argue this is actually wrong, and should be "long" instead. (And
this actually comes from the incremental sort patch, by me.)


FWIW, while looking at what the other places calling LogicalTapeSetBlocks
do, I noticed this:

uint64  disk_used = LogicalTapeSetBlocks(...);

in the disk-based hashagg patch. So that's a third data type ...


IMHO this should simply switch the current int64 variable to long, as it
was before. Not sure about the hashagg uint64 variable.

Is there anything that actually limits tape code to using at most 4GB
on 32-bit systems?


At first glance, I haven't found anything that could limit tape code. It 
uses BufFile, which is not limited by the OS file size limit.
Still, if we want to change 'long' in LogicalTapeSetBlocks, we should 
probably also update nBlocksWritten and other variables.


As far as I can see, the major part of the patch was committed, so I have 
updated the status of the CF entry to "Committed". Feel free to create a new 
entry if you're going to continue working on the remaining issue.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Parallel Inserts in CREATE TABLE AS

2020-11-23 Thread Bharath Rupireddy
On Mon, Oct 19, 2020 at 10:47 PM Bharath Rupireddy
 wrote:
>
> Attaching v3 patch herewith.
>
> I'm done with all the open points in my list. Please review the v3 patch and 
> provide comments.
>

Attaching v4 patch, rebased on the latest master 68b1a4877e. Also,
added this feature to commitfest -
https://commitfest.postgresql.org/31/2841/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 127309ffab82e372b338914ccc900463d8c63157 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Mon, 23 Nov 2020 16:27:45 +0530
Subject: [PATCH v4] Parallel Inserts in CREATE TABLE AS

The idea of this patch is to allow the leader and each worker
insert the tuples in parallel if the SELECT part of the CTAS is
parallelizable.

The design:
Let the planner know that the SELECT is from CTAS in createas.c
so that it can set the number of tuples transferred from the
workers to the Gather node to 0. With this change, the planner is
more likely to choose a parallel plan. After planning, check in
createas.c whether the upper plan node is a Gather and mark a
parallelism flag in the CTAS dest receiver. Pass the INTO clause,
object id, and command id from the leader to the workers, so that
each worker can create its own CTAS dest receiver. The leader
inserts its share of tuples if instructed to do so, and so do the
workers. Each worker atomically writes its number of inserted
tuples into a shared memory variable; the leader combines this
with its own count and reports the total to the client.
---
 src/backend/access/heap/heapam.c |  11 -
 src/backend/access/transam/xact.c|  30 +-
 src/backend/commands/createas.c  | 314 ---
 src/backend/commands/explain.c   |  36 +++
 src/backend/executor/execMain.c  |  19 ++
 src/backend/executor/execParallel.c  |  60 +++-
 src/backend/executor/nodeGather.c| 101 +-
 src/backend/optimizer/path/costsize.c|  12 +-
 src/include/access/xact.h|   1 +
 src/include/commands/createas.h  |  20 ++
 src/include/executor/execParallel.h  |   1 +
 src/include/nodes/execnodes.h|   5 +
 src/include/nodes/parsenodes.h   |   1 +
 src/test/regress/expected/write_parallel.out | 143 +
 src/test/regress/sql/write_parallel.sql  |  65 
 15 files changed, 682 insertions(+), 137 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 1b2f70499e..3045c0f046 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2043,17 +2043,6 @@ static HeapTuple
 heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
 	CommandId cid, int options)
 {
-	/*
-	 * To allow parallel inserts, we need to ensure that they are safe to be
-	 * performed in workers. We have the infrastructure to allow parallel
-	 * inserts in general except for the cases where inserts generate a new
-	 * CommandId (eg. inserts into a table having a foreign key column).
-	 */
-	if (IsParallelWorker())
-		ereport(ERROR,
-(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
- errmsg("cannot insert tuples in a parallel worker")));
-
 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..c85a9906f1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -763,18 +763,34 @@ GetCurrentCommandId(bool used)
 	/* this is global to a transaction, not subtransaction-local */
 	if (used)
 	{
-		/*
-		 * Forbid setting currentCommandIdUsed in a parallel worker, because
-		 * we have no provision for communicating this back to the leader.  We
-		 * could relax this restriction when currentCommandIdUsed was already
-		 * true at the start of the parallel operation.
-		 */
-		Assert(!IsParallelWorker());
+		 /*
+		  * This is a temporary hack for all common parallel insert cases i.e.
+		  * insert into, ctas, copy from. To be changed later. In a parallel
+		  * worker, set currentCommandIdUsed to true only if it was not set to
+		  * true at the start of the parallel operation (by way of
+		  * SetCurrentCommandIdUsedForWorker()). We have to do this because
+		  * GetCurrentCommandId(true) may be called from anywhere, especially
+		  * for parallel inserts, within parallel worker.
+		  */
+		Assert(!(IsParallelWorker() && !currentCommandIdUsed));
 		currentCommandIdUsed = true;
 	}
 	return currentCommandId;
 }
 
+/*
+ *	SetCurrentCommandIdUsedForWorker
+ *
+ * For a parallel worker, record that the currentCommandId has been used.
+ * This must only be called at the start of a parallel operation.
+*/
+void
+SetCurrentCommandIdUsedForWorker(void)
+{
+	Assert(IsParallelWorker() && !currentCommandIdUsed && currentCommandId != InvalidCommandId);
+	current

Re: [HACKERS] logical decoding of two-phase transactions

2020-11-23 Thread Amit Kapila
On Mon, Nov 23, 2020 at 3:41 PM Ajin Cherian  wrote:
>
> On Sun, Nov 22, 2020 at 12:31 AM Amit Kapila  wrote:
>
> > I am planning to continue review of these patches but I thought it is
> > better to check about the above changes before proceeding further. Let
> > me know what you think?
> >
>
> I've had a look at the changes and done a few tests, and have no
> comments.
>

Okay, thanks. Additionally, I have analyzed whether we need to call
SnapbuildCommittedTxn in DecodePrepare as was raised earlier for this
patch [1]. As mentioned in that thread SnapbuildCommittedTxn primarily
does three things (a) Decide whether we are interested in tracking the
current txn effects and if we are, mark it as committed. (b) Build and
distribute snapshot to all RBTXNs, if it is important. (c) Set base
snap of our xact if it did DDL, to execute invalidations during
replay.

For the first two, as the xact is still not visible to others, we
don't need to make it behave like a committed txn. To make the (DDL)
changes visible to the current txn, the message
REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID copies the snapshot which
fills the subxip array. This will be sufficient to make the changes
visible to the current txn. For the third, I have checked the code
that whenever we have any change message the base snapshot gets set
via SnapBuildProcessChange. It is possible that I have missed
something, but I don't want to call SnapbuildCommittedTxn in
DecodePrepare unless we have a clear reason to do so, so I'm leaving it
for now. Can you or anyone else see a reason we need it?

> However, I did see that the test 002_twophase_streaming.pl
> failed once. I've run it at least 30 times after that but haven't seen
> it fail again.
>

This test is based on waiting to see a particular message in the log. It is
possible it failed due to a timeout, which should only happen rarely. You
can check the failure logs in the test_decoding folder (probably in the
tmp_check folder). Even just a server or test log would help us diagnose
the problem.

[1] - https://www.postgresql.org/message-id/87zhxrwgvh.fsf%40ars-thinkpad

-- 
With Regards,
Amit Kapila.




Re: Bogus documentation for bogus geometric operators

2020-11-23 Thread Pavel Borisov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

I've made another check of the final state and believe it is ready to be pushed.

Pavel Borisov

Re: [HACKERS] Custom compression methods

2020-11-23 Thread Dilip Kumar
On Sat, Nov 21, 2020 at 3:50 AM Robert Haas  wrote:
>
> On Wed, Nov 11, 2020 at 9:39 AM Dilip Kumar  wrote:
> > There were a few problems in this rebased version, basically, the
> > compression options were not passed while compressing values from the
> > brin_form_tuple, so I have fixed this.
>
> Since the authorship history of this patch is complicated, it would be
> nice if you would include authorship information and relevant
> "Discussion" links in the patches.
>
> Design level considerations and overall notes:
>
> configure is autogenerated from configure.in, so the patch shouldn't
> include changes only to the former.
>
> Looking over the changes to src/include:
>
> +   PGLZ_COMPRESSION_ID,
> +   LZ4_COMPRESSION_ID
>
> I think that it would be good to assign values to these explicitly.
>
> +/* compresion handler routines */
>
> Spelling.
>
> +   /* compression routine for the compression method */
> +   cmcompress_function cmcompress;
> +
> +   /* decompression routine for the compression method */
> +   cmcompress_function cmdecompress;
>
> Don't reuse cmcompress_function; that's confusing. Just have a typedef
> per structure member, even if they end up being the same.
>
>  #define TOAST_COMPRESS_SET_RAWSIZE(ptr, len) \
> -   (((toast_compress_header *) (ptr))->rawsize = (len))
> +do { \
> +   Assert(len > 0 && len <= RAWSIZEMASK); \
> +   ((toast_compress_header *) (ptr))->info = (len); \
> +} while (0)
>
> Indentation.
>
> +#define TOAST_COMPRESS_SET_COMPRESSION_METHOD(ptr, cm_method) \
> +   ((toast_compress_header *) (ptr))->info |= ((cm_method) << 30);
>
> What about making TOAST_COMPRESS_SET_RAWSIZE() take another argument?
> And possibly also rename it to TEST_COMPRESS_SET_SIZE_AND_METHOD() or
> something? It seems not great to have separate functions each setting
> part of a 4-byte quantity. Too much chance of failing to set both
> parts. I guess you've got a function called
> toast_set_compressed_datum_info() for that, but it's just a wrapper
> around two macros that could just be combined, which would reduce
> complexity overall.
>
> +   T_CompressionRoutine,   /* in access/compressionapi.h */
>
> This looks misplaced. I guess it should go just after these:
>
> T_FdwRoutine,   /* in foreign/fdwapi.h */
> T_IndexAmRoutine,   /* in access/amapi.h */
> T_TableAmRoutine,   /* in access/tableam.h */
>
> Looking over the regression test changes:
>
> The tests at the top of create_cm.out that just test that we can
> create tables with various storage types seem unrelated to the purpose
> of the patch. And the file doesn't test creating a compression method
> either, as the file name would suggest, so either the file name needs
> to be changed (compression, compression_method?) or the tests don't go
> here.
>
> +-- check data is okdd
>
> I guess whoever is responsible for this comment prefers vi to emacs.
>
> I don't quite understand the purpose of all of these tests, and there
> are some things that I feel like ought to be tested that seemingly
> aren't. Like, you seem to test using an UPDATE to move a datum from a
> table to another table with the same compression method, but not one
> with a different compression method. Testing the former is nice and
> everything, but that's the easy case: I think we also need to test the
> latter. I think it would be good to verify not only that the data is
> readable but that it's compressed the way we expect. I think it would
> be a great idea to add a pg_column_compression() function in a similar
> spirit to pg_column_size(). Perhaps it could return NULL when
> compression is not in use or the data type is not varlena, and the
> name of the compression method otherwise. That would allow for better
> testing of this feature, and it would also be useful to users who are
> switching methods, to see what data they still have that's using the
> old method. It could be useful for debugging problems on customer
> systems, too.
>
> I wonder if we need a test that moves data between tables through an
> intermediary. For instance, suppose a plpgsql function or DO block
> fetches some data and stores it in a plpgsql variable and then uses
> the variable to insert into another table. Hmm, maybe that would force
> de-TOASTing. But perhaps there are other cases. Maybe a more general
> way to approach the problem is: have you tried running a coverage
> report and checked which parts of your code are getting exercised by
> the existing tests and which parts are not? The stuff that isn't, we
> should try to add more tests. It's easy to get corner cases wrong with
> this kind of thing.
>
> I notice that LIKE INCLUDING COMPRESSION doesn't seem to be tested, at
> least not by 0001, which reinforces my feeling that the tests here are
> not as thorough as they could be.
>
> +NOTICE:  pg_compression contains unpinned initdb-created object(s)
>
> This seems wrong to me - why is it OK?
>
>

RE: Parallel plans and "union all" subquery

2020-11-23 Thread Phil Florent
Hi Greg,

The implicit conversion was the cause of the non-parallel plan; thanks for the 
explanation and the workarounds. It can cause a huge difference in terms of 
performance, so I will pass the information on to our developers.

Regards,

Phil




De : Greg Nancarrow 
Envoyé : lundi 23 novembre 2020 06:04
À : Phil Florent 
Cc : pgsql-hackers@lists.postgresql.org 
Objet : Re: Parallel plans and "union all" subquery

On Sun, Nov 22, 2020 at 11:51 PM Phil Florent  wrote:
>
>
> Hi,
>
>
> I have a question about parallel plans. I also posted it on the general list 
> but perhaps it's a question for hackers. Here is my test case :
>
>
> explain
> select count(*)
> from (select
> n1
> from drop_me
> union all
> values(1)) ua;
>
>
> QUERY PLAN
> 
> Aggregate (cost=2934739.24..2934739.25 rows=1 width=8)
> -> Append (cost=0.00..2059737.83 rows=7113 width=32)
> -> Seq Scan on drop_me (cost=0.00..1009736.12 rows=7112 width=6)
> -> Subquery Scan on "*SELECT* 2" (cost=0.00..0.02 rows=1 width=32)
> -> Result (cost=0.00..0.01 rows=1 width=4)
> JIT:
> Functions: 4
> Options: Inlining true, Optimization true, Expressions true, Deforming true
>
>
> No parallel plan, 2s6
>
>
> I read the documentation but I don't get the reason for the "noparallel" seq 
> scan of drop_me in the last case?
>

Without debugging this, it looks to me like the UNION type resolution
isn't working as well as it possibly could in this case, for the
generation of a parallel plan. I found that with a minor tweak to your
SQL, either for the table creation or query, it will produce a
parallel plan.

Noting that currently you're creating the drop_me table with a
"numeric" column, you can either:

(1) Change the table creation

FROM:
create unlogged table drop_me as select generate_series(1,7e7) n1;
TO:
create unlogged table drop_me as select generate_series(1,7e7)::int n1;


OR


(2) Change the query

FROM:
explain
select count(*)
from (select
n1
from drop_me
union all
values(1)) ua;

TO:

explain
select count(*)
from (select
n1
from drop_me
union all
values(1::numeric)) ua;


QUERY PLAN

 Finalize Aggregate  (cost=821152.71..821152.72 rows=1 width=8)
   ->  Gather  (cost=821152.50..821152.71 rows=2 width=8)
 Workers Planned: 2
 ->  Partial Aggregate  (cost=820152.50..820152.51 rows=1 width=8)
   ->  Parallel Append  (cost=0.00..747235.71 rows=29166714 width=0)
 ->  Result  (cost=0.00..0.01 rows=1 width=0)
 ->  Parallel Seq Scan on drop_me
(cost=0.00..601402.13 rows=29166713 width=0)
(7 rows)


Regards,
Greg Nancarrow
Fujitsu Australia


Re: Online verification of checksums

2020-11-23 Thread Anastasia Lubennikova

On 21.11.2020 04:30, Michael Paquier wrote:

The only method I can think as being really
reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.


It seems reasonable to me to rely on checksums only.

As for the retry, I think that an API for concurrent I/O will be complicated. 
Instead, we can introduce a function to read the page directly from 
shared buffers after PAGE_RETRY_THRESHOLD attempts. That looks like a 
bullet-proof solution to me. Do you see any possible problems with it?
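
A rough sketch of the retry logic I have in mind (the helper names and the
handling of PAGE_RETRY_THRESHOLD below are only placeholders for illustration,
not existing backend functions):

	for (attempt = 0; attempt < PAGE_RETRY_THRESHOLD; attempt++)
	{
		read_page_from_file(rel, forknum, blkno, page);

		/* A valid checksum means the on-disk copy is fine. */
		if (page_checksum_ok(page, blkno))
			return true;
	}

	/*
	 * Still failing after PAGE_RETRY_THRESHOLD attempts: the block may just
	 * be getting written concurrently.  Fetch the authoritative copy from
	 * shared buffers; if that one is valid, the torn on-disk copy is
	 * harmless.
	 */
	return page_is_valid_in_shared_buffers(rel, forknum, blkno);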


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Why does create_gather_merge_plan need make_sort?

2020-11-23 Thread James Coleman
On Sun, Nov 22, 2020 at 5:07 PM Tomas Vondra
 wrote:
>
> On 11/22/20 10:31 PM, Tom Lane wrote:
> > Tomas Vondra  writes:
> >> On 11/20/20 11:24 PM, James Coleman wrote:
> >>> While looking at another issue I noticed that create_gather_merge_plan
> >>> calls make_sort if the subplan isn't sufficiently sorted. In all of
> >>> the cases I've seen where a gather merge path (not plan) is created
> >>> the input path is expected to be properly sorted, so I was wondering
> >>> if anyone happened to know what case is being handled by the make_sort
> >>> call. Removing it doesn't seem to break any tests.
> >
> >> Yeah, I think you're right this is dead code, essentially. We're only
> >> ever calling create_gather_merge_path() with pathkeys matching the
> >> subpath. And it's like that on REL_12_STABLE too, i.e. before the
> >> incremental sort was introduced.
> >
> > It's probably there by analogy to the other callers of
> > prepare_sort_from_pathkeys, which all do at least a conditional
> > make_sort().  I'd be inclined to leave it there; it's cheap insurance
> > against somebody weakening the existing behavior.
> >
>
> But how do we know it's safe to actually do the sort there, e.g. in
> light of the volatility/parallel-safety issues discussed in other threads?
>
> I agree the check may be useful, but maybe we should just do elog(ERROR)
> instead.

That was my thought. I'm not a big fan of maintaining a "this might be
useful" path, particularly when there isn't any case in the code or
tests that exercises it. In other words, not only is it not useful
[yet], but we also don't have any signal, whether from tests or
production use, that it works (or keeps working).

So I'm +1 on turning it into an ERROR log instead, since it seems to
me that encountering this case would almost certainly represent a bug
in path generation.
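
Concretely, I'm thinking of something like this minimal sketch in
create_gather_merge_plan() (the surrounding code is paraphrased here, so
treat it as an illustration rather than a patch):

	/*
	 * Instead of silently adding a Sort when the subplan isn't sufficiently
	 * ordered, treat that as a bug in path generation.
	 */
	if (!pathkeys_contained_in(pathkeys, best_path->subpath->pathkeys))
		elog(ERROR, "gather merge input is not sufficiently sorted");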

James




Re: Fix generate_useful_gather_paths for parallel unsafe pathkeys

2020-11-23 Thread James Coleman
On Sun, Nov 22, 2020 at 4:59 PM Tomas Vondra
 wrote:
>
> On 11/21/20 2:55 AM, James Coleman wrote:
> > Over on the -bugs list we had a report [1] of a seg fault with
> > incremental sort. The short of the investigation there was that a
> > subplan wasn't being serialized since it wasn't parallel safe, and
> > that attempting to initialize that subplan resulted in the seg fault.
> > Tom pushed a change to raise an ERROR when this occurs so that at
> > least we don't crash the server.
> >
> > There was some small amount of discussion about this on a thread about
> > a somewhat similar issue [2] (where volatile expressions were the
> > issue), but that thread resulted in a committed patch, and beyond that
> > it seems that this issue deserves its own discussion.
> >
> > I've been digging further into the problem, and my first question was
> > "why doesn't this result in a similar error but with a full sort when
> > incremental sort is disabled?" After all, we have code to consider a
> > gather merge + full sort on the cheapest partial path in
> > generate_useful_gather_paths(), and the plan that I get on Luis's test
> > case with incremental sort disabled has an full sort above a gather,
> > which presumably it'd be cheaper to push down into the parallel plan
> > (using gather merge instead of gather).
> >
> > It turns out that there's a separate bug here: I'm guessing that in
> > the original commit of generate_useful_gather_paths() we had a
> > copy/paste thinko in this code:
> >
> > /*
> >  * If the path has no ordering at all, then we can't use either
> >  * incremental sort or rely on implicit sorting with a gather
> >  * merge.
> >  */
> > if (subpath->pathkeys == NIL)
> >  continue;
> >
> > The comment claims that we should skip subpaths that aren't sorted at
> > all since we can't possibly use a gather merge directly or with an
> > incremental sort applied first. It's true that we can't do either of
> > those things unless the subpath is already at least partially sorted.
> > But subsequently we attempt to construct a gather merge path on top of
> > a full sort (for the cheapest path), and there's no reason to limit
> > that to partially sorted paths (indeed in the presence of incremental
> > sort it seems unlikely that that would be the cheapest path).
> >
> > We still don't want to build incremental sort paths if the subpath has
> > no pathkeys, but that will imply presorted_keys = 0, which we already
> > check.
> >
> > So 0001 removes that branch entirely. And, as expected, we now get a
> > gather merge + full sort and the resulting error about the subplan not
> > being initialized.
> >
>
> Yeah, that's clearly a thinko in generate_useful_gather_paths. Thanks
> for spotting it! The 0001 patch seems fine to me.
>
> > With that oddity out of the way, I started looking at the actually
> > reported problem, and nominally issue is that we have a correlated
> > subquery (in the final target list and the sort clause), and
> > correlated subqueries (not sure why exactly in this case; see [3])
> > aren't considered parallel-safe.
> >
> > As to why that's happening:
> > 1. find_em_expr_usable_for_sorting_rel doesn't exclude that
> > expression, and arguably correctly so in the general case since one
> > might (though we don't now) use this same kind of logic for
> > non-parallel plans.
> > 2. generate_useful_pathkeys_for_relation is noting that it'd be useful
> > to sort on that expression and sees it as safe in part due to the (1).
> > 3. We create a sort node that includes that expression in its output.
> > That seems pretty much the same (in terms of safety given generalized
> > input paths/plans, not in terms of what Robert brought up in [4]) as
> > what would happen in the prepare_sort_from_pathkeys call in
> > create_gather_merge_plan.
> >
> > So to fix this problem I've added the option to find_em_expr_for_rel
> > to restrict to parallel-safe expressions in 0002.
> >
>
> OK, that seems like a valid fix. I wonder if we can have EC with members
> mixing parallel-safe and parallel-unsafe expressions? I guess no, but
> it's more a feeling than a clear reasoning and so I wonder what would
> happen in such cases.

An immediately contrived case would be something like "t.x =
unsafe(t.y)". As to whether that can actually result in us wanting to
sort by the safe member before we can execute the unsafe member (e.g.,
in a filter), I'm less sure, but it seems at least plausible to me (or
I don't see anything inherently preventing it).

> > Given point 3 above (and the discussion with Robert earlier) above it
> > seems to me that there's a bigger problem here that just hasn't
> > happened to have been exposed yet: in particular the
> > prepare_sort_from_pathkeys call in create_gather_merge_plan seems
> > suspect to me. But it's also possible other usages of
> > prepare_sort_from_pathkeys could be suspect given other seemingly
> > unrelated changed to path generation, since nothing other than
> > implicit knowledge abo

Re: libpq compression

2020-11-23 Thread Daniil Zakhlystov
** this is a plaintext version of the previous HTML-formatted message **

Hi,

I’ve run a couple of pgbenchmarks using this patch with odyssey connection 
pooler, with client-to-pooler ZSTD compression turned on.

pgbench --builtin tpcb-like -t 75 --jobs=32 --client=1000

CPU utilization chart of the configuration above:
https://storage.yandexcloud.net/usernamedt/odyssey-compression.png

CPU overhead on average was about 10%.

pgbench -i -s 1500

CPU utilization chart of the configuration above:
https://storage.yandexcloud.net/usernamedt/odyssey-compression-i-s.png

As you can see, there was no noticeable difference in CPU utilization with 
ZSTD compression enabled or disabled.

Regarding replication, I've made a couple of fixes for this patch, you can find 
them in this pull request 
https://github.com/postgrespro/libpq_compression/pull/3

With these fixes applied, I've run some tests using this patch with streaming 
physical replication on some large clusters. Here is the difference of network 
usage on the replica with ZSTD replication compression enabled compared to the 
replica without replication compression:

- on pgbench -i -s 1500 there was ~23x less network usage

- on pgbench -T 300 --jobs=32 --client=640 there was ~4.5x less network usage

- on pg_restore of the ~300 GB database there was ~5x less network usage

To sum up, I think that the current version of the patch (with per-connection 
compression) is OK from the protocol point of view except for the compression 
initialization part. As discussed, we can either do initialization before the 
startup packet or move the compression to a _pq_ parameter to avoid issues on 
older backends.

Regarding switchable-on-the-fly compression: although it introduces more 
flexibility, it seems like it will significantly increase the implementation 
complexity of both the frontend and the backend. To support this approach in 
the future, maybe we should add something like a compression mode to the 
protocol and name the current approach “permanent”, while reserving the 
“switchable” compression type for a future implementation?

Thanks,

Daniil Zakhlystov

06.11.2020, 11:58, "Andrey Borodin" :
>>  6 нояб. 2020 г., в 00:22, Peter Eisentraut 
>> :
>>
>>  On 2020-11-02 20:50, Andres Freund wrote:
>>>  On 2020-10-31 22:25:36 +0500, Andrey Borodin wrote:
  But the price of compression is 1 cpu for 500MB/s (zstd). With a
  20Gbps network adapters cost of recompressing all traffic is at most
  ~4 cores.
>>>  It's not quite that simple, because presumably each connection is going
>>>  to be handled by one core at a time in the pooler. So it's easy to slow
>>>  down peak throughput if you also have to deal with TLS etc.
>>
>>  Also, current deployments of connection poolers use rather small machine 
>> sizes. Telling users you need 4 more cores per instance now to decompress 
>> and recompress all the traffic doesn't seem very attractive. Also, it's not 
>> unheard of to have more than one layer of connection pooling. With that, 
>> this whole design sounds a bit like a heat-generation system. ;-)
>
> User should ensure good bandwidth between pooler and DB. At least they must 
> be within one availability zone. This makes compression between pooler and DB 
> unnecessary.
> Cross-datacenter traffic is many times more expensive.
>
> I agree that switching between compression levels (including turning it off) 
> seems like nice feature. But
> 1. Scope of its usefulness is an order of magnitude smaller than compression 
> of the whole connection.
> 2. Protocol for this feature is significantly more complicated.
> 3. Restarted compression is much less efficient and effective.
>
> Can we design a protocol so that this feature may be implemented in future, 
> currently focusing on getting things compressed? Are there any drawbacks in 
> this approach?
>
> Best regards, Andrey Borodin.




Re: abstract Unix-domain sockets

2020-11-23 Thread Peter Eisentraut

On 2020-11-20 18:23, David G. Johnston wrote:
If there is dead code there is an underlying problem to 
address/discover, not just removing the dead code.  In this case are we 
saying that a new server won’t ever fail to start because the socket 
file exists but instead will just clobber the file with its own?  


Yes.  (In practice, there will be an error with respect to the lock file 
before you even get to that question, but that is different code elsewhere.)


Because given that error, and a server process that failed to clean up 
after itself, the correction to take would indeed seem to be to manually 
remove the file as the hint says.  IOW, fix the code, not the message?


I don't understand that.

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: [PoC] Non-volatile WAL buffer

2020-11-23 Thread Tomas Vondra
Hi,

On 11/23/20 3:01 AM, Tomas Vondra wrote:
> Hi,
> 
> On 10/30/20 6:57 AM, Takashi Menjo wrote:
>> Hi Heikki,
>>
>>> I had a new look at this thread today, trying to figure out where 
>>> we are.
>>
>> I'm a bit confused.
>>>
>>> One thing we have established: mmap()ing WAL files performs worse 
>>> than the current method, if pg_wal is not on a persistent memory 
>>> device. This is because the kernel faults in existing content of 
>>> each page, even though we're overwriting everything.
>>
>> Yes. In addition, after a certain page (in the sense of OS page) is 
>> msync()ed, another page fault will occur again when something is 
>> stored into that page.
>>
>>> That's unfortunate. I was hoping that mmap() would be a good option
>>> even without persistent memory hardware. I wish we could tell the
>>> kernel to zero the pages instead of reading them from the file.
>>> Maybe clear the file with ftruncate() before mmapping it?
>>
>> The area extended by ftruncate() appears as if it were zero-filled 
>> [1]. Please note that it merely "appears as if." It might not be 
>> actually zero-filled as data blocks on devices, so pre-allocating 
>> files should improve transaction performance. At least, on Linux 5.7
>>  and ext4, it takes more time to store into the mapped file just 
>> open(O_CREAT)ed and ftruncate()d than into the one filled already and
>> actually.
>>
> 
> Does it really matter that it only appears zero-filled? I think Heikki's
> point was that maybe ftruncate() would prevent the kernel from faulting
> the existing page content when we're overwriting it.
> 
> Not sure I understand what the benchmark with ext4 was doing, exactly.
> How was that measured? Might be interesting to have some simple
> benchmarking tool to demonstrate this (I believe a small standalone tool
> written in C should do the trick).
> 

One more thought about this - if ftruncate() is not enough to convince
the mmap() to not load existing data from the file, what about not
reusing the WAL segments at all? I haven't tried, though.
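
FWIW, a minimal standalone tool for the kind of benchmark I mentioned above
might look something like the sketch below (the file name, sizes and the 4kB
touch stride are arbitrary; it just mmap()s a file created either by
ftruncate() alone or by writing zeroes first, stores into every page, and
reports the elapsed time):

/* cc -O2 -o mmap_bench mmap_bench.c ; ./mmap_bench trunc|fill */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE	(1024L * 1024 * 1024)	/* 1GB, i.e. 64 default WAL segments */

int
main(int argc, char **argv)
{
	int			fd = open("mmap_bench.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);
	int			fill = (argc > 1 && strcmp(argv[1], "fill") == 0);
	char	   *buf;
	struct timespec t0, t1;

	if (fd < 0)
	{ perror("open"); return 1; }

	if (fill)
	{
		/* pre-fill with zeroes, like a recycled WAL segment */
		static char zeros[8192];
		long		written;

		for (written = 0; written < FILE_SIZE; written += sizeof(zeros))
			if (write(fd, zeros, sizeof(zeros)) != (ssize_t) sizeof(zeros))
			{ perror("write"); return 1; }
	}
	else
	{
		/* just extend the file; contents "appear as if" zero-filled */
		if (ftruncate(fd, FILE_SIZE) != 0)
		{ perror("ftruncate"); return 1; }
	}

	buf = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED)
	{ perror("mmap"); return 1; }

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long off = 0; off < FILE_SIZE; off += 4096)
		buf[off] = 1;			/* touch every OS page once */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%s: %.3f s\n", fill ? "fill" : "trunc",
		   (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

	munmap(buf, FILE_SIZE);
	close(fd);
	unlink("mmap_bench.dat");
	return 0;
}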

>>> That should not be problem with a real persistent memory device, 
>>> however (or when emulating it with DRAM). With DAX, the storage is 
>>> memory-mapped directly and there is no page cache, and no 
>>> pre-faulting.
>>
>> Yes, with filesystem DAX, there is no page cache for file data. A 
>> page fault still occurs but for each 2MiB DAX hugepage, so its 
>> overhead decreases compared with 4KiB page fault. Such a DAX
>> hugepage fault is only applied to DAX-mapped files and is different
>> from a general transparent hugepage fault.
>>
> 
> I don't follow - if there are page faults even when overwriting all the
> data, I'd say it's still an issue even with 2MB DAX pages. How big is
> the difference between 4kB and 2MB pages?
> 
> Not sure I understand how is this different from general THP fault?
> 
>>> Because of that, I'm baffled by what the 
>>> v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it 
>>> correctly, it puts the WAL buffers in a separate file, which is 
>>> stored on the NVRAM. Why? I realize that this is just a Proof of 
>>> Concept, but I'm very much not interested in anything that requires
>>> the DBA to manage a second WAL location. Did you test the mmap()
>>> patches with persistent memory hardware? Did you compare that with
>>> the pmem patchset, on the same hardware? If there's a meaningful
>>> performance difference between the two, what's causing it?
> 
>> Yes, this patchset puts the WAL buffers into the file specified by 
>> "nvwal_path" in postgresql.conf.
>>
>> Why this patchset puts the buffers into the separated file, not 
>> existing segment files in PGDATA/pg_wal, is because it reduces the 
>> overhead due to system calls such as open(), mmap(), munmap(), and 
>> close(). It open()s and mmap()s the file "nvwal_path" once, and keeps
>> that file mapped while running. On the other hand, as for the 
>> patchset mmap()ing the segment files, a backend process should 
>> munmap() and close() the current mapped file and open() and mmap() 
>> the new one for each time the inserting location for that process 
>> goes over segments. This causes the performance difference between 
>> the two.
>>
> 
> I kinda agree with Heikki here - having to manage yet another location
> for WAL data is rather inconvenient. We should aim not to make the life
> of DBAs unnecessarily difficult, IMO.
> 
> I wonder how significant the syscall overhead is - can you share
> some numbers? I don't see any such results in this thread, so I'm not
> sure if it means losing 1% or 10% throughput.
> 
> Also, maybe there are alternative ways to reduce the overhead? For
> example, we can increase the size of the WAL segment, and with 1GB
> segments we'd do 1/64 of syscalls. Or maybe we could do some of this
> asynchronously - request a segment ahead, and let another process do the
> actual work etc. so that the running process does not wait.
> 
> 
> Do I understand correctly that the patch 

Re: cutting down the TODO list thread

2020-11-23 Thread John Naylor
With the exception of "Fix /contrib/btree_gist's implementation of inet
indexing", all items above have been either moved over, or removed if they
were done already by PG13.

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE

2020-11-23 Thread Dean Rasheed
On Sun, 22 Nov 2020 at 20:58, Tom Lane  wrote:
>
> I found only one nitpicky bug: in
> findDefaultOnlyColumns, the test must be bms_is_empty(default_only_cols)
> not just default_only_cols == NULL, or it will fail to fall out early
> as intended when the first row contains some DEFAULTs but later rows
> don't.

Ah, good point. Thanks for fixing that.

> > I haven't touched the existing error messages. I think that's a
> > subject for a separate patch.
>
> Fair.  After looking around a bit, I think that getting an error
> cursor as I'd speculated about is more trouble than it's worth.
> For starters, we'd have to pass down the query string into this
> code, and there might be some ticklish issues about whether a given
> chunk of parsetree came from that string or from some rule or view.
> However, I think that just adjusting the error string would be
> helpful, as attached.

+1

> (I'm also wondering why the second case is generic ERRCODE_SYNTAX_ERROR
> and not ERRCODE_GENERATED_ALWAYS.  Didn't change it here, though.)

I can't see any reason for it to be different, and
ERRCODE_GENERATED_ALWAYS seems like the right code to use for both
cases.

Regards,
Dean




Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-17, Tom Lane wrote:

> Robert Haas  writes:
> > On Tue, Nov 17, 2020 at 10:32 AM Tom Lane  wrote:
> >> Adding the expected length to the error message might be OK though.
> 
> > Certainly seems like we should do at least that much. The current
> > message is just wrong, right?
> 
> It's incomplete, for sure.  Doesn't mention nulls either.

So let's go with this one.
diff --git a/src/backend/utils/fmgr/funcapi.c b/src/backend/utils/fmgr/funcapi.c
index b9efa77291..9696c88f24 100644
--- a/src/backend/utils/fmgr/funcapi.c
+++ b/src/backend/utils/fmgr/funcapi.c
@@ -1123,7 +1123,7 @@ get_func_arg_info(HeapTuple procTup,
 			numargs < 0 ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != OIDOID)
-			elog(ERROR, "proallargtypes is not a 1-D Oid array");
+			elog(ERROR, "proallargtypes is not a 1-D Oid array or it contains nulls");
 		Assert(numargs >= procStruct->pronargs);
 		*p_argtypes = (Oid *) palloc(numargs * sizeof(Oid));
 		memcpy(*p_argtypes, ARR_DATA_PTR(arr),
@@ -1170,7 +1170,8 @@ get_func_arg_info(HeapTuple procTup,
 			ARR_DIMS(arr)[0] != numargs ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != CHAROID)
-			elog(ERROR, "proargmodes is not a 1-D char array");
+			elog(ERROR, "proargmodes is not a 1-D char array of length %d or it contains nulls",
+ numargs);
 		*p_argmodes = (char *) palloc(numargs * sizeof(char));
 		memcpy(*p_argmodes, ARR_DATA_PTR(arr),
 			   numargs * sizeof(char));
@@ -1210,7 +1211,7 @@ get_func_trftypes(HeapTuple procTup,
 			nelems < 0 ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != OIDOID)
-			elog(ERROR, "protrftypes is not a 1-D Oid array");
+			elog(ERROR, "protrftypes is not a 1-D Oid array or it contains nulls");
 		Assert(nelems >= ((Form_pg_proc) GETSTRUCT(procTup))->pronargs);
 		*p_trftypes = (Oid *) palloc(nelems * sizeof(Oid));
 		memcpy(*p_trftypes, ARR_DATA_PTR(arr),
@@ -1261,7 +1262,7 @@ get_func_input_arg_names(char prokind,
 	if (ARR_NDIM(arr) != 1 ||
 		ARR_HASNULL(arr) ||
 		ARR_ELEMTYPE(arr) != TEXTOID)
-		elog(ERROR, "proargnames is not a 1-D text array");
+		elog(ERROR, "proargnames is not a 1-D text array or it contains nulls");
 	deconstruct_array(arr, TEXTOID, -1, false, TYPALIGN_INT,
 	  &argnames, NULL, &numargs);
 	if (proargmodes != PointerGetDatum(NULL))
@@ -1271,7 +1272,8 @@ get_func_input_arg_names(char prokind,
 			ARR_DIMS(arr)[0] != numargs ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != CHAROID)
-			elog(ERROR, "proargmodes is not a 1-D char array");
+			elog(ERROR, "proargmodes is not a 1-D char array of length %d or it contains nulls",
+ numargs);
 		argmodes = (char *) ARR_DATA_PTR(arr);
 	}
 	else
@@ -1368,14 +1370,15 @@ get_func_result_name(Oid functionId)
 			numargs < 0 ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != CHAROID)
-			elog(ERROR, "proargmodes is not a 1-D char array");
+			elog(ERROR, "proargmodes is not a 1-D char array or it contains nulls");
 		argmodes = (char *) ARR_DATA_PTR(arr);
 		arr = DatumGetArrayTypeP(proargnames);	/* ensure not toasted */
 		if (ARR_NDIM(arr) != 1 ||
 			ARR_DIMS(arr)[0] != numargs ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != TEXTOID)
-			elog(ERROR, "proargnames is not a 1-D text array");
+			elog(ERROR, "proargnames is not a 1-D text array of length %d or it contains nulls",
+ numargs);
 		deconstruct_array(arr, TEXTOID, -1, false, TYPALIGN_INT,
 		  &argnames, NULL, &nargnames);
 		Assert(nargnames == numargs);
@@ -1506,14 +1509,15 @@ build_function_result_tupdesc_d(char prokind,
 		numargs < 0 ||
 		ARR_HASNULL(arr) ||
 		ARR_ELEMTYPE(arr) != OIDOID)
-		elog(ERROR, "proallargtypes is not a 1-D Oid array");
+		elog(ERROR, "proallargtypes is not a 1-D Oid array or it contains nulls");
 	argtypes = (Oid *) ARR_DATA_PTR(arr);
 	arr = DatumGetArrayTypeP(proargmodes);	/* ensure not toasted */
 	if (ARR_NDIM(arr) != 1 ||
 		ARR_DIMS(arr)[0] != numargs ||
 		ARR_HASNULL(arr) ||
 		ARR_ELEMTYPE(arr) != CHAROID)
-		elog(ERROR, "proargmodes is not a 1-D char array");
+		elog(ERROR, "proargmodes is not a 1-D char array of length %d or it contains nulls",
+			 numargs);
 	argmodes = (char *) ARR_DATA_PTR(arr);
 	if (proargnames != PointerGetDatum(NULL))
 	{
@@ -1522,7 +1526,8 @@ build_function_result_tupdesc_d(char prokind,
 			ARR_DIMS(arr)[0] != numargs ||
 			ARR_HASNULL(arr) ||
 			ARR_ELEMTYPE(arr) != TEXTOID)
-			elog(ERROR, "proargnames is not a 1-D text array");
+			elog(ERROR, "proargnames is not a 1-D text array of length %d or it contains nulls",
+ numargs);
 		deconstruct_array(arr, TEXTOID, -1, false, TYPALIGN_INT,
 		  &argnames, NULL, &nargnames);
 		Assert(nargnames == numargs);


Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> So let's go with this one.

WFM.

regards, tom lane




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Fri, Nov 20, 2020 at 11:08:27AM -0500, Stephen Frost wrote:
> > David Steele (da...@pgmasters.net) wrote:
> >> Our current plan for pgBackRest:
> >> 
> >> 1) Remove the LSN check as you have done in your patch and when rechecking
> >> see if the page has become valid *or* the LSN is ascending.
> >> 2) Check the LSN against the max LSN reported by PostgreSQL to make sure it
> >> is valid.
> > 
> > Yup, that's my recollection also as to our plans for how to improve
> > things here.
> >
> >> These do completely rule out any type of corruption, but they certainly
> >> narrows the possibility by a lot.
> > 
> > *don't :)
> 
> Have you considered the possibility of only using pd_checksums for the
> validation?  This is the only source of truth in the page header we
> can rely on to validate the full contents of the page, so if the logic
> relies on anything but the checksum then you expose the logic to risks
> of reporting pages as corrupted while they were just torn, or just
> miss corrupted pages, which is what we should avoid for such things.
> Both are bad.

There's no doubt that you'll get checksum failures from time to time,
and that it's an entirely valid case if the page is being concurrently
written, so we have to decide if we should be reporting those failures,
retrying, or what.

It's not at all clear what you're suggesting here as to how you can use
'only' the checksum.

> >> As for your patch, it mostly looks good but my objection is that a page may
> >> be reported as invalid after 5 retries when in fact it may just be very 
> >> hot.
> > 
> > Yeah.. while unlikely that it'd actually get written out that much, it
> > does seem at least possible.
> > 
> >> Maybe checking for an ascending LSN is a good idea there as well? At least
> >> in that case we could issue a different warning, instead of "checksum
> >> verification failed" perhaps "checksum verification skipped due to
> >> concurrent modifications".
> > 
> > +1.
> 
> I don't quite understand how you can make sure that the page is not
> corrupted here?  It could be possible that the last 4kB of a 8kB page
> got corrupted, where the header had valid data but failing the
> checksum verification.

Not sure that the proposed approach was really understood here.
Specifically what we're talking about is:

- read(), save the LSN seen
- calculate checksum, get a failure
- re-read(), compare LSN to prior LSN, maybe also re-check checksum

If checksum fails again AND the LSN has changed and increased (and
perhaps otherwise seems reasonable) then we have at least a bit more
confidence that the failing checksum is due to the page being rewritten
concurrently and not due to latent storage corruption, which is the
specific distinction that we're trying to discern here.
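
To make that concrete, here is a minimal sketch of the retry logic being
described (illustrative only -- not the actual pg_basebackup or pgBackRest
code; read_block(), page_lsn() and page_checksum_ok() are hypothetical
helpers standing in for the real block-read, page-header and checksum
routines):

static bool
block_ok_with_retry(int fd, int blkno, char *page)
{
	uint64		first_lsn;

	if (!read_block(fd, blkno, page))
		return false;			/* I/O error, handled by the caller */

	if (page_checksum_ok(page))
		return true;

	first_lsn = page_lsn(page);

	/* re-read once: a concurrently rewritten page should show a newer LSN */
	if (!read_block(fd, blkno, page))
		return false;

	if (page_checksum_ok(page))
		return true;

	if (page_lsn(page) > first_lsn)
		return true;			/* concurrent write; WAL replay will fix it */

	return false;				/* still bad and LSN unchanged: report it */
}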

> So if you are not careful you could have at
> hand a corrupted page discarded because of it failed the retry
> multiple times in a row.

The point of checking for an ascending LSN is to see if the page is
being concurrently modified.  If it is, then we actually don't care if
the page is corrupted because it's going to be rewritten during WAL
replay as part of the restore process.

> The only method I can think as being really
> reliable is based on two facts:
> - Do a check only on pd_checksums, as that validates the full contents
> of the page.
> - When doing a retry, make sure that there is no concurrent I/O
> activity in the shared buffers.  This requires an API we don't have
> yet.

I don't think we actually want the backup process to start locking
pages, which it seems like is what you're suggesting here..?  Trying to
do a check without a lock and without having PG end up reading the page
back in if it had been evicted due to pressure seems likely to be hard
to do reliably and without race conditions complicating things.

The other 100% reliable approach, as David discussed before, is to be
scanning the WAL at the same time and to ignore any checksum failures
for pages that we know are in the WAL with FPIs.  Unfortunately, reading
WAL for all different versions of PG is a fair bit of work and we
haven't quite gotten to biting that off yet (though it's on the
roadmap), and the core code certainly doesn't help us in that regard
since any given version only supports the current major version WAL (an
issue pg_basebackup would also have to deal with it, were it to be
modified to use such an approach and to continue working with older
versions of PG..).  In a similar vein to what we do (in pgbackrest) with
pg_control, we expect to develop our own library basically vendorizing
WAL reading code from all the major versions of PG which we support in
order to track FPIs, restore points, all the kinds of potential recovery
targets, and other useful information.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-18, Andres Freund wrote:

> In 13 this is:
>   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>   MyPgXact->vacuumFlags |= PROC_IN_VACUUM;
>   if (params->is_wraparound)
>   MyPgXact->vacuumFlags |= PROC_VACUUM_FOR_WRAPAROUND;
>   LWLockRelease(ProcArrayLock);
> 
> Lowering this to a shared lock doesn't seem right, at least without a
> detailed comment explaining why it's safe. Because GetSnapshotData() etc
> look at all procs with just an LW_SHARED ProcArrayLock, changing
> vacuumFlags without a lock means that two concurrent horizon
> computations could come to a different result.
> 
> I'm not saying it's definitely wrong to relax things here, but I'm not
> sure we've evaluated it sufficiently.

True.  Let's evaluate it.

I changed three spots where statusFlags bits are written:

a) StartupDecodingContext: sets PROC_IN_LOGICAL_DECODING
b) ReplicationSlotRelease: removes PROC_IN_LOGICAL_DECODING
c) vacuum_rel: sets PROC_IN_VACUUM and potentially
   PROC_VACUUM_FOR_WRAPAROUND

Who reads these flags?

PROC_IN_LOGICAL_DECODING is read by:
 * ComputeXidHorizons
 * GetSnapshotData

PROC_IN_VACUUM is read by:
 * GetCurrentVirtualXIDs
 * ComputeXidHorizons
 * GetSnapshotData
 * ProcArrayInstallImportedXmin

PROC_VACUUM_FOR_WRAPAROUND is read by:
 * ProcSleep


ProcSleep:  no bug here.
The flags are checked to see if we should kill() the process (used when
autovac blocks some other process).  The lock here appears to be
inconsequential, since we release it before we do kill(); so strictly
speaking, there is still a race condition where the autovac process
could reset the flag after we read it and before we get a chance to
signal it.  The lock level autovac uses to set the flag is not relevant
either.

ProcArrayInstallImportedXmin:
This one is just searching for a matching backend; not affected by the
flags.

PROC_IN_LOGICAL_DECODING:
Oddly enough, I think the reset of PROC_IN_LOGICAL_DECODING in
ReplicationSlotRelease might be the most problematic one of the lot.
That's because a proc's xmin that had been ignored all along by
ComputeXidHorizons, will now be included in the computation.  Adding
asserts that proc->xmin and proc->xid are InvalidXid by the time we
reset the flag, I got hits in pg_basebackup, test_decoding and
subscription tests.  I think it's OK for ComputeXidHorizons (since it
just means that a vacuum that reads a later horizon will remove fewer rows.)  But
in GetSnapshotData it is just not correct to have the Xmin go backwards.

Therefore it seems to me that this code has a bug independently of the
lock level used.


GetCurrentVirtualXIDs, ComputeXidHorizons, GetSnapshotData:

In these cases, what we need is that the code computes some xmin (or
equivalent computation) based on a set of transactions that exclude
those marked with the flags.  The behavior we want is that if some
transaction is marked as vacuum, we ignore the Xid/Xmin *if there is
one*.  In other words, if there's no Xid/Xmin, then the flag is not
important.  So if we can ensure that the flag is set first, and the
Xid/xmin is installed later, that's sufficient correctness and we don't
need to hold exclusive lock.  But if we can't ensure that, then we must
use exclusive lock, because otherwise we risk another process seeing our
Xid first and not our flag, which would be bad.

In other words, my conclusion is that there definitely is a bug here and
I am going to restore the use of exclusive lock for setting the
statusFlags.


GetSnapshotData has an additional difficulty -- we do the
UINT32_ACCESS_ONCE(ProcGlobal->xid[i]) read *before* testing the bit. 
So it's not valid to set the bit without locking out GSD, regardless of
any barriers we had; if we want this to be safe, we'd have to change
this so that the flag is read first, and we read the xid only
afterwards, with a read barrier.

I *think* we could relax the lock if we had a write barrier in
between: set the flag first, issue a write barrier, set the Xid.
(I have to admit I'm confused about what needs to happen in the read
case: read the bit first, potentially skip the PGPROC entry; but can we
read the Xid ahead of reading the flag, and if we do read the xid
afterwards, do we need a read barrier in between?)
Given this uncertainty, I'm not proposing to relax the lock from
exclusive to shared.
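
As a concrete illustration of the store ordering in question (not committed
code, and with simplified field names -- the patch keeps the exclusive lock
instead of relying on this):

	/* writer side */
	proc->statusFlags |= PROC_IN_VACUUM;	/* 1. publish the flag */
	pg_write_barrier();						/* 2. order it before the xid */
	proc->xid = newxid;						/* 3. only now install the xid */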


I didn't get around to reading ComputeXidHorizons, but it seems like
it'd have the same problem as GSD.




Re: Parallel plans and "union all" subquery

2020-11-23 Thread Luc Vlaming

On 23-11-2020 13:17, Phil Florent wrote:

Hi Greg,

The implicit conversion was the cause of the non-parallel plan; thanks 
for the explanation and the workarounds. It can cause a huge difference 
in terms of performance, so I will pass the information on to our developers.


Regards,

Phil




*From:* Greg Nancarrow 
*Sent:* Monday, 23 November 2020 06:04
*To:* Phil Florent 
*Cc:* pgsql-hackers@lists.postgresql.org 


*Subject:* Re: Parallel plans and "union all" subquery
On Sun, Nov 22, 2020 at 11:51 PM Phil Florent  
wrote:



Hi,


I have a question about parallel plans. I also posted it on the general list 
but perhaps it's a question for hackers. Here is my test case:


explain
select count(*)
from (select
n1
from drop_me
union all
values(1)) ua;


                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=2934739.24..2934739.25 rows=1 width=8)
   ->  Append  (cost=0.00..2059737.83 rows=7113 width=32)
         ->  Seq Scan on drop_me  (cost=0.00..1009736.12 rows=7112 width=6)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=32)
               ->  Result  (cost=0.00..0.01 rows=1 width=4)
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true


No parallel plan, 2s6


I read the documentation but I don't understand the reason for the non-parallel seq 
scan of drop_me in the last case?



Without debugging this, it looks to me that the UNION type resolution
isn't working as well as it possibly could in this case, for the
generation of a parallel plan. I found that with a minor tweak to your
SQL, either for the table creation or query, it will produce a
parallel plan.

Noting that currently you're creating the drop_me table with a
"numeric" column, you can either:

(1) Change the table creation

FROM:
     create unlogged table drop_me as select generate_series(1,7e7) n1;
TO:
     create unlogged table drop_me as select generate_series(1,7e7)::int n1;


OR


(2) Change the query

FROM:
     explain
     select count(*)
     from (select
     n1
     from drop_me
     union all
     values(1)) ua;

TO:

     explain
     select count(*)
     from (select
     n1
     from drop_me
     union all
     values(1::numeric)) ua;


     QUERY PLAN

  Finalize Aggregate  (cost=821152.71..821152.72 rows=1 width=8)
    ->  Gather  (cost=821152.50..821152.71 rows=2 width=8)
  Workers Planned: 2
  ->  Partial Aggregate  (cost=820152.50..820152.51 rows=1 width=8)
    ->  Parallel Append  (cost=0.00..747235.71 rows=29166714 
width=0)

  ->  Result  (cost=0.00..0.01 rows=1 width=0)
  ->  Parallel Seq Scan on drop_me
(cost=0.00..601402.13 rows=29166713 width=0)
(7 rows)


Regards,
Greg Nancarrow
Fujitsu Australia


Hi,

For this problem there is a patch I created, registered under 
https://commitfest.postgresql.org/30/2787/, that should fix it without 
any workarounds. Maybe someone can take a look at it?


Regards,
Luc
Swarm64




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> On 21.11.2020 04:30, Michael Paquier wrote:
> >The only method I can think as being really
> >reliable is based on two facts:
> >- Do a check only on pd_checksums, as that validates the full contents
> >of the page.
> >- When doing a retry, make sure that there is no concurrent I/O
> >activity in the shared buffers.  This requires an API we don't have
> >yet.
> 
> It seems reasonable to me to rely on checksums only.
> 
> As for retry, I think that API for concurrent I/O will be complicated.
> Instead, we can introduce a function to read the page directly from shared
> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> solution to me. Do you see any possible problems with it?

We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great, and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.

Also- what is the point of reading the page from shared buffers
anyway..?  All we need to do is prove that the page will be rewritten
during WAL replay.  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: cutting down the TODO list thread

2020-11-23 Thread Bruce Momjian
On Mon, Nov 23, 2020 at 10:41:25AM -0400, John Naylor wrote:
> With the exception of "Fix /contrib/btree_gist's implementation of inet
> indexing", all items above have been either moved over, or removed if they 
> were
> done already by PG13.

Thanks.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> PROC_IN_LOGICAL_DECODING:
> Oddly enough, I think the reset of PROC_IN_LOGICAL_DECODING in
> ReplicationSlotRelease might be the most problematic one of the lot.
> That's because a proc's xmin that had been ignored all along by
> ComputeXidHorizons, will now be included in the computation.  Adding
> asserts that proc->xmin and proc->xid are InvalidXid by the time we
> reset the flag, I got hits in pg_basebackup, test_decoding and
> subscription tests.  I think it's OK for ComputeXidHorizons (since it
> just means that a vacuum that reads a later will remove less rows.)  But
> in GetSnapshotData it is just not correct to have the Xmin go backwards.

> Therefore it seems to me that this code has a bug independently of the
> lock level used.

That is only a bug if the flags are *cleared* in a way that's not
atomic with clearing the transaction's xid/xmin, no?  I agree that
once set, the flag had better stay set till transaction end, but
that's not what's at stake here.

> GetCurrentVirtualXIDs, ComputeXidHorizons, GetSnapshotData:

> In these cases, what we need is that the code computes some xmin (or
> equivalent computation) based on a set of transactions that exclude
> those marked with the flags.  The behavior we want is that if some
> transaction is marked as vacuum, we ignore the Xid/Xmin *if there is
> one*.  In other words, if there's no Xid/Xmin, then the flag is not
> important.  So if we can ensure that the flag is set first, and the
> Xid/xmin is installed later, that's sufficient correctness and we don't
> need to hold exclusive lock.  But if we can't ensure that, then we must
> use exclusive lock, because otherwise we risk another process seeing our
> Xid first and not our flag, which would be bad.

I don't buy this either.  You get the same result if someone looks just
before you take the ProcArrayLock to set the flag.  So if there's a
problem, it's inherent in the way that the flags are defined or used;
the strength of lock used in this stanza won't affect it.

regards, tom lane




Re: abstract Unix-domain sockets

2020-11-23 Thread David G. Johnston
On Mon, Nov 23, 2020 at 6:50 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2020-11-20 18:23, David G. Johnston wrote:
> > If there is dead code there is an underlying problem to
> > address/discover, not just removing the dead code.  In this case are we
> > saying that a new server won’t ever fail to start because the socket
> > file exists but instead will just clobber the file with its own?
>
> Yes.  (In practice, there will be an error with respect to the lock file
> before you even get to that question, but that is different code
> elsewhere.)
>
> > Because given that error, and a server process that failed to clean up
> > after itself, the correction to take would indeed seem to be to manually
> > remove the file as the hint says.  IOW, fix the code, not the message?
>
> I don't understand that.
>
>
So presently there is no functioning code to prevent two PostgreSQL
instances from using the same socket so long as they do not also use the
same data directory?  We only handle the case of an unclean crash - where
the pid and socket are both left behind - having the system tell the user
to remove the pid lock file but then auto-replacing the socket (I was
conflating the behavior with the pid lock file and the socket file).

I would expect that we handle port misconfiguration also, by not
auto-replacing the socket and instead have the existing error message (with
modified hint) remain behind.  This provides behavior consistent with TCP
port binding.  Or is it the case that we always attempt to bind the TCP/IP
port, regardless of the presence of a socket file, in which case the
failure for port binding does cover the socket situation as well?  If this
is the case, pointing that out in [1] and a code comment, while removing
that particular error as "dead code", would work.

David J.

[1]
https://www.postgresql.org/docs/13/server-start.html#SERVER-START-FAILURES


Re: [bug+patch] Inserting DEFAULT into generated columns from VALUES RTE

2020-11-23 Thread Tom Lane
Dean Rasheed  writes:
> On Sun, 22 Nov 2020 at 20:58, Tom Lane  wrote:
>> However, I think that just adjusting the error string would be
>> helpful, as attached.

> +1

>> (I'm also wondering why the second case is generic ERRCODE_SYNTAX_ERROR
>> and not ERRCODE_GENERATED_ALWAYS.  Didn't change it here, though.)

> I can't see any reason for it to be different, and
> ERRCODE_GENERATED_ALWAYS seems like the right code to use for both
> cases.

Sounds good to me; pushed that way.

regards, tom lane




Re: pg_proc.dat "proargmodes is not a 1-D char array"

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Alvaro Herrera  writes:
> > So let's go with this one.
> 
> WFM.

Thanks, pushed.




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-23 Thread Alexey Kondratov

Hi,

On 2020-11-23 09:48, Bharath Rupireddy wrote:


Here is how I'm making 4 separate patches:

1. new function and it's documentation.
2. GUC and it's documentation.
3. server level option and it's documentation.
4. test cases for all of the above patches.



Hi, I'm attaching the patches here. Note that, though the code changes
for this feature are small, I divided them up as separate patches to
make review easy.

v1-0001-postgres_fdw-function-to-discard-cached-connections.patch



This patch looks pretty straightforward to me, but there are some 
things to be addressed IMO:


+   server = GetForeignServerByName(servername, true);
+
+   if (server != NULL)
+   {

Yes, you return false if no server was found, but I think it is worth 
throwing an error in this case, as dblink does, for example, in 
dblink_disconnect().


+ result = disconnect_cached_connections(FOREIGNSERVEROID,
+hashvalue,
+false);

+   if (all || (!all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue))
+   {
+   if (entry->conn != NULL &&
+   !all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue)

These conditions look bulky to me. First, you pass FOREIGNSERVEROID to 
disconnect_cached_connections(), but it really just duplicates the 'all' 
flag: when it is FOREIGNSERVEROID, then 'all == false'; when it is -1, 
then 'all == true'. Those are the only two calls of 
disconnect_cached_connections(), so it seems we should keep only the 
'all' flag, at least for now, shouldn't we?


Second, I think we should just rewrite this if statement to simplify it 
and make it more readable, e.g.:


if ((all || entry->server_hashvalue == hashvalue) &&
entry->conn != NULL)
{
disconnect_pg_server(entry);
result = true;
}

+   if (all)
+   {
+   hash_destroy(ConnectionHash);
+   ConnectionHash = NULL;
+   result = true;
+   }

Also, I am still not sure that it is a good idea to destroy the whole 
cache even in the 'all' case, but maybe others will have a different 
opinion.




v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch



+   entry->changing_xact_state) ||
+   (entry->used_in_current_xact &&
+   !keep_connections))

I am not sure, but I think that instead of adding this additional flag 
to the ConnCacheEntry structure we can look at entry->xact_depth and use 
a local


bool used_in_current_xact = entry->xact_depth > 0;

for exactly the same purpose. Since entry->xact_depth is reset to zero 
at the end of a transaction, a non-zero value means the entry was used 
in the current transaction. It is set to 1 by begin_remote_xact(), 
called from GetConnection(), so everything seems to be fine.
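
For illustration, the discard check could then look roughly like this (a
sketch only, paraphrasing the patch's condition rather than quoting it):

bool		used_in_current_xact = (entry->xact_depth > 0);

if (entry->conn != NULL &&
	(entry->changing_xact_state ||
	 (used_in_current_xact && !keep_connections)))
{
	elog(DEBUG3, "discarding connection %p", entry->conn);
	disconnect_pg_server(entry);
}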


Otherwise, both patches seem to work as expected. I am going to 
have a look at the last two patches a bit later.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: pg_upgrade fails with non-standard ACL

2020-11-23 Thread Grigory Smolkin
I tested this patch by running several upgrades from different major 
versions and with different setups.
ACLs that are impossible to apply are detected and reported, so it 
looks good to me.






Re: Bogus documentation for bogus geometric operators

2020-11-23 Thread Tom Lane
Pavel Borisov  writes:
> I've made another check of the final state and suppose it is ready to be 
> pushed.

Sounds good, pushed.

regards, tom lane




Re: Online checksums patch - once again

2020-11-23 Thread Heikki Linnakangas

On 17/11/2020 10:56, Daniel Gustafsson wrote:

On 5 Oct 2020, at 13:36, Heikki Linnakangas  wrote:
2. The signaling between enable_data_checksums() and the launcher process looks 
funny to me. The general idea seems to be that enable_data_checksums() just 
starts the launcher process, and the launcher process figures out what it need 
to do and makes all the changes to the global state. But then there's this 
violation of the idea: enable_data_checksums() checks 
DataChecksumsOnInProgress(), and tells the launcher process whether it should 
continue a previously crashed operation or start from scratch. I think it would 
be much cleaner if the launcher process figured that out itself, and 
enable_data_checksums() would just tell the launcher what the target state is.

enable_data_checksums() and disable_data_checksums() seem prone to race 
conditions. If you call enable_data_checksums() in two backends concurrently, 
depending on the timing, there are two possible outcomes:

a) One call returns true, and launches the background process. The other call 
returns false.

b) Both calls return true, but one of them emits a "NOTICE: data checksums worker is 
already running".

In disable_data_checksum() imagine what happens if another backend calls 
enable_data_checksums() in between the ShutdownDatachecksumsWorkerIfRunning() 
and SetDataChecksumsOff() calls.


I've reworked this in the attached such that the enable_ and disable_ functions
merely call into the launcher with the desired outcome, and the launcher is
responsible for figuring out the rest.  The datachecksumworker is now the sole
place which initiates a state transfer.


Well, you still fill the DatachecksumsWorkerShmem->operations array in 
the backend process that launches the datachecksumworker, not in the 
worker process. I find that still a bit surprising, but I believe it works.



	/*
	 * Mark the buffer as dirty and force a full page write.  We have to
	 * re-write the page to WAL even if the checksum hasn't changed,
	 * because if there is a replica it might have a slightly different
	 * version of the page with an invalid checksum, caused by unlogged
	 * changes (e.g. hintbits) on the master happening while checksums
	 * were off. This can happen if there was a valid checksum on the page
	 * at one point in the past, so only when checksums are first on, then
	 * off, and then turned on again. Iff wal_level is set to "minimal",
	 * this could be avoided iff the checksum is calculated to be correct.
	 */
	START_CRIT_SECTION();
	MarkBufferDirty(buf);
	log_newpage_buffer(buf, false);
	END_CRIT_SECTION();


It's really unfortunate that we have to dirty the page even if the checksum 
already happens to match. Could we only do the log_newpage_buffer() call and 
skip MarkBufferDirty() in that case?


I think we can, but I've intentionally stayed away from such optimizations
until the basic version of the patch was deemed safe and approaching done.
It's complicated enough as it is IMO.


Fair enough.


The attached fixes the above plus a few other small things found while hacking
on this version.


I haven't read through the whole patch, but a few random comments below, 
in no particular order:


pg_enable/disable_data_checksums() should perform a superuser-check. I 
don't think we want to expose them to any users.
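
A minimal sketch of such a check, following the usual backend pattern (the
exact error message wording here is just illustrative):

	if (!superuser())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be superuser to alter data checksum state")));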



+/*
+ * Local state for Controlfile data_checksum_version. After initialization,
+ * this is only updated when absorbing a procsignal barrier during interrupt
+ * processing. Thus, it can be read by backends without the need for a lock.
+ * Possible values are the checksum versions defined in storage/bufpage.h and
+ * zero for when checksums are disabled.
+ */
+static uint32 LocalDataChecksumVersion = 0;


The comment is a bit confusing: "Thus, it can be read by backends 
without the need for a lock". Since it's a variable in backend-private 
memory, it can only be read by the same backend, not "backends". And 
that's also why you don't need a lock, not because it's updated during 
interrupt processing. I understand how this works, but maybe rephrase 
the comment a bit.



+/*
+ * DataChecksumsOnInProgress
+ * Returns whether data checksums are being enabled
+ *
+ * Most callsites shouldn't need to worry about the "inprogress" states, since
+ * they should check the requirement for verification or writing. Some low-
+ * level callsites dealing with page writes need however to know. It's also
+ * used to check for aborted checksum processing which need to be restarted.
  */
 bool
-DataChecksumsEnabled(void)
+DataChecksumsOnInProgress(void)
+{
+   return (LocalDataChecksumVersion == 
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION);
+}


s/need/needs/. The whole paragraph feels a bit awkwardly worded in

Re: truncating timestamps on arbitrary intervals

2020-11-23 Thread John Naylor
On Thu, Nov 12, 2020 at 9:56 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> On 2020-06-30 06:34, John Naylor wrote:
> > In v9, I've simplified the patch somewhat to make it easier for future
> > work to build on.
> >
> > - When truncating on month-or-greater intervals, require the origin to
> > align on month. This removes the need to handle weird corner cases
> > that have no straightforward behavior.
> > - Remove hackish and possibly broken code to allow origin to be after
> > the input timestamp. The default origin is Jan 1, 1 AD, so only AD
> > dates will behave correctly by default. This is not enforced for now,
> > since it may be desirable to find a way to get this to work in a nicer
> > way.
> > - Rebase docs over PG13 formatting changes.
>
> This looks pretty solid now.  Are there any more corner cases and other
> areas with unclear behavior that you are aware of?

Hi Peter,

Thanks for taking a look!

I believe there are no known corner cases aside from not throwing an error
if origin > input, but I'll revisit that when we are more firm on what
features we want support.

> A couple of thoughts:
>
> - After reading the discussion a few times, I'm not so sure anymore
> whether making this a cousin of date_trunc is the right way to go.  As
> you mentioned, there are some behaviors specific to date_trunc that
> don't really make sense in date_trunc_interval, and maybe we'll have
> more of those.

As for the behaviors, I'm not sure exactly what you were thinking of,
but here are two issues off the top of my head:

- If the new functions are considered variants of date_trunc(), there is
the expectation that the options work the same way, in particular the
timezone parameter. You asked specifically about that below, so I'll
address that separately.
- In the "week" case, the boundary position depends on the origin, since a
week-long interval is just 7 days.

> Also, date_trunc_interval isn't exactly a handy name.
> Maybe something to think about.  It's obviously fairly straightforward
> to change it.

Effectively, it puts timestamps into bins, so maybe date_bin() or something
like that?

> - There were various issues with the stride interval having months and
> years.  I'm not sure we even need that.  It could be omitted unless you
> are confident that your implementation is now sufficient.

Months and years were a bit tricky, so I'd be happy to leave that out if
there is not much demand for it. date_trunc() already has quarters,
decades, centuries, and millennia.

> - Also, negative intervals could be prohibited, but I suppose that
> matters less.

Good for the sake of completeness. I think they happen to work in v9 by
accident, but it would be better not to expose that.

> - I'm curious about the origin being set to 0001-01-01.  This seems to
> work correctly in that it sets the origin to a Monday, which is what we
> wanted, but according to Google that day was a Saturday.  Something to
> do with Julian vs. Gregorian calendar?

Right, working backwards from our calendar today, it's Monday, but at the
time it would theoretically be Saturday, barring leap year miscalculations.

> Maybe we should choose a date
> that is a bit more recent and easier to reason with.

2001-01-01 would also be a Monday aligned with centuries and millennia, so
that would be my next suggestion. If we don't care to match with
date_trunc() on those larger units, we could also use 1900-01-01, so the
vast majority of dates in databases are after the origin.

> - Then again, I'm thinking that maybe we should make the origin
> mandatory.  Otherwise, the default answers when having strides larger
> than a day are entirely arbitrary, e.g.,
>
> => select date_trunc_interval('10 year', '0196-05-20 BC'::timestamp);
> 0190-01-01 00:00:00 BC
>
> => select date_trunc_interval('10 year', '0196-05-20 AD'::timestamp);
> 0191-01-01 00:00:00

Right. In the first case, the default origin is also after the input, and
crosses the AD/BC boundary. Tricky to get right.

> Perhaps the origin could be defaulted if the interval is less than a day
> or something like that.

If we didn't allow months and years to be units, it seems the default would
always make sense?

> - I'm wondering whether we need the date_trunc_interval(interval,
> timestamptz, timezone) variant.  Isn't that the same as
> date_trunc_interval(foo AT ZONE xyz, value)?

I based this on 600b04d6b5ef6 for date_trunc(), whose message states:

date_trunc(field, timestamptz, zone_name)

is the same as

date_trunc(field, timestamptz at time zone zone_name) at time zone zone_name

so without the shorthand, you need to specify the timezone twice, once for
the calculation, and once for the output.
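
For example (using the function name and argument order from the current
patch; the timestamps and zone are only illustrative):

select date_trunc_interval('15 minutes',
                           timestamptz '2020-02-11 15:44:17+00',
                           'Europe/Paris');

-- spelled out without the shorthand, the zone appears twice:
select date_trunc_interval('15 minutes',
                           timestamptz '2020-02-11 15:44:17+00'
                               at time zone 'Europe/Paris')
       at time zone 'Europe/Paris';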

--
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Connection using ODBC and SSL

2020-11-23 Thread Corbit, Dann
Thank you for the assistance.


From: Andrew Dunstan 
Sent: Saturday, November 21, 2020 11:14
To: Corbit, Dann ; PostgreSQL Developers 

Cc: Luton, Bill ; Fifer, Brian 
; Lao, Alexander 
Subject: Re: Connection using ODBC and SSL


On 11/20/20 4:54 PM, Corbit, Dann wrote:
>
> I would like to have all my certificates and keys on the same machine
> (localhost for local connections and dcorbit for tcp/ip).
> I found a couple tutorials and tried them but it failed.
> I saw one document that said the common name should be the postgres
> user name and that it should also be the connecting machine name.  Is
> that correct?
> Is there a document or tutorial that explains the correct steps?



I did a webinar about a year ago that went into some detail about what
you need in the CN, where the certificates go, etc.


See

(Yes, this is a corporate webinar, sorry about that)




> Equally important, is there a way to get more complete diagnostics
> when something goes wrong (like WHY did the certificate verify fail)?
>

The diagnostics in the Postgres log are usually fairly explanatory.



cheers


andrew



Re: enable_incremental_sort changes query behavior

2020-11-23 Thread Tom Lane
James Coleman  writes:
> But I think that still leaves something missing: after all,
> prepare_sort_from_pathkeys does know how to insert new target list
> entries, so something either there (or in the caller/how its called)
> has to be enforcing an apparently implicit rule about what point in
> the tree it's safe to do such. Or even just no path generation had
> ever considered it before (that feels unsatisfactory as an explanation
> to me, because it feels more implicit than I'd like, but...)

Hi guys,

I hadn't been paying any attention to this thread, but Robert asked
me to take a look.  A few comments:

1. James was wondering, far upthread, why we would do projections
pre-sort or post-sort.  I think the answers can be found by studying
planner.c's make_sort_input_target(), which separates out what we want
to do pre-sort and post-sort.  (There, the "sort" is envisioned as a
standalone Sort node atop all the scan/join steps; but its decisions
nonetheless constrain what will be considered for sorting that's
implemented further down.)  It has a *very* long introductory comment
laying out all the considerations.

2. Robert is correct that under normal circumstances, the targetlists of
both baserels and join rels contain only Vars.  Any computations we have
to do are postponed to the final top-level targetlist, whether they are
volatile or not.  The fact that this policy is applied regardless of
volatility may explain why James isn't seeing volatility checks where he
expected them.  The core (and, I think, only) exception to this is that
an expression that is a sort or group key has to be evaluated earlier.
But even then, it won't be pushed down as far as the reltargets of any
scan or join relations; it'll be computed after the final join.

3. If we have a volatile sort/group key, that constrains the set of plans
we can validly generate, because the key expression mustn't be evaluated
redundantly.  With the addition of parallelism, a non-parallel-safe
sort/group key expression likewise constrains the set of valid plans,
even if it's not considered volatile.

4. In the past, the only way we'd consider non-Var sort keys in any way
during scan/join planning was if (a) a relation had an expression index
matching a requested sort key; of course, that's a-fortiori going to be
a non-volatile expression.  Or (b) we needed to sort in order to provide
suitable input for a merge join ... but again, volatile expressions
aren't considered candidates for mergejoin.  So there basically wasn't
any need to consider sorting by volatiles below the top level sort/group
processing, and that again contributes to why you don't see explicit
tests in that area.  I'm not sure offhand whether the parallel-query
patches added anything to guard against nonvolatile-but-parallel-unsafe
sort expressions.  If not, there might be live bugs there independently
of incremental sort; or that might be unreachable because of overall
limitations on parallelism.

5. While I've not dug far enough into the code to identify the exact
issue, the impression I have about this bug and the parallel-unsafe
subplan bug is that the incremental sort code is generating Paths
without due regard to point 3.  If it is making paths based on explicit
sorts for final-stage pathkeys, independently of either expression
indexes or mergejoin requirements, that could well explain why it's
got problems that weren't seen before.

6. I did look at ebb7ae839, and I suspect that the last part of
find_em_expr_usable_for_sorting_rel(), ie the search of the reltarget
list, is dead code.  There is no case where a volatile expression would
appear there, per point 2.  The committed regression test cases
certainly do not provide a counterexample: a look at the code coverage
report will show you that that loop never finds a match in any
regression test case.  I'd be inclined to flush that code and just
say that we won't return volatile expressions here.

regards, tom lane




Re: error_severity of brin work item

2020-11-23 Thread Alvaro Herrera
I think this formulation (attached v3) has fewer moving parts.

However, now that I did that, I wonder if this is really the best
approach to solve this problem.  Maybe instead of doing this at the BRIN
level, it should be handled at the autovac level, by having the worker
copy the work-item to local memory and remove it from the shared list as
soon as it is in progress.  That way, if *any* error occurs while trying
to execute it, it will go away instead of being retried for all
eternity.

Preliminary patch for that attached as autovacuum-workitem.patch.

I would propose to clean that up to apply instead of your proposed fix.
From e0b457866e503a1b70b94c540147026b652fb019 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH v3] Avoid errors in brin summarization..

..which can happen if an index is reindexed concurrently
---
 src/backend/access/brin/brin.c | 37 --
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..4d012ebd76 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -33,6 +33,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "storage/lmgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
@@ -886,21 +887,35 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 
 	/*
 	 * We must lock table before index to avoid deadlocks.  However, if the
-	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * passed indexoid isn't an index, then IndexGetRelation(false) would
+	 * emit a not-very-helpful error message.  Instead, postpone complaining
+	 * until the is-it-an-index test, below.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
-	if (OidIsValid(heapoid))
-		heapRel = table_open(heapoid, ShareUpdateExclusiveLock);
-	else
-		heapRel = NULL;
+	if (heapoid != InvalidOid)
+		LockRelationOid(heapoid, ShareUpdateExclusiveLock);
 
-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	/* Now we can open the index; silently skip if it's gone. */
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+	if (!indexRel)
+	{
+		if (heapoid != InvalidOid)
+			UnlockRelationOid(heapoid, ShareUpdateExclusiveLock);
+		PG_RETURN_INT32(0);
+	}
+
+	/* If passed a non-index, fail noisily */
+	if (indexRel->rd_rel->relkind != RELKIND_INDEX)
+		ereport(ERROR,
+(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not an index",
+		RelationGetRelationName(indexRel))));
+
+	/* Now we can open the table */
+	heapRel = table_open(heapoid, NoLock);
 
 	/* Must be a BRIN index */
-	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
-		indexRel->rd_rel->relam != BRIN_AM_OID)
+	if (indexRel->rd_rel->relam != BRIN_AM_OID)
 		ereport(ERROR,
 (errcode(ERRCODE_WRONG_OBJECT_TYPE),
  errmsg("\"%s\" is not a BRIN index",
@@ -916,7 +931,7 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	 * barely possible that a race against an index drop/recreation could have
 	 * netted us the wrong table.  Recheck.
 	 */
-	if (heapRel == NULL || heapoid != IndexGetRelation(indexoid, false))
+	if (heapoid != IndexGetRelation(indexoid, false))
 		ereport(ERROR,
 (errcode(ERRCODE_UNDEFINED_TABLE),
  errmsg("could not open parent table of index %s",
-- 
2.20.1

commit 2326df869ee8516b8b1f0e01930946e9a63800b5
Author: Alvaro Herrera 
AuthorDate: Mon Nov 23 15:56:23 2020 -0300
CommitDate: Mon Nov 23 15:57:28 2020 -0300

autovacuum-centered workitem fix

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index aa5b97fbac..77c1401bc9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2561,6 +2561,7 @@ deleted:
 	for (i = 0; i < NUM_WORKITEMS; i++)
 	{
 		AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
+		AutoVacuumWorkItem my_workitem;
 
 		if (!workitem->avw_used)
 			continue;
@@ -2569,11 +2570,14 @@ deleted:
 		if (workitem->avw_database != MyDatabaseId)
 			continue;
 
+		my_workitem = *workitem;
+		workitem->avw_used = false;
+
 		/* claim this one, and release lock while performing it */
 		workitem->avw_active = true;
 		LWLockRelease(AutovacuumLock);
 
-		perform_work_item(workitem);
+		perform_work_item(&my_workitem);
 
 		/*
 		 * Check for config changes before acquiring lock for further jobs.


mark/restore failures on unsorted merge joins

2020-11-23 Thread Andrew Gierth
From a report by user "kes" on irc, I constructed this test case:

create table marktst (id integer, val numrange, exclude using gist (val with &&));
insert into marktst select i, numrange(i, i+1, '[)') from generate_series(1,1000) i;
vacuum marktst;
analyze marktst;

select * from (select val from marktst where val && numrange(10,11)) s1 full join (values (1)) v(a) on true;
ERROR:  function ammarkpos is not defined for index marktst_val_excl

for which the query plan is:
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Merge Full Join  (cost=0.14..4.21 rows=2 width=20)
   ->  Result  (cost=0.00..0.01 rows=1 width=4)
   ->  Index Only Scan using marktst_val_excl on marktst  (cost=0.14..4.18 rows=2 width=16)
         Index Cond: (val && '[10,11)'::numrange)
(4 rows)

The problem is that the planner calls ExecSupportsMarkRestore to find
out whether a Materialize node is needed, and that function looks no
further than the Path type of T_Index[Only]Path in order to return true,
even though in this case it's a GiST index which does not support
mark/restore.

(Usually this can't be a problem because the merge join would need
sorted input, thus the index scan would be a btree; but a merge join
that doesn't actually have any sort keys could take unsorted input from
any index type.)

Going forward, this looks like IndexOptInfo needs another am* boolean
field, but that's probably not appropriate for the back branches; maybe
as a workaround, ExecSupportsMarkRestore should just check for btree?

-- 
Andrew (irc:RhodiumToad)




Re: mark/restore failures on unsorted merge joins

2020-11-23 Thread Tom Lane
Andrew Gierth  writes:
> The problem is that the planner calls ExecSupportsMarkRestore to find
> out whether a Materialize node is needed, and that function looks no
> further than the Path type of T_Index[Only]Path in order to return true,
> even though in this case it's a GiST index which does not support
> mark/restore.

> (Usually this can't be a problem because the merge join would need
> sorted input, thus the index scan would be a btree; but a merge join
> that doesn't actually have any sort keys could take unsorted input from
> any index type.)

Sounds like the right analysis.

> Going forward, this looks like IndexOptInfo needs another am* boolean
> field, but that's probably not appropriate for the back branches; maybe
> as a workaround, ExecSupportsMarkRestore should just check for btree?

Uh, why would you not just look to see if the ammarkpos/amrestrpos fields
are non-null?
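
For illustration, a rough sketch of how that could be wired up (field and
flag names here are assumptions, not a committed fix): copy the AM
capability into IndexOptInfo when the planner builds it, then test it in
ExecSupportsMarkRestore():

	/* in get_relation_info(), while filling in IndexOptInfo *info: */
	info->amcanmarkpos = (amroutine->ammarkpos != NULL &&
						  amroutine->amrestrpos != NULL);

	/* in ExecSupportsMarkRestore(): */
	case T_IndexScan:
	case T_IndexOnlyScan:
		/* not every index AM supports mark/restore */
		return castNode(IndexPath, pathnode)->indexinfo->amcanmarkpos;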

regards, tom lane




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-11-23 Thread Anastasia Lubennikova

On 30.09.2020 05:00, David G. Johnston wrote:
On Tue, Sep 15, 2020 at 3:48 PM Alexander Korotkov <aekorot...@gmail.com> wrote:


Hi!

I've skimmed through the thread and checked the patchset. Everything
looks good, except one paragraph, which doesn't look completely clear.

+  
+   This emulates the functionality provided by
+    but sets the created object's
+   type
definition
+   to domain.
+  

As I get it states that CREATE DOMAIN somehow "emulates" CREATE TYPE.
Could you please, rephrase it?  It looks confusing to me yet.


v5 attached, looking at this fresh and with some comments to consider.

I ended up just combining both patches into one.

I did away with the glossary changes altogether, and the invention of 
the new term.  I ended up limiting "type's type" to just domain usage 
but did a couple of additional tweaks that tried to treat domains as 
not being actual types even though, at least in PostgreSQL, they are 
(at least as far as DROP TYPE is concerned - and since I don't have 
any understanding of the SQL Standard's decision to separate out 
create domain and create type I'll just stick to the implementation in 
front of me.)


David J.


Reminder from a CF manager, as this thread was inactive for a while.
Alexander, I see you signed up as a committer for this entry. Are you 
going to continue this work?


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-19, Michael Paquier wrote:

> On Thu, Nov 19, 2020 at 12:13:44PM +0900, Michael Paquier wrote:
> > That still looks useful for debugging, so DEBUG1 sounds fine to me.
> 
> By the way, it strikes me that you could just do nothing as long as
> (log_min_messages > DEBUG1), so you could encapsulate most of the
> logic that plays with the lock tag using that.

Good idea, done.

I also noticed that if we're going to accept a race (which BTW already
exists) we may as well simplify the code about it.

I think the attached is the final form of this.
From cab95d8813cd79adb7b2a1b5501714c2dd87bec2 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 17 Nov 2020 21:09:22 -0300
Subject: [PATCH] Don't hold ProcArrayLock longer than needed in rare cases

While cancelling an autovacuum worker, we hold ProcArrayLock while
formatting a debugging log string.  We can make this shorter by saving
the data we need to produce the message and doing the formatting outside
the locked region.

This isn't terribly critical, as it only occurs pretty rarely: when a
backend runs deadlock detection and it happens to be blocked by an
autovacuum running autovacuum.  Still, there's no need to cause a hiccup
in ProcArrayLock processing, which can be very high-traffic in some
cases.

While at it, rework code so that we only print the string when it is
really going to be used, as suggested by Michael Paquier.

Discussion: https://postgr.es/m/20201118214127.GA3179@alvherre.pgsql
Reviewed-by: Michael Paquier 
---
 src/backend/storage/lmgr/proc.c | 63 -
 1 file changed, 38 insertions(+), 25 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index d1738c65f5..dbfdd7b3c7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1311,40 +1311,58 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 		{
 			PGPROC	   *autovac = GetBlockingAutoVacuumPgproc();
 			uint8		statusFlags;
+			uint8		lockmethod_copy;
+			LOCKTAG		locktag_copy;
 
-			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+			/*
+			 * Grab info we need, then release lock immediately.  Note this
+			 * coding means that there is a tiny chance that the process
+			 * terminates its current transaction and starts a different one
+			 * before we have a change to send the signal; the worst possible
+			 * before we have a chance to send the signal; the worst possible
+			 * that could happen in any case unless we were to do kill() with
+			 * the lock held, which is much more undesirable.
+			 */
+			LWLockAcquire(ProcArrayLock, LW_SHARED);
+			statusFlags = ProcGlobal->statusFlags[autovac->pgxactoff];
+			lockmethod_copy = lock->tag.locktag_lockmethodid;
+			locktag_copy = lock->tag;
+			LWLockRelease(ProcArrayLock);
 
 			/*
 			 * Only do it if the worker is not working to protect against Xid
 			 * wraparound.
 			 */
-			statusFlags = ProcGlobal->statusFlags[autovac->pgxactoff];
 			if ((statusFlags & PROC_IS_AUTOVACUUM) &&
 !(statusFlags & PROC_VACUUM_FOR_WRAPAROUND))
 			{
 int			pid = autovac->pid;
-StringInfoData locktagbuf;
-StringInfoData logbuf;	/* errdetail for server log */
 
-initStringInfo(&locktagbuf);
-initStringInfo(&logbuf);
-DescribeLockTag(&locktagbuf, &lock->tag);
-appendStringInfo(&logbuf,
- _("Process %d waits for %s on %s."),
- MyProcPid,
- GetLockmodeName(lock->tag.locktag_lockmethodid,
- lockmode),
- locktagbuf.data);
+/* report the case, if configured to do so */
+if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
+{
+	StringInfoData locktagbuf;
+	StringInfoData logbuf;	/* errdetail for server log */
 
-/* release lock as quickly as possible */
-LWLockRelease(ProcArrayLock);
+	initStringInfo(&locktagbuf);
+	initStringInfo(&logbuf);
+	DescribeLockTag(&locktagbuf, &locktag_copy);
+	appendStringInfo(&logbuf,
+	 _("Process %d waits for %s on %s."),
+	 MyProcPid,
+	 GetLockmodeName(lockmethod_copy, lockmode),
+	 locktagbuf.data);
+
+	ereport(DEBUG1,
+			(errmsg("sending cancel to blocking autovacuum PID %d",
+	pid),
+			 errdetail_log("%s", logbuf.data)));
+
+	pfree(logbuf.data);
+	pfree(locktagbuf.data);
+}
 
 /* send the autovacuum worker Back to Old Kent Road */
-ereport(DEBUG1,
-		(errmsg("sending cancel to blocking autovacuum PID %d",
-pid),
-		 errdetail_log("%s", logbuf.data)));
-
 if (kill(pid, SIGINT) < 0)
 {
 	/*
@@ -1362,12 +1380,7 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 (errmsg("could not send signal to process %d: %m",
 		pid)));
 }
-
-pfree(logbuf.data);
-pfree(locktagbuf.data);
 			}
-			else
-LWLockRelease(ProcArrayLock);
 
 			/* prevent signal from being sent again more than once */
 			allow_autovacuum_cancel = false;
-- 
2.20.1



Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-23 Thread David Rowley
On Sat, 21 Nov 2020 at 03:26, Peter Eisentraut
 wrote:
>
> I did tests of elog_ereport_attribute_cold_v4.patch on an oldish Mac
> Intel laptop with pgbench scale 1 (default), and then:
>
> pgbench -S -T 60
>
> master:  tps = 8251.883229 (excluding connections establishing)
> patched: tps = 9556.836232 (excluding connections establishing)
>
> pgbench -S -T 60 -M prepared
>
> master:  tps = 14713.821837 (excluding connections establishing)
> patched: tps = 16200.066185 (excluding connections establishing)
>
> So from that this seems like an easy win.

Well, that makes it look pretty good.  If we can get 10-15% on some
machines without making things slower on any other machines, then that
seems like a good win to me.

Thanks for testing that.

David




walsender bug: stuck during shutdown

2020-11-23 Thread Alvaro Herrera
Hello

Chloe Dives reported that sometimes a walsender would become stuck
during shutdown and *not* shut down, thus preventing the postmaster from
completing the shutdown cycle.  This has been observed to cause
servers to remain in that state for several hours.

After a lengthy investigation and thanks to a handy reproducer by Chris
Wilson, we found that the problem is that WalSndDone wants to avoid
shutting down until everything has been sent and acknowledged; but this
test is coded in a way that ignores the possibility that we have never
received anything from the other end.  In that case, both
MyWalSnd->flush and MyWalSnd->write are InvalidRecPtr, so the condition
in WalSndDone to terminate the loop is never fulfilled.  So the
walsender is looping forever and never terminates, blocking shutdown of
the whole instance.

The attached patch fixes the problem by testing for the problematic
condition.

Apparently this problem has existed forever.  Fujii-san almost patched
for it in 5c6d9fc4b2b8 (2014!), but missed it by a zillionth of an inch.

-- 
Álvaro Herrera
From aca27a0af5616bc1da4f08cbbc93b4d3c9380f60 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 23 Nov 2020 17:51:34 -0300
Subject: [PATCH] Don't loop forever in WalSndDone

For a walsender that hasn't sent anything, the "replicatedPtr" as
computed for shutdown is not valid, so the comparison to sentPtr fails.
Make sure to only compare if replicatedPtr is valid.
---
 src/backend/replication/walsender.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..bb86c094a3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2936,7 +2936,8 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
-	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
+	if (WalSndCaughtUp &&
+		(XLogRecPtrIsInvalid(replicatedPtr) || sentPtr == replicatedPtr) &&
 		!pq_is_send_pending())
 	{
 		QueryCompletion qc;
-- 
2.20.1



optimizer/clauses.h needn't include access/htup.h

2020-11-23 Thread Justin Pryzby
It was only needed between these:

commit a8677e3ff6bb8ef78a9ba676faa647bba237b1c4
Author: Peter Eisentraut 
Date:   Fri Apr 13 17:06:28 2018 -0400

Support named and default arguments in CALL

commit f09346a9c6218dd239fdf3a79a729716c0d305bd
Author: Tom Lane 
Date:   Tue Jan 29 15:48:51 2019 -0500

Refactor planner's header files.

I noticed this while looking at "what includes what", and wondered whether some
of these are kind of "modularity violations". 

$ find src/include/ -name '*.h' -print0 |xargs -r0 awk -F'[/"]' 'NF>2 && 
/^#include "/{split(FILENAME,a); print a[3],"-",$2 }' |awk '$1!=$3 && 
!/\.h|statistics|partitioning|bootstrap|tsearch|foreign|jit|regex|lib|common/' 
|sort |uniq -c |sort -nr |awk '$1==1'
  1 utils - rewrite
  1 utils - port
  1 utils - parser
  1 tcop - storage
  1 tcop - executor
  1 tcop - catalog
  1 tcop - access
  1 storage - postmaster
  1 rewrite - catalog
  1 rewrite - access
  1 replication - port
  1 replication - datatype
  1 postmaster - datatype
  1 parser - catalog
  1 nodes - commands
  1 executor - portability
  1 executor - port
  1 commands - datatype
  1 catalog - port
  1 catalog - parser
  1 access - tcop
  1 access - replication
  1 access - postmaster
  1 access - executor

pryzbyj@pryzbyj:~/src/postgres$ find src/backend/ -name '*.c' -print0 |xargs 
-r0 awk -F'[/"]' 'NF>2 && /^#include "/{split(FILENAME,a); print a[3],"-",$2 }' 
|awk '$1!=$3 && 
!/\.h/&&!/common|utils|tsearch|main|foreign|port|regex|bootstrap|jit/' |sort 
|uniq -c |sort -nr |awk '$1==1'
  1 storage - libpq
  1 statistics - postmaster
  1 statistics - commands
  1 rewrite - tcop
  1 rewrite - optimizer
  1 replication - syncrep_scanner.c
  1 replication - rewrite
  1 replication - repl_scanner.c
  1 replication - optimizer
  1 postmaster - mb
  1 partitioning - rewrite
  1 partitioning - commands
  1 nodes - mb
  1 libpq - postmaster
  1 libpq - mb
  1 libpq - commands

-- 
Justin




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-11-23 Thread Tom Lane
Justin Pryzby  writes:
>> This patch has been waiting for input from a committer on the approach I've
>> taken with the patches since March 10, so I'm planning to set to "Ready" - at
>> least ready for senior review.

I took a quick look through this.  This is just MHO, of course:

* I don't think it's okay to change the existing signatures of
pg_ls_logdir() et al.  Even if you can make an argument that it's
not too harmful to add more output columns, replacing pg_stat_file's
isdir output with something of a different name and datatype is most
surely not OK --- there is no possible way that doesn't break existing
user queries.

I think possibly a more acceptable approach is to leave these functions
alone but add documentation explaining how to get the additional info.
You could say things along the lines of "pg_ls_waldir() is the same as
pg_ls_dir_metadata('pg_wal') except for showing fewer columns."

* I'm not very much on board with implementing pg_ls_dir_recurse()
as a SQL function that depends on a WITH RECURSIVE construct.
I do not think that's okay from either performance or security
standpoints.  Surely it can't be hard to build a recursion capability
into the C code?  We could then also debate whether this ought to be
a separate function at all, instead of something you invoke via an
additional boolean flag parameter to pg_ls_dir_metadata().
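
For illustration, a recursive walk inside the C function might look
roughly like this (a sketch only, with emit_one_entry() standing in for
whatever fills the tuplestore; not taken from the patch):

static void
ls_dir_recurse(const char *dirname, int flags)
{
	DIR		   *dir = AllocateDir(dirname);
	struct dirent *de;

	while ((de = ReadDir(dir, dirname)) != NULL)
	{
		char		path[MAXPGPATH];
		struct stat st;

		/* always skip "." and ".." while recursing */
		if (strcmp(de->d_name, ".") == 0 ||
			strcmp(de->d_name, "..") == 0)
			continue;

		snprintf(path, sizeof(path), "%s/%s", dirname, de->d_name);
		if (lstat(path, &st) < 0)
			continue;			/* or report the error */

		emit_one_entry(path, &st, flags);	/* placeholder */

		if (S_ISDIR(st.st_mode))
			ls_dir_recurse(path, flags);
	}
	FreeDir(dir);
}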

* I'm fairly unimpressed with the testing approach, because it doesn't
seem like you're getting very much coverage.  It's hard to do better while
still having the totally-fixed output expected by our regular regression
test framework, but to me that just says not to test these functions in
that framework.  I'd consider ripping all of that out in favor of a
TAP test.

While I didn't read the C code in any detail, a couple of things stood
out to me:

* I noticed that you did s/stat/lstat/.  That's fine on Unix systems,
but it won't have any effect on Windows systems (cf bed90759f),
which means that we'll have to document a platform-specific behavioral
difference.  Do we want to go there?  Maybe this patch needs to wait
on somebody fixing our lack of real lstat() on Windows.  (I assume BTW
that this means the WIN32 code in get_file_type() is unreachable.)

* This bit:

+   /* Skip dot dirs? */
+   if (flags & LS_DIR_SKIP_DOT_DIRS &&
+   (strcmp(de->d_name, ".") == 0 ||
+strcmp(de->d_name, "..") == 0))
+   continue;
+
+   /* Skip hidden files? */
+   if (flags & LS_DIR_SKIP_HIDDEN &&
+   de->d_name[0] == '.')
continue;

doesn't seem to have thought very carefully about the interaction
of those two flags, ie it seems like LS_DIR_SKIP_HIDDEN effectively
implies LS_DIR_SKIP_DOT_DIRS.  Do we want that?
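
For illustration, one way to keep the two flags independent would be
something like this (a sketch, not from the patch):

		bool	is_dot_dir = (strcmp(de->d_name, ".") == 0 ||
							  strcmp(de->d_name, "..") == 0);

		/* "." and ".." are governed only by LS_DIR_SKIP_DOT_DIRS */
		if (is_dot_dir)
		{
			if (flags & LS_DIR_SKIP_DOT_DIRS)
				continue;
		}
		/* other dot-files are governed only by LS_DIR_SKIP_HIDDEN */
		else if ((flags & LS_DIR_SKIP_HIDDEN) && de->d_name[0] == '.')
			continue;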

regards, tom lane




Re: Online verification of checksums

2020-11-23 Thread Anastasia Lubennikova

On 23.11.2020 18:35, Stephen Frost wrote:

Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:

On 21.11.2020 04:30, Michael Paquier wrote:

The only method I can think as being really
reliable is based on two facts:
- Do a check only on pd_checksums, as that validates the full contents
of the page.
- When doing a retry, make sure that there is no concurrent I/O
activity in the shared buffers.  This requires an API we don't have
yet.

It seems reasonable to me to rely on checksums only.

As for retry, I think that API for concurrent I/O will be complicated.
Instead, we can introduce a function to read the page directly from shared
buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
solution to me. Do you see any possible problems with it?

We might end up reading pages back in that have been evicted, for one
thing, which doesn't seem great,
TBH, I think it is highly unlikely that the page that was just updated 
will be evicted.

and this also seems likely to be
awkward for cases which aren't using the replication protocol, unless
every process maintains a connection to PG the entire time, which also
doesn't seem great.
Have I missed something? Now pg_basebackup has only one process + one 
child process for streaming. Anyway, I totally agree with your argument. 
The need to maintain connection(s) to PG is the most unpleasant part of 
the proposed approach.



Also- what is the point of reading the page from shared buffers
anyway..?
Well... Reading a page from shared buffers is a reliable way to get a 
correct page from postgres under any concurrent load. So it just seems 
natural to me.

All we need to do is prove that the page will be rewritten
during WAL replay.
Yes, and this is a tricky part. Until you explained it in your 
latest message, I wasn't sure how we can distinguish a concurrent update 
from a page header corruption. Now I agree that if the page LSN is updated 
and increased between rereads, it is safe enough to conclude that we have 
some concurrent load.



  If we can prove that, we don't actually care what
the contents of the page are.  We certainly can't calculate the
checksum on a page we plucked out of shared buffers since we only
calculate the checksum when we go to write the page out.
Good point. I was thinking that we could recalculate the checksum, or even 
save the page without it, as we have checked the LSN and know for sure that it 
will be rewritten by WAL replay.



To sum up, I agree with your proposal to reread the page and rely on 
ascending LSNs. Can you submit a patch?

You can write it on top of the latest attachment in this thread:
v8-master-0001-Fix-page-verifications-in-base-backups.patch 
 
from this message 
https://www.postgresql.org/message-id/20201030023028.gc1...@paquier.xyz


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Nov-19, Michael Paquier wrote:
>> By the way, it strikes me that you could just do nothing as long as
>> (log_min_messages > DEBUG1), so you could encapsulate most of the
>> logic that plays with the lock tag using that.

> Good idea, done.

I'm less sure that that's a good idea.  It embeds knowledge here that
should not exist outside elog.c; moreover, I'm not entirely sure that
it's even correct, given the nonlinear ranking of log_min_messages.

Maybe it'd be a good idea to have elog.c expose a new function
along the lines of "bool message_level_is_interesting(int elevel)"
to support this and similar future optimizations in a less fragile way.
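
For concreteness, such a helper might check something roughly like this
(just a sketch of the idea; an actual patch would have to mirror
errstart()'s exact rules):

bool
message_level_is_interesting(int elevel)
{
	/* anything ERROR or above is always of interest */
	if (elevel >= ERROR)
		return true;
	/* otherwise, would it reach the server log or the client? */
	if (is_log_level_output(elevel, log_min_messages))
		return true;
	if (whereToSendOutput == DestRemote && elevel >= client_min_messages)
		return true;
	return false;
}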

regards, tom lane




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Alvaro Herrera  writes:
> > On 2020-Nov-19, Michael Paquier wrote:
> >> By the way, it strikes me that you could just do nothing as long as
> >> (log_min_messages > DEBUG1), so you could encapsulate most of the
> >> logic that plays with the lock tag using that.
> 
> > Good idea, done.
> 
> I'm less sure that that's a good idea.  It embeds knowledge here that
> should not exist outside elog.c; moreover, I'm not entirely sure that
> it's even correct, given the nonlinear ranking of log_min_messages.

Well, we already do this in a number of places.  But I can get behind
this:

> Maybe it'd be a good idea to have elog.c expose a new function
> along the lines of "bool message_level_is_interesting(int elevel)"
> to support this and similar future optimizations in a less fragile way.




Re: deferred primary key and logical replication

2020-11-23 Thread Anastasia Lubennikova

On 27.10.2020 13:46, Amit Kapila wrote:

On Sun, Oct 25, 2020 at 9:39 PM Euler Taveira
 wrote:

On Mon, 5 Oct 2020 at 08:34, Amit Kapila  wrote:

On Mon, May 11, 2020 at 2:41 AM Euler Taveira
 wrote:

Hi,

While looking at an old wal2json issue, I stumbled on a scenario that a table
with a deferred primary key is not updatable in logical replication. AFAICS it
has been like that since the beginning of logical decoding and seems to be an
oversight while designing logical decoding.


I am not sure if it is an oversight because we document that the index
must be non-deferrable, see "USING INDEX records the old values of the
columns covered by the named index, which must be unique, not partial,
not deferrable, and include only columns marked NOT NULL." in docs
[1].


Inspecting this patch again, I forgot to consider that RelationGetIndexList()
is called by other backend modules. Since logical decoding deals with finished
transactions, it is ok to use a deferrable primary key.


But starting PG-14, we do support logical decoding of in-progress
transactions as well.



Commitfest entry status update.
As far as I can see, this patch needs some further work, so I have moved it to 
"Waiting on author".

Euler, are you going to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: PATCH: Batch/pipelining support for libpq

2020-11-23 Thread Daniel Verite
 Hi,

Here's a new version with the pgbench support included.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 9ce32fb39b..2a94f8f6b9 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3061,6 +3061,30 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_BATCH_END
+  
+   
+The PGresult represents the end of a 
batch.
+This status occurs only when batch mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_BATCH_ABORTED
+  
+   
+The PGresult represents a batch that's
+received an error from the server.  
PQgetResult
+must be called repeatedly, and it will return this status code,
+until the end of the current batch, at which point it will return
+PGRES_BATCH_END and normal processing can 
resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4819,6 +4843,482 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch Mode
+
+  
+   libpq
+   batch mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   libpq batch mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the batch mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While batch mode provides a significant performance boost, writing
+   clients using the batch mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Using Batch Mode
+
+   
+To issue batches the application must switch a connection into batch mode.
+Enter batch mode with 
+or test whether batch mode is active with
+.
+In batch mode, only asynchronous 
operations
+are permitted, and COPY is not recommended as it
+may trigger failure in batch processing.  Using any synchronous
+command execution functions such as PQfn,
+PQexec or one of its sibling functions are error
+conditions.
+   
+
+   
+
+ It is best to use batch mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill its output buffer and the server's receive
+buffer before switching to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+
+
+ Batch mode consumes more memory when send/receive is not done as required,
+ even in non-blocking mode.
+
+   
+
+   
+Issuing Queries
+
+
+ After entering batch mode the application dispatches requests using
+ normal asynchronous libpq functions such as
+ PQsendQueryParams, 
PQsendPrepare,
+ PQsendQueryPrepared, 
PQsendDescribePortal,
+ PQsendDescribePrepared.
+ The asynchronous requests are followed by a
+ 
+ call to mark the end of the batch. The client needs not
+ call PQgetResult immediately after
+ dispatching each operation.
+ Result processing
+ is handled separately.
+
+
+
+ The server executes statements, and returns results, in the order the
+ client sends them.  The server may begin executing the batch before all
+ commands in the batch are queued and the end of batch command is sent.
+ If any statement encounters an error the server aborts the current
+ transaction and skips processing the rest of the batch.
+ Query processing resumes after the end of the failed batch.
+
+
+
+ It's fine for one operation to depend on the results of a
+ prior one. One query may define a table that the next query in the same
+ batch uses; similarly, an application may create a named prepared 
statement
+ then execute it with later statements in the same batch.
+
+   
+
+   
+Processing Results
+
+
+ To process the result of one batch query, the application calls
+ PQgetResult repeatedly and handles each result
+ until PQgetResult returns null.
+ The result from the next batch query may then be retrieved using
+ PQgetResult again and the cycle repeated.
+ The application handles individual statement results as normal.
+ When the results of all the queries in the batch have been
+ returned, PQgetResult returns a result
+ containing the status value PGRES_BA

Re: optimizer/clauses.h needn't include access/htup.h

2020-11-23 Thread Tom Lane
Justin Pryzby  writes:
> It was only needed between these:
> commit a8677e3ff6bb8ef78a9ba676faa647bba237b1c4
> commit f09346a9c6218dd239fdf3a79a729716c0d305bd

Hm, you're right.  Removed.

> I noticed while looking at "what includes what" and wondered if some of these
> are kind of "modularity violations". 

Yeah.  I've ranted before that we ought to have some clearer idea of
module layering within the backend, and avoid cross-header inclusions
that would break the layering.  This particular case didn't really
do so, I suppose, since htup.h would surely be on a lower level than
the optimizer.  But it still seems nicer to not have that inclusion.

Anyway, if you're feeling motivated to explore a more wide-ranging
refactoring, by all means have a go at it.

regards, tom lane




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> Well, we already do this in a number of places.  But I can get behind
> this:

>> Maybe it'd be a good idea to have elog.c expose a new function
>> along the lines of "bool message_level_is_interesting(int elevel)"
>> to support this and similar future optimizations in a less fragile way.

I'll see about a patch for that.

regards, tom lane




Re: New default role- 'pg_read_all_data'

2020-11-23 Thread Anastasia Lubennikova

On 29.10.2020 17:19, Stephen Frost wrote:

Greetings,

* Georgios Kokolatos (gkokola...@protonmail.com) wrote:

this patch is in "Ready for committer" state and the Cfbot is happy.

Glad that's still the case. :)


Is there any committer that is available for taking a look at it?

If there aren't any objections or further comments, I'll take another
look through it and will commit it during the upcoming CF.

Thanks!

Stephen


CFM reminder. Just in case you forgot about this thread)
The commitfest is heading to the end. And there was plenty of time for 
anyone to object.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Well, we already do this in a number of places.  But I can get behind
> > this:
> 
> >> Maybe it'd be a good idea to have elog.c expose a new function
> >> along the lines of "bool message_level_is_interesting(int elevel)"
> >> to support this and similar future optimizations in a less fragile way.
> 
> I'll see about a patch for that.

Looking at that now ...




Re: PATCH: Batch/pipelining support for libpq

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Daniel Verite wrote:

>  Hi,
> 
> Here's a new version with the pgbench support included.

Thanks, incorporated into my local copy.





Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> On 23.11.2020 18:35, Stephen Frost wrote:
> >* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> >>On 21.11.2020 04:30, Michael Paquier wrote:
> >>>The only method I can think as being really
> >>>reliable is based on two facts:
> >>>- Do a check only on pd_checksums, as that validates the full contents
> >>>of the page.
> >>>- When doing a retry, make sure that there is no concurrent I/O
> >>>activity in the shared buffers.  This requires an API we don't have
> >>>yet.
> >>It seems reasonable to me to rely on checksums only.
> >>
> >>As for retry, I think that API for concurrent I/O will be complicated.
> >>Instead, we can introduce a function to read the page directly from shared
> >>buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
> >>solution to me. Do you see any possible problems with it?
> >We might end up reading pages back in that have been evicted, for one
> >thing, which doesn't seem great,
> TBH, I think it is highly unlikely that the page that was just updated will
> be evicted.

Is it though..?  Consider that the page which was being written out was
being done so specifically to free a page for use by another backend-
while perhaps that doesn't happen all the time, it certainly happens
enough on very busy systems.

> >and this also seems likely to be
> >awkward for cases which aren't using the replication protocol, unless
> >every process maintains a connection to PG the entire time, which also
> >doesn't seem great.
> Have I missed something? Now pg_basebackup has only one process + one child
> process for streaming. Anyway, I totally agree with your argument. The need
> to maintain connection(s) to PG is the most unpleasant part of the proposed
> approach.

I was thinking beyond pg_basebackup, yes; apologies for that not being
clear but that's what I was meaning when I said "aren't using the
replication protocol".

> >Also- what is the point of reading the page from shared buffers
> >anyway..?
> Well... Reading a page from shared buffers is a reliable way to get a
> correct page from postgres under any concurrent load. So it just seems
> natural to me.

Yes, that's true, but if a dirty page was just written out by a backend
in order to be able to evict it, so that the backend can then pull in a
new page, then having pg_basebackup pull that page back in really isn't
great.

> >All we need to do is prove that the page will be rewritten
> >during WAL replay.
> Yes, and this is a tricky part. Until you explained it in your latest
> message, I wasn't sure how we can distinguish a concurrent update from a page
> header corruption. Now I agree that if the page LSN is updated and increased
> between rereads, it is safe enough to conclude that we have some concurrent
> load.

Even in this case, it's almost free to compare the LSN to the starting
backup LSN, and to the current LSN position, and make sure it's
somewhere between the two.  While that doesn't entirely eliminate the
possibility that the page happened to get corrupted *and* return a
different result on subsequent reads *and* that it was corrupted in such
a way that the LSN ended up falling between the starting backup LSN and
the current LSN, it's certainly reducing the chances of a false negative
a fair bit.

A concern here, however, is- can we be 100% sure that we'll get a
different result from the two subsequent reads?  For my part, at least,
I've been doubtful that it's possible but it'd be nice to hear it from
someone who has really looked at the kernel side.  To try and clarify,
let me illustrate:

pg_basebackup (the backend that's sending data to it anyway) starts
reading an 8K page, but gets interrupted halfway through, meaning that
it's read 4K and is now paused.

PG writes that same 8K page, and is able to successfully write the
entire block.

pg_basebackup then wakes up, reads the second half, computes a checksum
and gets a checksum failure.

At this point the question is: if pg_basebackup loops, seeks and
re-reads the same 8K block again, is it possible that pg_basebackup will
get the "old" starting 4K and the "new" ending 4K again?  I'd like to
think that the answer is 'no' and that the kernel will guarantee that if
we managed to read a "new" ending 4K block then the following read of
the full 8K block would be guaranteed to give us the "new" starting 4K.
If that is truly guaranteed then we could be much more confident that
the idea here of simply checking for an ascending LSN, which falls
between the starting LSN of the backup and the current LSN (or perhaps
the final LSN for the backup) would be sufficient to detect this case.
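
To make that concrete, the re-read check being discussed would amount to
something like this rough sketch (reread_block(), start_lsn and stop_lsn
are placeholders, not from any posted patch):

	XLogRecPtr	first_lsn = PageGetLSN((Page) buf);
	XLogRecPtr	second_lsn;

	if (!reread_block(fd, blkno, buf))	/* hypothetical re-read helper */
		return false;
	second_lsn = PageGetLSN((Page) buf);

	/* ascending LSN inside the backup's LSN window => concurrent write */
	if (second_lsn > first_lsn &&
		second_lsn >= start_lsn &&		/* LSN at start of backup */
		second_lsn <= stop_lsn)			/* current / end-of-backup LSN */
		return true;					/* not corruption; WAL replay rewrites it */
	return false;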

I would also think that, if we can trust that, then there really isn't
any need for the delay in performing the re-read, which I have to admit
that I don't particularly care for.

> >  If we can prove that, we don't actually care what
> >the contents of the page are.  We certainly can't calculate t

Re: New default role- 'pg_read_all_data'

2020-11-23 Thread Stephen Frost
Greetings,

* Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> On 29.10.2020 17:19, Stephen Frost wrote:
> >* Georgios Kokolatos (gkokola...@protonmail.com) wrote:
> >>this patch is in "Ready for committer" state and the Cfbot is happy.
> >Glad that's still the case. :)
> >
> >>Is there any committer that is available for taking a look at it?
> >If there aren't any objections or further comments, I'll take another
> >look through it and will commit it during the upcoming CF.
> >
> >Thanks!
> >
> >Stephen
> 
> CFM reminder. Just in case you forgot about this thread)
> The commitfest is heading to the end. And there was plenty of time for
> anyone to object.

Thanks, I've not forgotten, but it's a bit complicated given that I've
another patch in progress to rename default roles to be predefined
roles which conflicts with this one.  Hopefully we'll have a few
comments on that and I can get it committed and this one updated with
the new naming.  I'd rather not commit this one and then immediately
commit changes over top of it.

Thanks again!

Stephen


signature.asc
Description: PGP signature


Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Here's a draft patch.

regards, tom lane

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..9cd0b7c11b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5344,7 +5344,7 @@ static void
 ShowTransactionState(const char *str)
 {
 	/* skip work if message will definitely not be printed */
-	if (log_min_messages <= DEBUG5 || client_min_messages <= DEBUG5)
+	if (message_level_is_interesting(DEBUG5))
 		ShowTransactionStateRec(str, CurrentTransactionState);
 }
 
@@ -5371,7 +5371,6 @@ ShowTransactionStateRec(const char *str, TransactionState s)
 	if (s->parent)
 		ShowTransactionStateRec(str, s->parent);
 
-	/* use ereport to suppress computation if msg will not be printed */
 	ereport(DEBUG5,
 			(errmsg_internal("%s(%d) name: %s; blockState: %s; state: %s, xid/subid/cid: %u/%u/%u%s%s",
 			 str, s->nestingLevel,
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..32a3099c1f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -105,7 +105,7 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
 	 * tracing of the cause (note the elog context mechanism will tell us
 	 * something about the XLOG record that generated the reference).
 	 */
-	if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
+	if (message_level_is_interesting(DEBUG1))
 		report_invalid_page(DEBUG1, node, forkno, blkno, present);
 
 	if (invalid_page_tab == NULL)
@@ -159,7 +159,7 @@ forget_invalid_pages(RelFileNode node, ForkNumber forkno, BlockNumber minblkno)
 			hentry->key.forkno == forkno &&
 			hentry->key.blkno >= minblkno)
 		{
-			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
+			if (message_level_is_interesting(DEBUG2))
 			{
 char	   *path = relpathperm(hentry->key.node, forkno);
 
@@ -192,7 +192,7 @@ forget_invalid_pages_db(Oid dbid)
 	{
 		if (hentry->key.node.dbNode == dbid)
 		{
-			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
+			if (message_level_is_interesting(DEBUG2))
 			{
 char	   *path = relpathperm(hentry->key.node, hentry->key.forkno);
 
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index b0d037600e..245c2f4fc8 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1146,15 +1146,9 @@ reportDependentObjects(const ObjectAddresses *targetObjects,
 	 * If no error is to be thrown, and the msglevel is too low to be shown to
 	 * either client or server log, there's no need to do any of the rest of
 	 * the work.
-	 *
-	 * Note: this code doesn't know all there is to be known about elog
-	 * levels, but it works for NOTICE and DEBUG2, which are the only values
-	 * msglevel can currently have.  We also assume we are running in a normal
-	 * operating environment.
 	 */
 	if (behavior == DROP_CASCADE &&
-		msglevel < client_min_messages &&
-		(msglevel < log_min_messages || log_min_messages == LOG))
+		!message_level_is_interesting(msglevel))
 		return;
 
 	/*
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index babee386c4..87c3ea450e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1215,7 +1215,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *sendtime;
 		char	   *receipttime;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..2eb19ad293 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1900,7 +1900,7 @@ ProcessStandbyReplyMessage(void)
 	replyTime = pq_getmsgint64(&reply_message);
 	replyRequested = pq_getmsgbyte(&reply_message);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *replyTimeStr;
 
@@ -2082,7 +2082,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
 	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *replyTimeStr;
 
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 1ba47c194b..04c2245338 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -600,6 +600,42 @@ errfinish(const char *filename, int lineno, const char *funcname)
 	CHECK_FOR_INTERRUPTS();
 }
 
+/*
+ * message_level_is_interesting --- would ereport/elog do anything?
+ *
+ * Returns true if ereport/elog with this elevel will not be a no-op.
+ * This is useful to short-circuit any expensive preparatory work that
+ * might be needed for a logging messag

Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Here's a draft patch.

Here's another of my own.  Outside of elog.c it seems identical.

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 03c553e7ea..a4ab9090f9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -5344,7 +5344,7 @@ static void
 ShowTransactionState(const char *str)
 {
 	/* skip work if message will definitely not be printed */
-	if (log_min_messages <= DEBUG5 || client_min_messages <= DEBUG5)
+	if (message_level_is_interesting(DEBUG5))
 		ShowTransactionStateRec(str, CurrentTransactionState);
 }
 
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 7e915bcadf..32a3099c1f 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -105,7 +105,7 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
 	 * tracing of the cause (note the elog context mechanism will tell us
 	 * something about the XLOG record that generated the reference).
 	 */
-	if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
+	if (message_level_is_interesting(DEBUG1))
 		report_invalid_page(DEBUG1, node, forkno, blkno, present);
 
 	if (invalid_page_tab == NULL)
@@ -159,7 +159,7 @@ forget_invalid_pages(RelFileNode node, ForkNumber forkno, BlockNumber minblkno)
 			hentry->key.forkno == forkno &&
 			hentry->key.blkno >= minblkno)
 		{
-			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
+			if (message_level_is_interesting(DEBUG2))
 			{
 char	   *path = relpathperm(hentry->key.node, forkno);
 
@@ -192,7 +192,7 @@ forget_invalid_pages_db(Oid dbid)
 	{
 		if (hentry->key.node.dbNode == dbid)
 		{
-			if (log_min_messages <= DEBUG2 || client_min_messages <= DEBUG2)
+			if (message_level_is_interesting(DEBUG2))
 			{
 char	   *path = relpathperm(hentry->key.node, hentry->key.forkno);
 
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index b0d037600e..0ad647ed9c 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1146,15 +1146,8 @@ reportDependentObjects(const ObjectAddresses *targetObjects,
 	 * If no error is to be thrown, and the msglevel is too low to be shown to
 	 * either client or server log, there's no need to do any of the rest of
 	 * the work.
-	 *
-	 * Note: this code doesn't know all there is to be known about elog
-	 * levels, but it works for NOTICE and DEBUG2, which are the only values
-	 * msglevel can currently have.  We also assume we are running in a normal
-	 * operating environment.
 	 */
-	if (behavior == DROP_CASCADE &&
-		msglevel < client_min_messages &&
-		(msglevel < log_min_messages || log_min_messages == LOG))
+	if (behavior == DROP_CASCADE && !message_level_is_interesting(msglevel))
 		return;
 
 	/*
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index babee386c4..87c3ea450e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1215,7 +1215,7 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)
 	walrcv->lastMsgReceiptTime = lastMsgReceiptTime;
 	SpinLockRelease(&walrcv->mutex);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *sendtime;
 		char	   *receipttime;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5d1b1a16be..2eb19ad293 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1900,7 +1900,7 @@ ProcessStandbyReplyMessage(void)
 	replyTime = pq_getmsgint64(&reply_message);
 	replyRequested = pq_getmsgbyte(&reply_message);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *replyTimeStr;
 
@@ -2082,7 +2082,7 @@ ProcessStandbyHSFeedbackMessage(void)
 	feedbackCatalogXmin = pq_getmsgint(&reply_message, 4);
 	feedbackCatalogEpoch = pq_getmsgint(&reply_message, 4);
 
-	if (log_min_messages <= DEBUG2)
+	if (message_level_is_interesting(DEBUG2))
 	{
 		char	   *replyTimeStr;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 40eac704dc..eb2309ed76 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -1341,22 +1341,25 @@ ProcSleep(LOCALLOCK *locallock, LockMethod lockMethodTable)
 StringInfoData logbuf;	/* errdetail for server log */
 
 /* report the case, if configured to do so */
-initStringInfo(&locktagbuf);
-initStringInfo(&logbuf);
-DescribeLockTag(&locktagbuf, &locktag_copy);
-appendStringInfo(&logbuf,
- _("Process %d waits for %s on %s."),
- MyProcPid,
- GetLockmodeName(lockmethod_copy, lockmode),
- locktagbuf.data);
+if (message_level_is_interesting(DEBUG1))
+{
+	initStringInfo(&locktagbuf);
+	initStringInfo(&logbuf);
+	DescribeLockTag(&locktagbuf, &

Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Alvaro Herrera wrote:

> On 2020-Nov-23, Tom Lane wrote:
> 
> > Here's a draft patch.
> 
> Here's another of my own.  Outside of elog.c it seems identical.

Your version has the advantage that errstart() doesn't get a new
function call.  I'm +1 for going with that ... we could avoid the
duplicate code with some additional contortions but this changes so
rarely that it's probably not worth the trouble.





Re: optimizer/clauses.h needn't include access/htup.h

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Anyway, if you're feeling motivated to explore a more wide-ranging
> refactoring, by all means have a go at it.

I was contemplating commands/trigger.c this morning (after Heikki split
copy.c) thinking about the three pieces embedded in there -- one
catalog/pg_trigger.c, one in executor (execTrigger.c?) and what seems a
very small piece to remain where it is.

Just thinking out loud ...




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Nov-23, Tom Lane wrote:
>> Here's a draft patch.

> Here's another of my own.  Outside of elog.c it seems identical.

Ah, I see I didn't cover the case in ProcSleep that you were originally on
about ... I'd just looked for existing references to log_min_messages
and client_min_messages.

I think it's important to have the explicit check for elevel >= ERROR.
I'm not too fussed about whether we invent is_log_level_output_client,
although that name doesn't seem well-chosen compared to
is_log_level_output.

Shall I press forward with this, or do you want to?

regards, tom lane




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> Your version has the advantage that errstart() doesn't get a new
> function call.  I'm +1 for going with that ... we could avoid the
> duplicate code with some additional contortions but this changes so
> rarely that it's probably not worth the trouble.

I was considering adding that factorization, but marking the function
inline to avoid adding overhead.  Most of elog.c predates our use of
inline, so it wasn't considered when this code was written.

regards, tom lane




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-11-23 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Justin Pryzby  writes:
> >> This patch has been waiting for input from a committer on the approach I've
> >> taken with the patches since March 10, so I'm planning to set to "Ready" - 
> >> at
> >> least ready for senior review.
> 
> I took a quick look through this.  This is just MHO, of course:
> 
> * I don't think it's okay to change the existing signatures of
> pg_ls_logdir() et al.  Even if you can make an argument that it's
> not too harmful to add more output columns, replacing pg_stat_file's
> isdir output with something of a different name and datatype is most
> surely not OK --- there is no possible way that doesn't break existing
> user queries.

I disagree that we need to stress over this- we pretty routinely change
the signature of various catalogs and functions and anyone using these
is already of the understanding that we are free to make such changes
between major versions.  If anything, we should be strongly discouraging
the notion of "don't break user queries" when it comes to administrative
and monitoring functions like these because, otherwise, we end up with
things like the mess that is pg_start/stop_backup() (and just contrast
that to what we did to recovery.conf when thinking about "well, do we
need to 'deprecate' or keep around the old stuff so we don't break
things for users who use these functions?" or the changes made in v10,
neither of which have produced much in the way of complaints).

Let's focus on working towards cleaner APIs and functions, accepting a
break when it makes sense to, which seems to be the case with this patch
(though I agree about using a TAP test suite and about performing the
directory recursion in C instead), and not pull forward cruft that we
then are telling ourselves we have to maintain compatibility of
indefinitely and at the expense of sensible APIs.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> Ah, I see I didn't cover the case in ProcSleep that you were originally on
> about ... I'd just looked for existing references to log_min_messages
> and client_min_messages.

Yeah, it seemed bad form to add that when you had just argued against it
:-)

> I think it's important to have the explicit check for elevel >= ERROR.
> I'm not too fussed about whether we invent is_log_level_output_client,
> although that name doesn't seem well-chosen compared to
> is_log_level_output.

Just replacing "log" with "client" in that seemed strictly worse, and I
didn't (don't) have any other ideas.

> Shall I press forward with this, or do you want to?

Please feel free to go ahead, including the change to ProcSleep.




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-23 Thread David Rowley
On Tue, 24 Nov 2020 at 09:36, David Rowley  wrote:
> Well, that makes it look pretty good.  If we can get 10-15% on some
> machines without making things slower on any other machines, then that
> seems like a good win to me.

Pushed.

Thank you both for reviewing this.

David




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-11-23 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> I took a quick look through this.  This is just MHO, of course:
>> 
>> * I don't think it's okay to change the existing signatures of
>> pg_ls_logdir() et al.

> I disagree that we need to stress over this- we pretty routinely change
> the signature of various catalogs and functions and anyone using these
> is already of the understanding that we are free to make such changes
> between major versions.

Well, like I said, just MHO.  Anybody else want to weigh in?

I'm mostly concerned about removing the isdir output of pg_stat_file().
Maybe we could compromise to the extent of keeping that, allowing it
to be partially duplicative of a file-type-code output column.

regards, tom lane




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Nov-23, Tom Lane wrote:
>> I'm not too fussed about whether we invent is_log_level_output_client,
>> although that name doesn't seem well-chosen compared to
>> is_log_level_output.

> Just replacing "log" for "client" in that seemed strictly worse, and I
> didn't (don't) have any other ideas.

I never cared that much for "is_log_level_output" either.  Thinking
about renaming it to "should_output_to_log()", and then the new function
would be "should_output_to_client()".

>> Shall I press forward with this, or do you want to?

> Please feel free to go ahead, including the change to ProcSleep.

Will do.

regards, tom lane




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-23 Thread Greg Nancarrow
On Tue, Nov 24, 2020 at 10:06 AM David Rowley  wrote:
>
> On Tue, 24 Nov 2020 at 09:36, David Rowley  wrote:
> > Well, that makes it look pretty good.  If we can get 10-15% on some
> > machines without making things slower on any other machines, then that
> > seems like a good win to me.
>
> Pushed.
>
> Thank you both for reviewing this.
>
> David
>
>

Hmmm, unfortunately this seems to break my build ...

make[1]: Entering directory `/space2/pg13/postgres/src'
make -C common all
make[2]: Entering directory `/space2/pg13/postgres/src/common'
gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -g -O0 -DFRONTEND -I.
-I../../src/common -I../../src/include  -D_GNU_SOURCE  -DVAL_CC="\"gcc
-std=gnu99\"" -DVAL_CPPFLAGS="\"-D_GNU_SOURCE\"" -DVAL_CFLAGS="\"-Wall
-Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Werror=vla -Wendif-labels -Wmissing-format-attribute
-Wformat-security -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -g -O0\"" -DVAL_CFLAGS_SL="\"-fPIC\""
-DVAL_LDFLAGS="\"-Wl,--as-needed
-Wl,-rpath,'/usr/local/pg14/lib',--enable-new-dtags\""
-DVAL_LDFLAGS_EX="\"\"" -DVAL_LDFLAGS_SL="\"\""
-DVAL_LIBS="\"-lpgcommon -lpgport -lpthread -lz -lreadline -lrt -ldl
-lm \""  -c -o archive.o archive.c
In file included from ../../src/include/postgres_fe.h:25:0,
 from archive.c:19:
../../src/include/c.h:198:49: error: missing binary operator before token "("
 #if defined(__has_attribute) && __has_attribute (cold)
 ^
../../src/include/c.h:204:49: error: missing binary operator before token "("
 #if defined(__has_attribute) && __has_attribute (hot)
 ^
make[2]: *** [archive.o] Error 1
make[2]: Leaving directory `/space2/pg13/postgres/src/common'
make[1]: *** [all-common-recurse] Error 2
make[1]: Leaving directory `/space2/pg13/postgres/src'
make: *** [world-src-recurse] Error 2

[gregn@localhost postgres]$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


I think your commit needs to be fixed based on the following documentation:

https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html#g_t_005f_005fhas_005fattribute

"The first ‘#if’ test succeeds only when the operator is supported by
the version of GCC (or another compiler) being used. Only when that
test succeeds is it valid to use __has_attribute as a preprocessor
operator. As a result, combining the two tests into a single
expression as shown below would only be valid with a compiler that
supports the operator but not with others that don’t. "

(Thanks to my colleague Peter Smith for finding the doc explanation)
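
For reference, the nested form the docs describe would look roughly like
this (a sketch; the fix actually committed may differ in detail):

#ifdef __has_attribute
#if __has_attribute(cold)
#define pg_attribute_cold __attribute__((cold))
#endif
#if __has_attribute(hot)
#define pg_attribute_hot __attribute__((hot))
#endif
#endif

/* fall back to no-op definitions when the attributes aren't available */
#ifndef pg_attribute_cold
#define pg_attribute_cold
#endif
#ifndef pg_attribute_hot
#define pg_attribute_hot
#endif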

Regards,
Greg Nancarrow
Fujitsu Australia




RE: Terminate the idle sessions

2020-11-23 Thread kuroda.hay...@fujitsu.com
No one has any comments, the patch tester says OK, and I think this works well.
I changed the status to "Ready for Committer."

Hayato Kuroda
FUJITSU LIMITED

-Original Message-
From: kuroda.hay...@fujitsu.com  
Sent: Friday, November 20, 2020 11:05 AM
To: 'japin' 
Cc: David G. Johnston ; Kyotaro Horiguchi 
; Thomas Munro ; 
bharath.rupireddyforpostg...@gmail.com; pgsql-hackers@lists.postgresql.org
Subject: RE: Terminate the idle sessions

Dear Li,

> Thanks! Add the comment for idle-session timeout.

I confirmed it. OK.


I don't have any comments anymore. If no one else has any,
I will change the status in a few days.
Any other comments or suggestions for him?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-23 Thread David Rowley
On Tue, 24 Nov 2020 at 12:50, Greg Nancarrow  wrote:
> Hmmm, unfortunately this seems to break my build ...

> I think your commit needs to be fixed based on the following documentation:
>
> https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html#g_t_005f_005fhas_005fattribute

Agreed. I tested on https://godbolt.org/ with a GCC version < 5 and
updating to what's mentioned in the GCC manual works fine.  What I had
did not.

Thanks for the report.

I pushed a fix.

David




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-23 Thread Andres Freund
Hi,

On 2020-11-23 12:30:05 -0300, Alvaro Herrera wrote:
> ProcSleep:  no bug here.
> The flags are checked to see if we should kill() the process (used when
> autovac blocks some other process).  The lock here appears to be
> inconsequential, since we release it before we do kill(); so strictly
> speaking, there is still a race condition where the autovac process
> could reset the flag after we read it and before we get a chance to
> signal it.  The lock level autovac uses to set the flag is not relevant
> either.

Yea. Even before the recent changes building the lock message under the
lock didn't make sense. Now that the flags are mirrored in pgproc, we
probably should just make this use READ_ONCE(autovac->statusFlags),
without any other use of ProcArrayLock.  As you say the race condition
is between the flag test, the kill, and the signal being processed,
which wasn't previously protected either.


> PROC_IN_LOGICAL_DECODING:
> Oddly enough, I think the reset of PROC_IN_LOGICAL_DECODING in
> ReplicationSlotRelease might be the most problematic one of the lot.
> That's because a proc's xmin that had been ignored all along by
> ComputeXidHorizons, will now be included in the computation.  Adding
> asserts that proc->xmin and proc->xid are InvalidXid by the time we
> reset the flag, I got hits in pg_basebackup, test_decoding and
> subscription tests.  I think it's OK for ComputeXidHorizons (since it
> just means that a vacuum that reads a later will remove less rows.)  But
> in GetSnapshotData it is just not correct to have the Xmin go backwards.

I don't think there's a problem. PROC_IN_LOGICAL_DECODING can only be
set when outside a transaction block, i.e. walsender. In which case
there shouldn't be an xid/xmin, I think? Or did you gate your assert on
PROC_IN_LOGICAL_DECODING being set?


> GetCurrentVirtualXIDs, ComputeXidHorizons, GetSnapshotData:
> 
> In these cases, what we need is that the code computes some xmin (or
> equivalent computation) based on a set of transactions that exclude
> those marked with the flags.  The behavior we want is that if some
> transaction is marked as vacuum, we ignore the Xid/Xmin *if there is
> one*. In other words, if there's no Xid/Xmin, then the flag is not
> important.  So if we can ensure that the flag is set first, and the
> Xid/xmin is installed later, that's sufficient correctness and we don't
> need to hold exclusive lock.

That'd require at least memory barriers in GetSnapshotData()'s loop,
which I'd really like to avoid. Otherwise the order in which memory gets
written in one process doesn't guarantee the order of visibility in
another process...



> In other words, my conclusion is that there definitely is a bug here and
> I am going to restore the use of exclusive lock for setting the
> statusFlags.

Cool.


> GetSnapshotData has an additional difficulty -- we do the
> UINT32_ACCESS_ONCE(ProcGlobal->xid[i]) read *before* testing the bit. 
> So it's not valid to set the bit without locking out GSD, regardless of
> any barriers we had; if we want this to be safe, we'd have to change
> this so that the flag is read first, and we read the xid only
> afterwards, with a read barrier.

Which we don't want, because that'd mean slowing down the *extremely*
common case of the majority of backends neither having an xid assigned
nor doing logical decoding / vacuum. We'd be reading two cachelines
instead of one.

Greetings,

Andres Freund




Re: Parallel plans and "union all" subquery

2020-11-23 Thread Greg Nancarrow
On Tue, Nov 24, 2020 at 2:34 AM Luc Vlaming  wrote:
>
> Hi,
>
> For this problem there is a patch I created, which is registered under
> https://commitfest.postgresql.org/30/2787/ that should fix this without
> any workarounds. Maybe someone can take a look at it?
>

I tried your patch with the latest PG source code (24/11), but
unfortunately a non-parallel plan was still produced in this case.

test=# explain
select count(*)
from (select
n1
from drop_me
union all
values(1)) ua;
   QUERY PLAN

 Aggregate  (cost=1889383.54..1889383.55 rows=1 width=8)
   ->  Append  (cost=0.00..1362834.03 rows=42123961 width=32)
 ->  Seq Scan on drop_me  (cost=0.00..730974.60 rows=42123960 width=32)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=32)
   ->  Result  (cost=0.00..0.01 rows=1 width=4)
(5 rows)


That's not to say your patch doesn't have merit - but maybe just not a
fix for this particular case.

As before, if the SQL is tweaked to align the types for the UNION, you
get a parallel plan:

test=# explain
select count(*)
from (select
n1
from drop_me
union all
values(1::numeric)) ua;
 QUERY PLAN

 Finalize Aggregate  (cost=821152.71..821152.72 rows=1 width=8)
   ->  Gather  (cost=821152.50..821152.71 rows=2 width=8)
 Workers Planned: 2
 ->  Partial Aggregate  (cost=820152.50..820152.51 rows=1 width=8)
   ->  Parallel Append  (cost=0.00..747235.71 rows=29166714 width=0)
 ->  Result  (cost=0.00..0.01 rows=1 width=0)
 ->  Parallel Seq Scan on drop_me
(cost=0.00..601402.13 rows=29166713 width=0)
(7 rows)


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-11-23 Thread Dagfinn Ilmari Mannsåker
David Rowley  writes:

> On Tue, 24 Nov 2020 at 12:50, Greg Nancarrow  wrote:
>> Hmmm, unfortunately this seems to break my build ...
>
>> I think your commit needs to be fixed based on the following documentation:
>>
>> https://gcc.gnu.org/onlinedocs/cpp/_005f_005fhas_005fattribute.html#g_t_005f_005fhas_005fattribute
>
> Agreed. I tested on https://godbolt.org/ with a GCC version < 5 and
> updating to what's mentioned in the GCC manual works fine.  What I had
> did not.
>
> Thanks for the report.
>
> I pushed a fix.

The Clang documentation¹ suggest an even neater solution, which would
eliminate the repetitive empty pg_attribute_foo #defines in the trailing
#else/#endif block in commit 1fa22a43a56e1fe44c7bb3a3d5ef31be5bcac41d:

#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

[1] http://clang.llvm.org/docs/LanguageExtensions.html#has-attribute

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law




Re: Online verification of checksums

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 10:35:54AM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
>> It seems reasonable to me to rely on checksums only.
>> 
>> As for retry, I think that API for concurrent I/O will be complicated.
>> Instead, we can introduce a function to read the page directly from shared
>> buffers after PAGE_RETRY_THRESHOLD attempts. It looks like a bullet-proof
>> solution to me. Do you see any possible problems with it?

It seems to me that you are missing the point here.  It is not
necessary to read a page from shared buffers.  What is necessary is to
make sure that there is zero concurrent I/O activity in shared buffers
while a page is getting checked on disk, giving the assurance that
there is zero risk of having a torn page for a check for anything
working with shared buffers.  You could do that only on a retry if we
found a page where there was a checksum mismatch, meaning that the
page was either torn or corrupted, but needs an extra verification
anyway.

> We might end up reading pages back in that have been evicted, for one
> thing, which doesn't seem great, and this also seems likely to be
> awkward for cases which aren't using the replication protocol, unless
> every process maintains a connection to PG the entire time, which also
> doesn't seem great.

I don't quite see a problem in checking pages that have just been
evicted if we are able to detect faster that a page is corrupted.
The initial check may fail because a page was torn, meaning that it
was in the middle of an eviction, but the page could also be
corrupted, meaning that it was *not* torn, and would fail a retry
where we should make sure that there is no concurrent s_b activity.
So in the worst case you make the detection of a corrupted
page faster.

Please note that Andres also mentioned the potential need to worry
about table AMs that call smgrwrite() directly, bypassing shared
buffers.  The only cases in-core where it is used are related to init
forks when an unlogged relation gets created, where it would not
matter if you are doing a page check while holding a database
transaction as the newly-created relation would not be visible yet,
but it would matter in the case of base backups doing direct page
lookups.  Fun.

> Also- what is the point of reading the page from shared buffers
> anyway..?  All we need to do is prove that the page will be rewritten
> during WAL replay.  If we can prove that, we don't actually care what
> the contents of the page are.  We certainly can't calculate the
> checksum on a page we plucked out of shared buffers since we only
> calculate the checksum when we go to write the page out.

An LSN-based check makes the thing tricky.  How do you make sure that
pd_lsn is not itself broken?  It could be perfectly possible that a
random on-disk corruption makes pd_lsn appear to have a correct value,
while the rest of the page is borked.
--
Michael


signature.asc
Description: PGP signature


Re: Online verification of checksums

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
>> Yes, and this is a tricky part. Until you explained it in your latest
>> message, I wasn't sure how we can distinguish a concurrent update from a page
>> header corruption. Now I agree that if the page LSN is updated and increased
>> between rereads, it is safe enough to conclude that we have some concurrent
>> load.
> 
> Even in this case, it's almost free to compare the LSN to the starting
> backup LSN, and to the current LSN position, and make sure it's
> somewhere between the two.  While that doesn't entirely eliminate the
> possibility that the page happened to get corrupted *and* return a
> different result on subsequent reads *and* that it was corrupted in such
> a way that the LSN ended up falling between the starting backup LSN and
> the current LSN, it's certainly reducing the chances of a false negative
> a fair bit.

FWIW, I am not much of a fan of designs that are not bullet-proof by
design.  This reduces the odds of problems, sure, but it still does not
eliminate the possibility of incorrect results, confusing users as well
as people looking at such reports.

>> To sum up, I agree with your proposal to reread the page and rely on
>> ascending LSNs. Can you submit a patch?
> 
> Probably would make sense to give Michael an opportunity to comment and
> get his thoughts on this, and for him to update the patch if he agrees.

I think that an LSN check would be a safe thing to do iff pd_checksum
is already checked first to make sure that the page contents are fine
to use.   Still, what's the point in doing an LSN check anyway if we
know that the checksum is valid?  Then on a retry, if the first attempt
failed, you also need the guarantee that there is zero concurrent I/O
activity while a page is rechecked (no need to do that unless the
initial checksum check failed).  So the retry needs
to do some s_b interactions, but then comes the much trickier point of
concurrent smgrwrite() calls bypassing the shared buffers.

> As it relates to pgbackrest, we're currently contemplating having a
> higher level loop which, upon detecting any page with an invalid
> checksum, continues to scan to the end of that file and perform the
> compression, encryption, et al, but then loops back after we've
> completed that file and skips through the file again, re-reading those
> pages which didn't have a valid checksum the first time to see if their
> LSN has changed and is within the range of the backup.  This will
> certainly give more opportunity for the kernel to 'catch up', if needed,
> and give us an updated page without a random 100ms delay, and will also
> make it easier for us to, eventually, check and make sure the page was
> in the WAL that was been produced as part of the backup, to give us a
> complete guarantee that the contents of this page don't matter and that
> the failed checksum isn't a sign of latent storage corruption.

That would reduce the likelihood of facing torn pages, but you cannot
fully eliminate the problem either, as the same page may get changed
again once you loop over it, no?  And what if a corruption has updated
pd_lsn on disk?  Unlikely, but still possible.
--
Michael




Re: enable_incremental_sort changes query behavior

2020-11-23 Thread James Coleman
On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > But I think that still leaves something missing: after all,
> > prepare_sort_from_pathkeys does know how to insert new target list
> > entries, so something either there (or in the caller/how its called)
> > has to be enforcing an apparently implicit rule about what point in
> > the tree it's safe to do such. Or even just no path generation had
> > ever considered it before (that feels unsatisfactory as an explanation
> > to me, because it feels more implicit than I'd like, but...)
>
> Hi guys,
>
> I hadn't been paying any attention to this thread, but Robert asked
> me to take a look.  A few comments:

Thanks for jumping in (and thanks Robert for asking for Tom to take a
look); I appreciate the input.

> 1. James was wondering, far upthread, why we would do projections
> pre-sort or post-sort.  I think the answers can be found by studying
> planner.c's make_sort_input_target(), which separates out what we want
> to do pre-sort and post-sort.  (There, the "sort" is envisioned as a
> standalone Sort node atop all the scan/join steps; but its decisions
> nonetheless constrain what will be considered for sorting that's
> implemented further down.)  It has a *very* long introductory comment
> laying out all the considerations.

That comment is helpful; I wish I'd discovered it sooner.

Does it imply we need to intentionally avoid SRFs also?

> 2. Robert is correct that under normal circumstances, the targetlists of
> both baserels and join rels contain only Vars.  Any computations we have
> to do are postponed to the final top-level targetlist, whether they are
> volatile or not.  The fact that this policy is applied regardless of
> volatility may explain why James isn't seeing volatility checks where he
> expected them.  The core (and, I think, only) exception to this is that
> an expression that is a sort or group key has to be evaluated earlier.
> But even then, it won't be pushed down as far as the reltargets of any
> scan or join relations; it'll be computed after the final join.

Is that primarily a project policy? Or a limitation of our current
planner (just can't push things down/carry the results back up as much
as we'd like)? Or something else? In particular, it seems plausible
there are cases where pushing down the sort on a non-Var expression to
the base rel could improve plans, so I'm wondering if there's a reason
to intentionally avoid that in the long or short run (or both).

> 3. If we have a volatile sort/group key, that constrains the set of plans
> we can validly generate, because the key expression mustn't be evaluated
> redundantly.  With the addition of parallelism, a non-parallel-safe
> sort/group key expression likewise constrains the set of valid plans,
> even if it's not considered volatile.

This makes a lot of sense (just unfortunate we had to re-discover it).

> 4. In the past, the only way we'd consider non-Var sort keys in any way
> during scan/join planning was if (a) a relation had an expression index
> matching a requested sort key; of course, that's a-fortiori going to be
> a non-volatile expression.  Or (b) we needed to sort in order to provide
> suitable input for a merge join ... but again, volatile expressions
> aren't considered candidates for mergejoin.  So there basically wasn't
> any need to consider sorting by volatiles below the top level sort/group
> processing, and that again contributes to why you don't see explicit
> tests in that area.  I'm not sure offhand whether the parallel-query
> patches added anything to guard against nonvolatile-but-parallel-unsafe
> sort expressions.  If not, there might be live bugs there independently
> of incremental sort; or that might be unreachable because of overall
> limitations on parallelism.

Interesting: so merge joins are an example of us pushing down sorts,
which I assume means (part of) the answer to my question on (2) is
that there's nothing inherently wrong/broken with evaluating
expressions lower down the tree as long as we're careful about what is
safe/unsafe with respect to volatility and parallelism?

> 5. While I've not dug far enough into the code to identify the exact
> issue, the impression I have about this bug and the parallel-unsafe
> subplan bug is that the incremental sort code is generating Paths
> without due regard to point 3.  If it is making paths based on explicit
> sorts for final-stage pathkeys, independently of either expression
> indexes or mergejoin requirements, that could well explain why it's
> got problems that weren't seen before.

Yes, that's correct. Tomas already pushed the fix for the volatility
case, and there's a new thread [1] for the parallel safety concerns.

> 6. I did look at ebb7ae839, and I suspect that the last part of
> find_em_expr_usable_for_sorting_rel(), ie the search of the reltarget
> list, is dead code.  There is no case where a volatile expression would
> appear there, per point 2.  The co

Re: deferred primary key and logical replication

2020-11-23 Thread Amit Kapila
On Tue, Nov 24, 2020 at 3:04 AM Anastasia Lubennikova
 wrote:
>
> On 27.10.2020 13:46, Amit Kapila wrote:
> > On Sun, Oct 25, 2020 at 9:39 PM Euler Taveira
> >  wrote:
> >> On Mon, 5 Oct 2020 at 08:34, Amit Kapila  wrote:
> >>> On Mon, May 11, 2020 at 2:41 AM Euler Taveira
> >>>  wrote:
>  Hi,
> 
>  While looking at an old wal2json issue, I stumbled on a scenario that a 
>  table
>  with a deferred primary key is not updatable in logical replication. 
>  AFAICS it
>  has been like that since the beginning of logical decoding and seems to 
>  be an
>  oversight while designing logical decoding.
> 
> >>> I am not sure if it is an oversight because we document that the index
> >>> must be non-deferrable, see "USING INDEX records the old values of the
> >>> columns covered by the named index, which must be unique, not partial,
> >>> not deferrable, and include only columns marked NOT NULL." in docs
> >>> [1].
> >>>
> >> Inspecting this patch again, I forgot to consider that 
> >> RelationGetIndexList()
> >> is called by other backend modules. Since logical decoding deals with 
> >> finished
> >> transactions, it is ok to use a deferrable primary key.
> >>
> > But starting PG-14, we do support logical decoding of in-progress
> > transactions as well.
> >
> >
> Commitfest entry status update.
> As far as I see, this patch needs some further work, so I move it to
> "Waiting on author".
>

I think this should be marked as "Returned with Feedback" as there has
been no response to the feedback for a long time and it is also not
very clear whether this is possible.

-- 
With Regards,
Amit Kapila.




Re: POC: postgres_fdw insert batching

2020-11-23 Thread Tomas Vondra


On 11/23/20 3:17 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Tomas Vondra 
>> I don't think this is usable in practice, because a single session
>> may be using multiple FDW servers, with different implementations,
>> latency to the data nodes, etc. It's unlikely a single GUC value
>> will be suitable for all of them.
> 
> That makes sense.  The row size varies from table to table, so the
> user may want to tune this option to reduce memory consumption.
> 
> I think the attached patch has reflected all your comments.  I hope
> this will pass..
> 

Thanks - I didn't have time for a thorough review at the moment, so I
only skimmed through the diff and did a couple of very simple tests. And
I think overall it looks quite nice.

A couple minor comments/questions:

1) We're calling it "batch_size" but the API function is named
postgresGetMaxBulkInsertTuples(). Perhaps we should rename the function
to postgresGetModifyBatchSize()? That has the advantage it'd work if we
ever add support for batching to UPDATE/DELETE.

2) Do we have to look up the batch_size in create_foreign_modify (in
server/table options)? I'd have expected to look it up while planning
the modify and then pass it through the list, just like the other
FdwModifyPrivateIndex stuff. But maybe that's not possible.

3) That reminds me - should we show the batching info on EXPLAIN? That
seems like a fairly interesting thing to show to the user. Perhaps
showing the average batch size would also be useful? Or maybe not, we
create the batches as large as possible, with the last one smaller.

4) It seems that ExecInsert executes GetMaxBulkInsertTuples() over and
over for every tuple. I don't know if that has a measurable impact, but
it seems a bit excessive IMO. I don't think we should support the batch
size changing during execution (seems tricky).
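
One way to avoid the repeated calls could be to have the executor cache
the value once per result relation and only consult the FDW callback the
first time.  A rough sketch (ri_BatchSize and GetModifyBatchSize are
illustrative names here, not the patch's actual API):

    /* illustrative sketch, names are not the patch's actual API */
    if (resultRelInfo->ri_BatchSize == 0)
    {
        if (resultRelInfo->ri_FdwRoutine != NULL &&
            resultRelInfo->ri_FdwRoutine->GetModifyBatchSize != NULL)
            resultRelInfo->ri_BatchSize =
                resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
        else
            resultRelInfo->ri_BatchSize = 1;    /* no batching support */
    }

That way the answer also cannot change mid-execution.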


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Online verification of checksums

2020-11-23 Thread Stephen Frost
Greetings,

On Mon, Nov 23, 2020 at 20:28 Michael Paquier  wrote:

> On Mon, Nov 23, 2020 at 05:28:52PM -0500, Stephen Frost wrote:
> > * Anastasia Lubennikova (a.lubennik...@postgrespro.ru) wrote:
> >> Yes and this is a tricky part. Until you have explained it in your
> latest
> >> message, I wasn't sure how we can distinct concurrent update from a page
> >> header corruption. Now I agree that if page LSN updated and increased
> >> between rereads, it is safe enough to conclude that we have some
> concurrent
> >> load.
> >
> > Even in this case, it's almost free to compare the LSN to the starting
> > backup LSN, and to the current LSN position, and make sure it's
> > somewhere between the two.  While that doesn't entirely eliminite the
> > possibility that the page happened to get corrupted *and* return a
> > different result on subsequent reads *and* that it was corrupted in such
> > a way that the LSN ended up falling between the starting backup LSN and
> > the current LSN, it's certainly reducing the chances of a false negative
> > a fair bit.
>
> FWIW, I am not much a fan of designs that are not bullet-proof by
> design.  This reduces the odds of problems, sure, still it does not
> discard the possibility of incorrect results, confusing users as well
> as people looking at such reports.


Let’s be clear about this: our checksums are, themselves, far from
bulletproof, regardless of all of our other efforts.  They are not
foolproof against every corruption, and certainly not even close to being
sufficient for the guarantees you’d expect in, say, encryption integrity.  We
cannot say with certainty that a page which passes checksum validation
isn’t corrupted in some way.  A page which doesn’t pass checksum validation
may be corrupted or may be torn, and we aren’t 100% sure of that either, but we
can work to try and make a sensible call about which it is.

>> To sum up, I agree with your proposal to reread the page and rely on
> >> ascending LSNs. Can you submit a patch?
> >
> > Probably would make sense to give Michael an opportunity to comment and
> > get his thoughts on this, and for him to update the patch if he agrees.
>
> I think that a LSN check would be a safe thing to do iff pd_checksum
> is already checked first to make sure that the page contents are fine
> to use.   Still, what's the point in doing a LSN check anyway if we
> know that the checksum is valid?  Then on a retry if the first attempt
> failed you also need the guarantee that there is zero concurrent I/O
> activity while a page is rechecked (no need to do that unless the
> initial page check doing a checksum match failed).  So the retry needs
> to do some s_b interactions, but then comes the much trickier point of
> concurrent smgrwrite() calls bypassing the shared buffers.


I agree that the LSN check isn’t interesting if the page passes the
checksum validation.  I do think we can look at the LSN and make reasonable
inferences based on it even if the checksum doesn’t validate.  In
particular, in my experience at least, the result of a read, without any
intervening write, is very likely to be the same if performed multiple
times quickly, even if there is latent storage corruption, due to caching
if nothing else.  What’s interesting about the LSN check is that we are
specifically looking to see whether it *changed* in a reasonable and
predictable manner, and that it was replaced with a new yet reasonable
value.  The chances of that happening due to latent storage corruption
are vanishingly small.
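
To make that concrete, the per-page recheck we have in mind is roughly
the following (pseudo-C sketch only; page_checksum_ok(), reread_page()
and report_invalid_page() are hypothetical placeholders, not actual
pgbackrest or PostgreSQL functions, and start_lsn/stop_lsn stand for the
backup start LSN and the current insert position):

    if (!page_checksum_ok(page, blkno))
    {
        XLogRecPtr  first_lsn = PageGetLSN(page);

        reread_page(fd, blkno, page);

        if (PageGetLSN(page) != first_lsn &&
            PageGetLSN(page) >= start_lsn &&
            PageGetLSN(page) <= stop_lsn)
        {
            /* concurrently rewritten; WAL replay will restore this page */
        }
        else
            report_invalid_page(blkno);
    }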

> As it relates to pgbackrest, we're currently contemplating having a
> > higher level loop which, upon detecting any page with an invalid
> > checksum, continues to scan to the end of that file and perform the
> > compression, encryption, et al, but then loops back after we've
> > completed that file and skips through the file again, re-reading those
> > pages which didn't have a valid checksum the first time to see if their
> > LSN has changed and is within the range of the backup.  This will
> > certainly give more opportunity for the kernel to 'catch up', if needed,
> > and give us an updated page without a random 100ms delay, and will also
> > make it easier for us to, eventually, check and make sure the page was
> > in the WAL that was been produced as part of the backup, to give us a
> > complete guarantee that the contents of this page don't matter and that
> > the failed checksum isn't a sign of latent storage corruption.
>
> That would reduce the likelyhood of facing torn pages, still you
> cannot fully discard the problem either as a same page may get changed
> again once you loop over, no?  And what if a corruption has updated
> pd_lsn on-disk?  Unlikely so, still possible.


We surely don’t care about a page which has been changed multiple times by
PG during the backup, since all those changes will be, by definition, in
the WAL, no?  Therefore, one loop to see that the value of the LSN
*changed*, meaning something wrote something new there, wi

Re: abstract Unix-domain sockets

2020-11-23 Thread Michael Paquier
On Fri, Nov 20, 2020 at 04:06:43PM +0100, Peter Eisentraut wrote:
> I think we are getting a bit sidetracked here with the message wording. The
> reason I looked at this was that "remove socket file and retry" is never an
> appropriate action with abstract sockets.  And on further analysis, it is
> never an appropriate action with any Unix-domain socket (because with file
> system namespace sockets, you never get an EADDRINUSE, so it's dead code).
> So my proposal here is to just delete that line from the hint and leave the
> rest the same.

Reading again this thread, +1 on that.
--
Michael




Re: should INSERT SELECT use a BulkInsertState?

2020-11-23 Thread Justin Pryzby
On Mon, Nov 02, 2020 at 12:45:51PM -0600, Justin Pryzby wrote:
> On Mon, Nov 02, 2020 at 07:53:45AM +0100, Luc Vlaming wrote:
> > On 30.10.20 05:51, Justin Pryzby wrote:
> > > On Thu, Oct 22, 2020 at 01:29:53PM +0100, Simon Riggs wrote:
> > > > On Fri, 16 Oct 2020 at 22:05, Justin Pryzby  
> > > > wrote:
> > > > 
> > > > > > > I made this conditional on BEGIN BULK/SET bulk, so I'll solicit 
> > > > > > > comments on that.
> > > > 
> > > > I think it would be better if this was self-tuning. So that we don't
> > > > allocate a bulkinsert state until we've done say 100 (?) rows
> > > > inserted.
> > > 
> > > I made it an optional, non-default behavior in response to the legitimate
> > > concern for performance regression for the cases where a loader needs to 
> > > be as
> > > fast as possible - as compared with our case, where we want instead to 
> > > optimize
> > > for our reports by making the loaders responsible for their own writes, 
> > > rather
> > > than leaving behind many dirty pages, and clobbering the cache, too.
> > > 
> > > Also, INSERT SELECT doesn't immediately help us (telsasoft), since we use
> > > INSERT .. VALUES () .. ON CONFLICT.  This would handle that case, which is
> > > great, even though that wasn't a design goal.  It could also be an 
> > > integer GUC
> > > to allow configuring the size of the ring buffer.
> > > 
> > > > You should also use table_multi_insert() since that will give further
> > > > performance gains by reducing block access overheads. Switching from
> > > > single row to multi-row should also only happen once we've loaded a
> > > > few rows, so we don't introduce overahads for smaller SQL statements.
> > > 
> > > Good idea...multi_insert (which reduces the overhead of individual 
> > > inserts) is
> > > mostly independent from BulkInsert state (which uses a ring-buffer to 
> > > avoid
> > > dirtying the cache).  I made this 0002.
> > > 
> > > This makes INSERT SELECT several times faster, and not clobber the cache 
> > > too.

 - Rebased on Heikki's copy.c split;
 - Rename structures without "Copy" prefix;
 - Move MultiInsert* from copyfrom.c to (tentatively) nodeModifyTable.h;
 - Move cur_lineno and transition_capture into MultiInsertInfo;

This switches to multi insert after a configurable number of tuples.
If set to -1, that provides the historic behavior that bulk inserts
can leave behind many dirty buffers.  Perhaps that should be the default.
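
The switching logic itself is basically just this (a sketch only; the
threshold GUC and the ntuples counter are illustrative names, while
GetBulkInsertState() is the existing heapam facility):

    /* illustrative sketch, not the exact patch code */
    if (mtstate->bistate == NULL &&
        bulk_insert_ntuples >= 0 &&
        mtstate->ntuples++ >= bulk_insert_ntuples)
        mtstate->bistate = GetBulkInsertState();  /* start using a ring buffer */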

I guess this shouldn't be in copy.h or in commands/* at all.
It'll be included by both: commands/copyfrom_internal.h and
executor/nodeModifyTable.h.  Maybe it should go in util or lib...
I don't know how to do it without including executor.h, which seems
to be undesirable.

-- 
Justin
>From d67c82e870c640e6f4ba25b3da5acf54df7165d2 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 8 May 2020 02:17:32 -0500
Subject: [PATCH v6 1/3] Allow INSERT SELECT to use a BulkInsertState

---
 src/backend/executor/nodeModifyTable.c | 22 --
 src/backend/parser/gram.y  |  7 ++-
 src/backend/tcop/utility.c |  4 
 src/backend/utils/misc/guc.c   | 11 +++
 src/include/executor/nodeModifyTable.h |  2 ++
 src/include/nodes/execnodes.h  |  3 +++
 src/include/parser/kwlist.h|  1 +
 src/test/regress/expected/insert.out   | 23 +++
 src/test/regress/sql/insert.sql| 13 +
 9 files changed, 83 insertions(+), 3 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29e07b7228..26ff964105 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -72,6 +72,8 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			   ResultRelInfo *targetRelInfo,
 			   TupleTableSlot *slot,
 			   ResultRelInfo **partRelInfo);
+/* guc */
+bool insert_in_bulk = false;
 
 /*
  * Verify that the tuples to be produced by INSERT or UPDATE match the
@@ -594,7 +596,7 @@ ExecInsert(ModifyTableState *mtstate,
 			table_tuple_insert_speculative(resultRelationDesc, slot,
 		   estate->es_output_cid,
 		   0,
-		   NULL,
+		   NULL, /* Bulk insert not supported */
 		   specToken);
 
 			/* insert index entries for tuple */
@@ -631,10 +633,17 @@ ExecInsert(ModifyTableState *mtstate,
 		}
 		else
 		{
+			if (proute && mtstate->prevResultRelInfo != resultRelInfo)
+			{
+if (mtstate->bistate)
+	ReleaseBulkInsertStatePin(mtstate->bistate);
+mtstate->prevResultRelInfo = resultRelInfo;
+			}
+
 			/* insert the tuple normally */
 			table_tuple_insert(resultRelationDesc, slot,
 			   estate->es_output_cid,
-			   0, NULL);
+			   0, mtstate->bistate);
 
 			/* insert index entries for tuple */
 			if (resultRelInfo->ri_NumIndices > 0)
@@ -2232,6 +2241,9 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 	mtstate->mt_arowma

Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Alvaro Herrera
On 2020-Nov-23, Tom Lane wrote:

> I never cared that much for "is_log_level_output" either.  Thinking
> about renaming it to "should_output_to_log()", and then the new function
> would be "should_output_to_client()".

Ah, that sounds nicely symmetric and grammatical.

Thanks!




Re: "as quickly as possible" (was: remove spurious CREATE INDEX CONCURRENTLY wait)

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 06:13:17PM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Please feel free to go ahead, including the change to ProcSleep.
> 
> Will do.

Thank you both for 450c823 and 789b938.
--
Michael




RE: Disable WAL logging to speed up data loading

2020-11-23 Thread osumi.takami...@fujitsu.com
Hi

On Monday, Nov 23, 2020 12:17 PM Tsunakawa, Takayuki 
 wrote:
> PREPARE TRANSACTION is the same as COMMIT in that it persists
> transaction updates.  A crash during wal_level = none loses both of them.
> So, I don't think PREPARE TRANSACTION needs special treatment.
Yeah, I got it. That makes sense.
Attached is the version with the special treatment removed.


> Does PREPARE TRANSACTION complete successfully?  I remember
> Fujii-san said it PANICed if --enable-asserts is used in configure.
Yes. That assertion failure Fujii-san reported in the past
was solved by adding the rmid != RM_XACT_ID condition in XLogInsert(),
so that WAL is still generated for transaction completion records.
The cause was that I had missed that condition, but it is fine now.
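
In other words, the rule in XLogInsert() boils down to something like
this (a sketch only; WAL_LEVEL_NONE is the value the patch introduces):

    /*
     * Sketch of the rule, not the exact patch code: with wal_level = none,
     * only transaction (RM_XACT_ID) records are still WAL-logged, so that
     * COMMIT / PREPARE TRANSACTION / COMMIT PREPARED keep working.
     */
    static inline bool
    skip_wal_record(RmgrId rmid)
    {
        return wal_level == WAL_LEVEL_NONE && rmid != RM_XACT_ID;
    }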

Plus, PREPARE TRANSACTION and COMMIT PREPARED currently work
successfully in the normal case where nothing unexpected happens.

Best,
Takamichi Osumi



disable_WAL_logging_v04.patch
Description: disable_WAL_logging_v04.patch


Re: Strange behavior with polygon and NaN

2020-11-23 Thread Kyotaro Horiguchi
At Sat, 21 Nov 2020 17:33:53 -0500, Tom Lane  wrote in 
> I went ahead and pushed 0001 and 0003 (the latter in two parts), since
> they didn't seem particularly controversial to me.  Just to keep the
> cfbot from whining, here's a rebased version of 0002.

I hadn't noticed that inf == inf should be true (in IEEE 754).

# That (inf - inf == 0) is false while (inf == inf + 0) is true is
# somewhat uneasy but, yes, it's the standard we are basing on.

So, I agree that the changes to line_construct() and line_(inv)sl()
look good to me.
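
For the record, a quick standalone C check of those cases:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  inf = INFINITY;

        printf("%d\n", inf == inf);        /* 1: Inf compares equal to Inf */
        printf("%d\n", inf - inf == 0.0);  /* 0: Inf - Inf is NaN */
        printf("%d\n", inf == inf + 0.0);  /* 1: Inf + 0 is still Inf */
        return 0;
    }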

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-23 Thread Etsuro Fujita
On Mon, Nov 23, 2020 at 5:39 PM Andrey Lepikhov
 wrote:
> On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:
> > Could I or my colleague continue this patch in a few days?  It looks it's 
> > stalled over one month.
>
> I don't found any problems with this patch that needed to be corrected.
> It is wait for actions from committers side, i think.

I'm planning to review this patch.  I think it would be better for
another pair of eyes to take a look at it, though.

Best regards,
Etsuro Fujita




Re: Terminate the idle sessions

2020-11-23 Thread David G. Johnston
On Mon, Nov 23, 2020 at 5:02 PM kuroda.hay...@fujitsu.com <
kuroda.hay...@fujitsu.com> wrote:

> No one have any comments, patch tester says OK, and I think this works
> well.
> I changed status to "Ready for Committer."
>
Some proof-reading:

v8-0001

Documentation:

My suggestion wasn't taken for the first note paragraph (review/author
disagreement) and the current text has the following issues:

"if you use some connection-pooling" software doesn't need the word "some"
Don't substitute "pg" for the name of the product, PostgreSQL.
The word "used" is a more stylistic dislike, but "connected to using
postgres_fdw" would be a better choice IMO.

Code (minor, but if you are in there anyway):

(5) turn off ... timeout (there are now two, timeouts should be plural)

v8-0002 - No suggestions

David J.


Re: Use macros for calculating LWLock offset

2020-11-23 Thread Michael Paquier
On Fri, Nov 20, 2020 at 03:25:50PM +0900, Michael Paquier wrote:
> I agree that this makes this code a bit cleaner, so let's use those
> macros.  Others may have some comments here, so let's wait a bit
> first.

Got this one committed as of d03d754.
--
Michael




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-23 Thread Bharath Rupireddy
Thanks for the review comments.

On Mon, Nov 23, 2020 at 9:57 PM Alexey Kondratov
 wrote:
>
> > v1-0001-postgres_fdw-function-to-discard-cached-connections.patch
>
> This patch looks pretty straightforward for me, but there are some
> things to be addressed IMO:
>
> +   server = GetForeignServerByName(servername, true);
> +
> +   if (server != NULL)
> +   {
>
> Yes, you return a false if no server was found, but for me it worth
> throwing an error in this case as, for example, dblink does in the
> dblink_disconnect().
>

dblink_disconnect() "Returns status, which is always OK (since any
error causes the function to throw an error instead of returning)."
This behaviour doesn't seem okay to me.

Since we return true/false, I would prefer to emit a warning (with a
reason) and return false, rather than throwing an error.

>
> + result = disconnect_cached_connections(FOREIGNSERVEROID,
> +hashvalue,
> +false);
>
> +   if (all || (!all && cacheid == FOREIGNSERVEROID &&
> +   entry->server_hashvalue == hashvalue))
> +   {
> +   if (entry->conn != NULL &&
> +   !all && cacheid == FOREIGNSERVEROID &&
> +   entry->server_hashvalue == hashvalue)
>
> These conditions look bulky for me. First, you pass FOREIGNSERVEROID to
> disconnect_cached_connections(), but actually it just duplicates 'all'
> flag, since when it is 'FOREIGNSERVEROID', then 'all == false'; when it
> is '-1', then 'all == true'. That is all, there are only two calls of
> disconnect_cached_connections(). That way, it seems that we should keep
> only 'all' flag at least for now, doesn't it?
>

I added cacheid as an argument to disconnect_cached_connections() for
reusability. Say someone wants to use it with a user mapping; then they
can pass cacheid USERMAPPINGOID and the hash value of the user mapping.
A check for cacheid == USERMAPPINGOID && entry->mapping_hashvalue ==
hashvalue can then be added to disconnect_cached_connections().

>
> Second, I think that we should just rewrite this if statement in order
> to simplify it and make more readable, e.g.:
>
> if ((all || entry->server_hashvalue == hashvalue) &&
> entry->conn != NULL)
> {
> disconnect_pg_server(entry);
> result = true;
> }
>

Yeah. I will add a cacheid check and change it to below.

 if ((all || (cacheid == FOREIGNSERVEROID &&
entry->server_hashvalue == hashvalue)) &&
 entry->conn != NULL)
 {
 disconnect_pg_server(entry);
 result = true;
 }

>
> +   if (all)
> +   {
> +   hash_destroy(ConnectionHash);
> +   ConnectionHash = NULL;
> +   result = true;
> +   }
>
> Also, I am still not sure that it is a good idea to destroy the whole
> cache even in 'all' case, but maybe others will have a different
> opinion.
>

I think we should. When we disconnect all the connections, there is no
point in keeping the connection cache hash data structure. If required,
it gets created again at the next foreign server usage in the same
session. Also, hash_destroy() frees up the memory context, unlike
hash_search() with HASH_REMOVE, so we can save a bit of memory.
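
Putting these suggestions together, the function body would then look
roughly like this (a sketch based on the current patch, so details may
still change; ConnCacheEntry and disconnect_pg_server() are the existing
pieces in contrib/postgres_fdw/connection.c):

    static bool
    disconnect_cached_connections(Oid cacheid, uint32 hashvalue, bool all)
    {
        HASH_SEQ_STATUS scan;
        ConnCacheEntry *entry;
        bool            result = false;

        if (!ConnectionHash)
            return false;

        hash_seq_init(&scan, ConnectionHash);
        while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)) != NULL)
        {
            if ((all || (cacheid == FOREIGNSERVEROID &&
                         entry->server_hashvalue == hashvalue)) &&
                entry->conn != NULL)
            {
                disconnect_pg_server(entry);
                result = true;
            }
        }

        if (all)
        {
            hash_destroy(ConnectionHash);
            ConnectionHash = NULL;
            result = true;
        }

        return result;
    }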

> >
> > v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch
> >
>
> +   entry->changing_xact_state) ||
> +   (entry->used_in_current_xact &&
> +   !keep_connections))
>
> I am not sure, but I think, that instead of adding this additional flag
> into ConnCacheEntry structure we can look on entry->xact_depth and use
> local:
>
> bool used_in_current_xact = entry->xact_depth > 0;
>
> for exactly the same purpose. Since we set entry->xact_depth to zero at
> the end of xact, then it was used if it is not zero. It is set to 1 by
> begin_remote_xact() called by GetConnection(), so everything seems to be
> fine.
>

I missed this. Thanks, we can use the local variable as you suggested.
I will change it.

>
> Otherwise, both patches seem to be working as expected. I am going to
> have a look on the last two patches a bit later.
>

Thanks. I will work on the comments so far and post updated patches soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: bug in pageinspect's "tuple data" feature

2020-11-23 Thread Michael Paquier
On Mon, Nov 23, 2020 at 09:11:26AM +0200, Heikki Linnakangas wrote:
> On 21/11/2020 21:32, Alvaro Herrera wrote:
>> This is pretty unhelpful; it would be better not to try to print the
>> data instead of dying.  With that, at least you can know where the
>> problem is.
>> 
>> This was introduced in d6061f83a166 (2015).  Proposed patch to fix it
>> (by having the code print a null "data" instead of dying) is attached.
> 
> Null seems misleading. Maybe something like "invalid", or print a warning?

How did you get into this state to begin with?  get_raw_page() uses
ReadBufferExtended() which gives some level of protection already, so
wouldn't it be better to return an ERROR with ERRCODE_DATA_CORRUPTED
and the block involved?
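Something along these lines in the tuple-data path, I mean (the message
wording is just an example, and blkno/offnum stand for whatever
identifying information is available at that point):

    if (tuple_data_is_broken)       /* whatever the detection condition is */
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("invalid tuple data in block %u, offset %u",
                        blkno, offnum)));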
--
Michael




RE: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-23 Thread tsunakawa.ta...@fujitsu.com
Andrey-san, Fujita-san,

From: Etsuro Fujita 
> On Mon, Nov 23, 2020 at 5:39 PM Andrey Lepikhov
>  wrote:
> > On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:
> > > Could I or my colleague continue this patch in a few days?  It looks it's
> stalled over one month.
> >
> > I don't found any problems with this patch that needed to be corrected.
> > It is wait for actions from committers side, i think.
> 
> I'm planning to review this patch.  I think it would be better for
> another pair of eyes to take a look at it, though.


There are the following two issues left untouched.

https://www.postgresql.org/message-id/TYAPR01MB2990DC396B338C98F27C8ED3FE1F0%40TYAPR01MB2990.jpnprd01.prod.outlook.com


Regards
Takayuki Tsunakawa



Re: Strange behavior with polygon and NaN

2020-11-23 Thread Kyotaro Horiguchi
At Fri, 20 Nov 2020 15:57:46 -0500, Tom Lane  wrote in 
> I spent some more time looking at this patch.
> 
> Some experimentation shows that the changes around bounding box
> calculation (ie float8_min_nan() and its call sites) seem to be
> completely pointless: removing them doesn't change any of the regression
> results.  Nor does using float8_min_nan() in the other two bounding-box
> calculations I'd asked about.  So I think we should just drop that set
> of changes and stick with the rule that bounding box upper and lower
> values are sorted as per float.h comparison rules.  This isn't that hard
> to work with: if you want to know whether any NaNs are in the box, test
> the upper limit (and *not* the lower limit) for isnan().  Moreover, even
> if we wanted a different coding rule, we really can't have it because we
> will still need to work with existing on-disk values that have bounding
> boxes computed the old way.

Actually, that changes the result, since that code gives a shortcut for
checking NaNs in the object coordinates. I don't think it is pointless
to avoid full calculations that are performed only to find that NaNs
are involved, if the bounding box check is meaningful.

> I don't much like anything about float8_coef_mul().  In the first place,
> FPzero() is an ugly, badly designed condition that we should be trying
> to get rid of not add more dependencies on.  In the second place, it's
> really unclear why allowing 0 times Inf to be something other than NaN
> is a good idea, and it's even less clear why allowing small-but-not-zero
> times Inf to be zero rather than Inf is a good idea.  In the third
> place, the asymmetry between the two inputs looks more like a bug than
> something we should actually want.

I have the same feeling about the function, but I concluded that
coefficients and coordinates should be regarded as different things
from a practical standpoint.

For example, consider Ax + By + C == 0: if B is 0.0, we can remove the
second term from the equation regardless of the value of y, even if it
is Inf. That is, the function imitates that kind of removal.
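
A minimal sketch of that idea (using an exact zero test here rather
than FPzero(), and not the exact patch code):

    /*
     * Multiply a line coefficient with a point coordinate.  A zero
     * coefficient removes its term entirely, even when the coordinate is
     * Inf, so that "0 * Inf" does not turn the whole expression into NaN.
     */
    static inline float8
    coef_mul(float8 coef, float8 coord)
    {
        if (coef == 0.0)
            return 0.0;
        return float8_mul(coef, coord);
    }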

> After some time spent staring at the specific case of line_closept_point
> and its callers, I decided that the real problems there are twofold.
> First, the API, or at least the callers' interpretation of this
> undocumented point, is that whether the distance is undefined (NaN) is
> equivalent to whether the closest point is undefined.  This is not so;
> in some cases we know that the distance is infinite even though we can't
> calculate a unique closest point.  Second, it's not making any attempt
> to eliminate undefined cases up front.  We can do that pretty easily
> by plugging the point's coordinates into the line's equation Ax+By+C
> and seeing whether we get a NaN.  The attached 0002 is a subset patch
> that just fixes these two issues, and I like the results it produces.

Actually the code reacts to some "problem" cases in a "wrong" way:

+* If it is unclear whether the point is on the line or not, then the
+* results are ill-defined.  This eliminates cases where any of the 
given
+* coordinates are NaN, as well as cases where infinite coordinates give
+* rise to Inf - Inf, 0 * Inf, etc.
+*/
+   if (unlikely(isnan(float8_pl(float8_pl(float8_mul(line->A, point->x),
+   
   float8_mul(line->B, point->y)),
+line->C

| postgres=# select point(1e+300, 'Infinity') <-> line('{1,0,5}');
|  ?column? 
| --
|   NaN

Aren't our guts telling us that this should be 1e+300?  You might be
thinking of putting some special-case handling into that path (as
mentioned below?), but otherwise it yields a "wrong" result.  The
reason for the expectation
is that we assume that "completely vertical" lines have a constant x
value regardless of the y coordinate.  That is the reason for the
float8_coef_mul() function.

> I wonder now whether the problems around line_interpt_line() and the
> other intersection-ish functions wouldn't be better handled in a similar
> way, by making changes to their API specs to be clearer about what
> happens with NaNs and trying to eliminate ill-defined cases explicitly.
> I've not tried to code that though.

One of the "ill-defined" cases is the zero-coefficient issue.  The
asymmetric multiply function "fixes" it, at least.  Of course it could
be open-coded instead of being a function that looks as if it had some
general interpretation.

> Changing pg_hypot() the way you've done here is right out.  See the
> comment for the function: what it is doing now is per all the relevant
> standards, and your patch breaks that.  It's extremely unlikely that
> doing it differently from IEEE and POSIX is a good idea.

Mmm. OK, I agree with that.

> Attached are the same old 0001 (adding the extra point_tbl entry)
> and a small 0002 that fixe

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-23 Thread Andrey Lepikhov




On 11/24/20 9:27 AM, tsunakawa.ta...@fujitsu.com wrote:

Andrey-san, Fujita-san,

From: Etsuro Fujita 

On Mon, Nov 23, 2020 at 5:39 PM Andrey Lepikhov
 wrote:

On 11/23/20 7:49 AM, tsunakawa.ta...@fujitsu.com wrote:

Could I or my colleague continue this patch in a few days?  It looks it's

stalled over one month.


I don't found any problems with this patch that needed to be corrected.
It is wait for actions from committers side, i think.


I'm planning to review this patch.  I think it would be better for
another pair of eyes to take a look at it, though.



There are the following two issues left untouched.

https://www.postgresql.org/message-id/TYAPR01MB2990DC396B338C98F27C8ED3FE1F0%40TYAPR01MB2990.jpnprd01.prod.outlook.com


I disagree with your opinion about changing the interface of the 
ExecRelationAllowsMultiInsert routine. If you insist on the need for 
this change, we need another opinion.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Terminate the idle sessions

2020-11-23 Thread Li Japin
Hi, Kuroda

Thanks for your review.

> On Nov 24, 2020, at 8:01 AM, kuroda.hay...@fujitsu.com wrote:
> 
> No one have any comments, patch tester says OK, and I think this works well.
> I changed status to "Ready for Committer."
> 
> Hayato Kuroda
> FUJITSU LIMITED
> 
> -Original Message-
> From: kuroda.hay...@fujitsu.com  
> Sent: Friday, November 20, 2020 11:05 AM
> To: 'japin' 
> Cc: David G. Johnston ; Kyotaro Horiguchi 
> ; Thomas Munro ; 
> bharath.rupireddyforpostg...@gmail.com; pgsql-hackers@lists.postgresql.org
> Subject: RE: Terminate the idle sessions
> 
> Dear Li,
> 
>> Thanks! Add the comment for idle-session timeout.
> 
> I confirmed it. OK.
> 
> 
> I don't have any comments anymore. If no one has,
> I will change the status few days later.
> Other comments or suggestions to him?
> 
> Best Regards,
> Hayato Kuroda
> FUJITSU LIMITED
> 

--
Best regards
Japin Li


