Re: partition tree inspection functions

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 05:18:07PM +0900, Michael Paquier wrote:
> So it seems that I am clearly outvoted here ;)
> 
> Okay, let's do as you folks propose.

And attached is a newer version with this isleaf stuff and the previous
feedback from Amit integrated, as far as I recall.  The version is
indented, tutti quanti.  Does that look fine to everybody here?
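
For anyone catching up on the thread, here is a sketch of how the
function is expected to behave with this version (the column names
relid, parentrelid, isleaf and level follow the attached patch; the
output below is illustrative, not copied from a real run):

CREATE TABLE ptab (a int) PARTITION BY LIST (a);
CREATE TABLE ptab_1 PARTITION OF ptab FOR VALUES IN (1);
SELECT relid::regclass, parentrelid::regclass, isleaf, level
  FROM pg_partition_tree('ptab'::regclass);
--  relid  | parentrelid | isleaf | level
--  ptab   |             | f      |     0
--  ptab_1 | ptab        | t      |     1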

The CF bot should normally pick up this patch; the previous garbage seen
in one of the tests seems to come from the previous implementation, which
used strings...
--
Michael
From 11bf47ee3137b6a15699bfda50b0904f2e10a415 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 5 Oct 2018 14:41:17 +0900
Subject: [PATCH] Add pg_partition_tree to display information about partitions

This new function is useful to display the full tree of partitions under
a given partitioned table, and avoids the need for any complex WITH
RECURSIVE query when looking at partition trees which are multiple levels
deep.

It returns a set of records, one for each partition, containing the
partition OID, its immediate parent OID, whether the relation is a leaf
in the tree, and its level in the partition tree with the given table
considered as root, beginning at zero for the root and incrementing by
one each time the scan goes one level down.

Author: Amit Langote
Reviewed-by: Jesper Pedersen, Michael Paquier
Discussion: https://postgr.es/m/8d00e51a-9a51-ad02-d53e-ba6bf50b2...@lab.ntt.co.jp
---
 doc/src/sgml/func.sgml   |  42 ++
 src/backend/utils/adt/Makefile   |   4 +-
 src/backend/utils/adt/partitionfuncs.c   | 137 +++
 src/include/catalog/catversion.h |   2 +-
 src/include/catalog/pg_proc.dat  |   9 ++
 src/test/regress/expected/partition_info.out |  67 +
 src/test/regress/parallel_schedule   |   2 +-
 src/test/regress/serial_schedule |   1 +
 src/test/regress/sql/partition_info.sql  |  42 ++
 9 files changed, 302 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/partitionfuncs.c
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f984d069e1..be315aaabb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20197,6 +20197,48 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.

 
+   
+Partitioning Information Functions
+
+ 
+  Name Return Type Description
+ 
+
+  
+  
+   pg_partition_tree(oid)
+   setof record
+   
+List information about the tables in a partition tree for a given
+partitioned table, consisting of one row for each partition and the
+table itself.  Information provided includes the OID of the partition,
+the OID of its immediate parent, whether the partition is a leaf, and its
+level in the hierarchy.  The value of level begins at
+0 for the input table in its role as the root of
+the partition tree, 1 for its partitions,
+2 for their partitions, and so on.
+   
+  
+ 
+
+   
+
+   
+To check the total size of the data contained in the
+measurement table described in
+, one could use the
+following query:
+   
+
+
+=# SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size
+ FROM pg_partition_tree('measurement'::regclass);
+ total_size 
+------------
+ 24 kB
+(1 row)
+
+
   
 
   
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 4b35dbb8bb..132ec7620c 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -20,8 +20,8 @@ OBJS = acl.o amutils.o arrayfuncs.o array_expanded.o array_selfuncs.o \
 	jsonfuncs.o like.o lockfuncs.o mac.o mac8.o misc.o nabstime.o name.o \
 	network.o network_gist.o network_selfuncs.o network_spgist.o \
 	numeric.o numutils.o oid.o oracle_compat.o \
-	orderedsetaggs.o pg_locale.o pg_lsn.o pg_upgrade_support.o \
-	pgstatfuncs.o \
+	orderedsetaggs.o partitionfuncs.o pg_locale.o pg_lsn.o \
+	pg_upgrade_support.o pgstatfuncs.o \
 	pseudotypes.o quote.o rangetypes.o rangetypes_gist.o \
 	rangetypes_selfuncs.o rangetypes_spgist.o rangetypes_typanalyze.o \
 	regexp.o regproc.o ri_triggers.o rowtypes.o ruleutils.o \
diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
new file mode 100644
index 00..c1743524c8
--- /dev/null
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -0,0 +1,137 @@
+/*-
+ *
+ * partitionfuncs.c
+ *	  Functions for accessing partition-related metadata
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ 

Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-10-04 Thread Masahiko Sawada
On Fri, Oct 5, 2018 at 12:16 PM Michael Paquier  wrote:
>
> On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> > Thank you for the comment.  Attached is the updated patch.
>
> So, I have come back to this stuff, and finished with the attached
> instead, so that the assertion is in a single place.  I find that
> clearer.  The comments have also been improved.  Thoughts?

Thank you! The patch looks good to me.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Multiple primary key on partition table?

2018-10-04 Thread amul sul
On Thu, Oct 4, 2018 at 8:25 PM Alvaro Herrera  wrote:
>
> On 2018-Oct-01, Rajkumar Raghuwanshi wrote:
>
> > On Tue, Sep 18, 2018 at 11:20 AM amul sul  wrote:
> >
> > > Here is the complete patch proposing the aforesaid fix with a regression 
> > > test.
> >
> > Thanks, this worked for me.
>
> Yeah, looks good to me, pushed.  I added one more regression test to
> ensure that the PRIMARY KEY clause in the partition is still accepted if
> the parent does not have one.
>

Thanks a lot for enhancing the regression test and committing the fix.

Regards,
Amul



Assertion failure with ALTER TABLE ATTACH PARTITION with log_min_messages >= DEBUG1

2018-10-04 Thread Michael Paquier
Hi all,

Running installcheck on an instance with log_min_messages = DEBUG1, I
can bump into the following assertion failure:
#2  0x56145231e82c in ExceptionalCondition
(conditionName=0x56145258ae0b "!(strvalue != ((void *)0))",
errorType=0x56145258adfb "FailedAssertion",
fileName=0x56145258adf0 "snprintf.c", lineNumber=440) at assert.c:54
[...]
#7  0x56145231f518 in errmsg (fmt=0x5614524dac60 "validating foreign
key constraint \"%s\"") at elog.c:796
#8  0x561451f6ab54 in validateForeignKeyConstraint (conname=0x0,
rel=0x7f12833ca750, pkrel=0x7f12833cc468, pkindOid=36449,
constraintOid=36466) at tablecmds.c:8566
#9  0x561451f61589 in ATRewriteTables (parsetree=0x561453bde5e0,
wqueue=0x7ffe8f1d55e8, lockmode=8) at tablecmds.c:4549

Looking at the stack trace, there is this log in
validateForeignKeyConstraint:
ereport(DEBUG1,
(errmsg("validating foreign key constraint \"%s\"", conname)));

However conname is set to NULL in this code path.

This test case makes it easy to reproduce the failure:
CREATE TABLE fk_notpartitioned_pk (a int, b int, PRIMARY KEY (a, b));
CREATE TABLE fk_partitioned_fk (b int, a int) PARTITION BY RANGE (a, b);
CREATE TABLE fk_partitioned_fk_1 (b int, a int);
ALTER TABLE fk_partitioned_fk ADD FOREIGN KEY (a, b) REFERENCES
fk_notpartitioned_pk;
-- crash
ALTER TABLE fk_partitioned_fk ATTACH PARTITION fk_partitioned_fk_1 FOR
  VALUES FROM (1,1) TO (2,2);

From what I can see, the problem comes from CloneForeignKeyConstraints,
which forgets to assign the constraint name when cloning the FK
definition.  While looking at the ATTACH PARTITION code, I have noticed
that a variable gets shadowed, which is in my opinion bad style.  So the
problem is rather close to what Tom has fixed in 3d0f68dd, it seems.

Attached is a patch for all that, with which installcheck-world passes
for me.  I am surprised this was not noticed before; the recent snprintf
assertion is helping nicely here.  This would need to be back-patched
down to v11.

Thanks,
--
Michael
diff --git a/src/backend/catalog/pg_constraint.c b/src/backend/catalog/pg_constraint.c
index 6781b00c6e..2063abb8ae 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -574,6 +574,7 @@ CloneForeignKeyConstraints(Oid parentId, Oid relationId, List **cloned)
 
 		fkconstraint = makeNode(Constraint);
 		/* for now this is all we need */
+		fkconstraint->conname = pstrdup(NameStr(constrForm->conname));
 		fkconstraint->fk_upd_action = constrForm->confupdtype;
 		fkconstraint->fk_del_action = constrForm->confdeltype;
 		fkconstraint->deferrable = constrForm->condeferrable;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c145385f84..7df1fc2a76 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14275,21 +14275,21 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd)
			   RelationGetRelid(attachrel), &cloned);
 	foreach(l, cloned)
 	{
-		ClonedConstraint *cloned = lfirst(l);
+		ClonedConstraint *clonedcon = lfirst(l);
 		NewConstraint *newcon;
 		Relation	clonedrel;
 		AlteredTableInfo *parttab;
 
-		clonedrel = relation_open(cloned->relid, NoLock);
+		clonedrel = relation_open(clonedcon->relid, NoLock);
 		parttab = ATGetQueueEntry(wqueue, clonedrel);
 
 		newcon = (NewConstraint *) palloc0(sizeof(NewConstraint));
-		newcon->name = cloned->constraint->conname;
+		newcon->name = clonedcon->constraint->conname;
 		newcon->contype = CONSTR_FOREIGN;
-		newcon->refrelid = cloned->refrelid;
-		newcon->refindid = cloned->conindid;
-		newcon->conid = cloned->conid;
-		newcon->qual = (Node *) cloned->constraint;
+		newcon->refrelid = clonedcon->refrelid;
+		newcon->refindid = clonedcon->conindid;
+		newcon->conid = clonedcon->conid;
+		newcon->qual = (Node *) clonedcon->constraint;
 
 		parttab->constraints = lappend(parttab->constraints, newcon);
 




Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 10/4/18 8:34 PM, Michael Paquier wrote:
>> I am suggesting to fix the issue after RC1 is released, but before GA.

> That approach would mean we would require an RC2, which would further
> delay the GA.

Not sure about that.  Alvaro seems to think there's a generic problem
in event trigger processing, which, if true, was likely there pre-v11.
I don't think that patches that get back-patched further than 11
need to restart the RC clock.

> Ideally it would be great if this was fixed for RC1. However, given the
> choice of pushing the release out further vs. saving the fix for .1
> which would be relatively soon, I would vote for the latter.

There are definitely good calendar reasons to hold to the schedule of
RC1 next week and GA the week after.  I'd only want to blow up that
plan if we hit something that is both very bad and very hard to fix.

However, if we have a fix that we believe in that's available post-RC1,
I'm not sure I see why it's better to sit on it till after GA.

regards, tom lane



Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Jonathan S. Katz
On 10/4/18 8:34 PM, Michael Paquier wrote:
> On Thu, Oct 04, 2018 at 04:54:45PM -0700, Andres Freund wrote:
>> Are you suggesting we fix after RC1, or delay RC1?  I'm not 100% sure
>> I'm parsing your sentence correctly.
> 
> I am suggesting to fix the issue after RC1 is released, but before GA.

That approach would mean we would require an RC2, which would further
delay the GA.

Based on our release process and various schedules, any delays to the GA
date at this point would push the release well into mid-November. Part
of the reason that we selected the current proposed date for the GA is
that we would have the .1 available 3 weeks after the release during the
regularly scheduled cumulative update.

Ideally it would be great if this was fixed for RC1. However, given the
choice of pushing the release out further vs. saving the fix for .1
which would be relatively soon, I would vote for the latter.

Jonathan





Re: pgsql: Improve autovacuum logging for aggressive and anti-wraparound ru

2018-10-04 Thread Michael Paquier
On Wed, Oct 03, 2018 at 04:37:29PM +0900, Masahiko Sawada wrote:
> Thank you for the comment.  Attached is the updated patch.

So, I have come back to this stuff, and finished with the attached
instead, so that the assertion is in a single place.  I find that
clearer.  The comments have also been improved.  Thoughts?
--
Michael
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8996d366e9..e73b548ac3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -249,6 +249,10 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	if (options & VACOPT_DISABLE_PAGE_SKIPPING)
 		aggressive = true;
 
+	/* vacuum preventing wraparound must be aggressive */
+	Assert((params->is_wraparound && aggressive) ||
+		   !params->is_wraparound);
+
 	vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
 	vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
@@ -376,10 +380,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
			initStringInfo(&buf);
 			if (params->is_wraparound)
 			{
-if (aggressive)
-	msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
-else
-	msgfmt = _("automatic vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
+/* autovacuum preventing wraparound has to be aggressive */
+msgfmt = _("automatic aggressive vacuum to prevent wraparound of table \"%s.%s.%s\": index scans: %d\n");
 			}
 			else
 			{




Re: Odd 9.4, 9.3 buildfarm failure on s390x

2018-10-04 Thread Tom Lane
Thomas Munro  writes:
> I suppose someone might mistake this for a function that converts -42
> to 42... would something like INVERT_COMPARE_RESULT() be better?

I have no particular allegiance to the macro name; it's just the
first idea that came to mind.

regards, tom lane



Re: Odd 9.4, 9.3 buildfarm failure on s390x

2018-10-04 Thread Thomas Munro
On Fri, Oct 5, 2018 at 3:12 PM Tom Lane  wrote:
> Here's a draft patch against HEAD for this.

+ * Invert the sign of a qsort-style comparison result, ie, exchange negative
+ * and positive integer values, being careful not to get the wrong answer
+ * for INT_MIN.  The argument should be an integral variable.
+ */
+#define INVERT_SIGN(var) \
+((var) = ((var) < 0) ? 1 : -(var))

I suppose someone might mistake this for a function that converts -42
to 42... would something like INVERT_COMPARE_RESULT() be better?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Odd 9.4, 9.3 buildfarm failure on s390x

2018-10-04 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2018-10-01 12:13:57 -0400, Tom Lane wrote:
>>> (2) Drop the restriction.  This'd require at least changing the
>>> DESC correction, and maybe other things.  I'm not sure what the
>>> odds would be of finding everyplace we need to check.

>> (2) seems more maintainable to me (or perhaps less unmaintainable). It's
>> infrastructure, rather than every datatype + support out there...

> I guess we could set up some testing infrastructure: hack int4cmp
> and/or a couple other popular comparators so that they *always*
> return INT_MIN, 0, or INT_MAX, and then see what falls over.

Here's a draft patch against HEAD for this.

I looked for problem spots by (a) testing with the STRESS_SORT_INT_MIN
option I added in nbtcompare.c, (b) grepping for "x = -x" type code,
and (c) grepping for "return -x" type code.  (b) and (c) found several
places that (a) didn't, which does not give me a warm feeling about
whether I have found quite everything.

I changed a couple of places where things might've been safe but
I didn't feel like chasing the calls to prove it (e.g. imath.c),
and contrariwise I left a *very* small number of places alone
because they were inverting the result of a specific function
that is defined to return 1/0/-1 and nothing else.

regards, tom lane

diff --git a/contrib/pgcrypto/imath.c b/contrib/pgcrypto/imath.c
index cd528bf..ca4995a 100644
--- a/contrib/pgcrypto/imath.c
+++ b/contrib/pgcrypto/imath.c
@@ -1254,11 +1254,9 @@ mp_int_compare(mp_int a, mp_int b)
 		 * If they're both zero or positive, the normal comparison applies; if
 		 * both negative, the sense is reversed.
 		 */
-		if (sa == MP_ZPOS)
-			return cmp;
-		else
-			return -cmp;
-
+		if (sa != MP_ZPOS)
+			INVERT_SIGN(cmp);
+		return cmp;
 	}
 	else
 	{
@@ -1314,10 +1312,9 @@ mp_int_compare_value(mp_int z, int value)
 	{
 		cmp = s_vcmp(z, value);
 
-		if (vsign == MP_ZPOS)
-			return cmp;
-		else
-			return -cmp;
+		if (vsign != MP_ZPOS)
+			INVERT_SIGN(cmp);
+		return cmp;
 	}
 	else
 	{
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index 8bd0bad..c16825e 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -228,11 +228,8 @@
   < B, A
   = B,
   or A >
-  B, respectively.  The function must not
-  return INT_MIN for the A
-  < B case,
-  since the value may be negated before being tested for sign.  A null
-  result is disallowed, too.
+  B, respectively.
+  A null result is disallowed: all values of the data type must be comparable.
   See src/backend/access/nbtree/nbtcompare.c for
   examples.
  
diff --git a/src/backend/access/nbtree/nbtcompare.c b/src/backend/access/nbtree/nbtcompare.c
index b1855e8..6f2ad23 100644
--- a/src/backend/access/nbtree/nbtcompare.c
+++ b/src/backend/access/nbtree/nbtcompare.c
@@ -22,11 +22,10 @@
  *
  *	The result is always an int32 regardless of the input datatype.
  *
- *	Although any negative int32 (except INT_MIN) is acceptable for reporting
- *	"<", and any positive int32 is acceptable for reporting ">", routines
+ *	Although any negative int32 is acceptable for reporting "<",
+ *	and any positive int32 is acceptable for reporting ">", routines
  *	that work on 32-bit or wider datatypes can't just return "a - b".
- *	That could overflow and give the wrong answer.  Also, one must not
- *	return INT_MIN to report "<", since some callers will negate the result.
+ *	That could overflow and give the wrong answer.
  *
  *	NOTE: it is critical that the comparison function impose a total order
  *	on all non-NULL values of the data type, and that the datatype's
@@ -44,13 +43,31 @@
  *	during an index access won't be recovered till end of query.  This
  *	primarily affects comparison routines for toastable datatypes;
  *	they have to be careful to free any detoasted copy of an input datum.
+ *
+ *	NOTE: we used to forbid comparison functions from returning INT_MIN,
+ *	but that proves to be too error-prone because some platforms' versions
+ *	of memcmp() etc can return INT_MIN.  As a means of stress-testing
+ *	callers, this file can be compiled with STRESS_SORT_INT_MIN defined
+ *	to cause many of these functions to return INT_MIN or INT_MAX instead of
+ *	their customary -1/+1.  For production, though, that's not a good idea
+ *	since users or third-party code might expect the traditional results.
  *-
  */
 #include "postgres.h"
 
+#include <limits.h>
+
 #include "utils/builtins.h"
 #include "utils/sortsupport.h"
 
+#ifdef STRESS_SORT_INT_MIN
+#define A_LESS_THAN_B		INT_MIN
+#define A_GREATER_THAN_B	INT_MAX
+#else
+#define A_LESS_THAN_B		(-1)
+#define A_GREATER_THAN_B	1
+#endif
+
 
 Datum
 btboolcmp(PG_FUNCTION_ARGS)
@@ -95,11 +112,11 @@ btint4cmp(PG_FUNCTION_ARGS)
 	int32		b = PG_GETARG_INT32(1);
 
 	if (a > b)
-		PG_RETURN_INT32(1);
+		PG_RETURN_INT32(A_GREATER_THAN_B);
 	else if (a == b)
 		PG_RETURN_INT32(0);
 	else
-		PG_RETURN_INT32(-1);
+		

Re: executor relation handling

2018-10-04 Thread Amit Langote
On 2018/10/05 5:59, Tom Lane wrote:
> Amit Langote  writes:
>> I've rebased the remaining patches.  I broke down one of the patches into
>> 2 and re-ordered the patches as follows:
> 
>> 0001: introduces a function that opens range table relations and maintains
>> them in an array indexed by RT index
> 
>> 0002: introduces a new field in EState that's an array of RangeTblEntry
>> pointers and revises macros used in the executor that access RTEs to
>> return them from the array (David Rowley co-authored this one)
> 
> I've pushed 0001 and 0002 with mostly cosmetic changes.

Thanks a lot.

> One thing I
> wanted to point out explicitly, though, is that I found this bit of 0002
> to be a seriously bad idea:
> 
> --- a/src/include/nodes/execnodes.h
> +++ b/src/include/nodes/execnodes.h
> @@ -20,6 +20,7 @@
>  #include "executor/instrument.h"
>  #include "lib/pairingheap.h"
>  #include "nodes/params.h"
> +#include "nodes/parsenodes.h"
>  #include "nodes/plannodes.h"
>  #include "utils/hsearch.h"
>  #include "utils/queryenvironment.h"
> 
> Please do not add #includes of fundamental headers to other fundamental
> headers without clearing it with somebody.  There's little enough
> structure to our header collection now.  I don't want to end up in a
> situation where effectively the entire header set gets pulled into
> every .c file, or worse that we have actual reference loops in the
> headers.  (This is not an academic problem; somebody actually created
> such a loop awhile back.  Cleaning it up, by the time we'd recognized
> the problem, was really painful.)

Okay, sorry about that.  I was slightly nervous about having to do it at
the time, but forgot to mention it explicitly in the commit message or
the email.

>> 0003: moves result relation and ExecRowMark initialization out of InitPlan
>> and into ExecInit* routines of respective nodes
> 
> I am finding myself pretty unconvinced by this one; it seems like mostly
> a random reallocation of responsibility with little advantage.  The
> particular thing that brought me to a screeching halt was seeing that
> the patch removed ExecFindRowMark, despite the fact that that's part
> of our advertised FDW API (see fdwhandler.sgml), and it didn't provide
> any alternative way for an FDW to find out at runtime whether it's
> subject to a row locking requirement.
> 
> I thought for a minute about just leaving the function in place, but
> that wouldn't work because both nodeLockRows and nodeModifyTable are
> written so that they find^H^H^Hbuild their rowmarks only after recursing
> to initialize their child plan nodes; so a child node that tried to use
> ExecFindRowMark during ExecInitNode would get the wrong answer.  Of
> course, we could consider changing the order of operations during
> initialization of those node types, but I'm not really seeing a compelling
> reason why we should whack things around that much.
> 
> So I'm inclined to just omit 0003.  AFAICS this would only mean that
> we couldn't drop the global PlanRowMarks list from PlannedStmt, which
> does not bother me much.

To be honest, I too had begun to fail to see the point of this patch
since yesterday.  In fact, getting this one to pass make check-world took
a bit of head-scratching due to its interaction with EvalPlanQual's
EState building, so I was almost ready to drop it from the series.  But I
felt that it might still be a good idea to get rid of what was described
as special-case code.  Reading your argument against it, based on how it
messes up some of the assumptions of the ExecRowMark design, I'm willing
to let go of this one as more or less a cosmetic improvement.

>> 0005: teaches planner to remove PlanRowMarks corresponding to dummy relations
> 
> I'm not entirely sold on the value of that either?

If you look at the first email on this thread, you can see that trimming
the PlanRowMarks list down to just the ones needed during execution
results in a non-trivial improvement for SELECT FOR SHARE on partitioned
tables with a large number of partitions:

Speedup is more pronounced with a benchmark that needs RowMarks, because
one of the patches (0003) removes overhead around handling them.

$ cat /tmp/select-lt-for-share.sql
select * from lt where b = 999 for share;

master

tps = 94.095985 (excluding connections establishing)
tps = 93.955702 (excluding connections establishing)



patch 0003 (prune PlanRowMarks)

tps = 712.544029 (excluding connections establishing)
tps = 717.540052 (excluding connections establishing)


But on reflection, this seems more like adding special-case code to the
planner just for this, rather than solving the more general problem of
initializing *any* partition information after pruning, being discussed
over at the "speeding up planning with partitions" thread [1].  So, I
don't mind dropping this one too.


So, that leaves us with only the patch that revises the plan node fields
in light of having relieved the executor of doing any locking of its own
in *some* cases.  That 

Re: [HACKERS] proposal: schema variables

2018-10-04 Thread Pavel Stehule
On Wed, 3 Oct 2018 at 01:01, Thomas Munro
wrote:

> On Sun, Sep 30, 2018 at 11:20 AM Pavel Stehule 
> wrote:
> > I hope so now, there are almost complete functionality. Please, check it.
>
> Hi Pavel,
>
> FYI there is a regression test failure on Windows:
>
> plpgsql ... FAILED
>
> *** 4071,4077 
> end;
> $$ language plpgsql;
> select stacked_diagnostics_test();
> - NOTICE: sqlstate: 22012, message: division by zero, context:
> [PL/pgSQL function zero_divide() line 4 at RETURN <- SQL statement
> "SELECT zero_divide()" <- PL/pgSQL function stacked_diagnostics_test()
> line 6 at PERFORM]
> + NOTICE: sqlstate: 42702, message: column reference "v" is ambiguous,
> context: [PL/pgSQL function zero_divide() line 4 at RETURN <- SQL
> statement "SELECT zero_divide()" <- PL/pgSQL function
> stacked_diagnostics_test() line 6 at PERFORM]
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.15234


Please check the attached patch.

Thank you for the report.

Pavel


>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


schema-variables-20181004-02.patch.gz
Description: application/gzip


TAP tests for pg_verify_checksums

2018-10-04 Thread Michael Paquier
Hi all,

The topic of $subject has been discussed a few times, resulting in a
couple of patches along the way:
https://www.postgresql.org/message-id/20180830200258.gg15...@paquier.xyz
https://www.postgresql.org/message-id/cabuevezekrwpegk2o-ov4z2mjft6cu8clfa-v1s1j4z8x7w...@mail.gmail.com

However, nothing has actually happened.  Based on the previous feedback,
attached is an updated patch to do the actual job.

Magnus Hagander and Michael Banck need to be credited as authors, as
this patch is roughly a merge of what they proposed.

Thoughts?
--
Michael
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 5dc629fd5e..610887e5d4 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -8,7 +8,7 @@ use Fcntl ':mode';
 use File::stat qw{lstat};
 use PostgresNode;
 use TestLib;
-use Test::More tests => 18;
+use Test::More tests => 21;
 
 my $tempdir = TestLib::tempdir;
 my $xlogdir = "$tempdir/pgxlog";
@@ -58,6 +58,12 @@ mkdir $datadir;
 			"check PGDATA permissions");
 	}
 }
+
+# Control file should tell that data checksums are disabled by default.
+command_like(['pg_controldata', $datadir],
+			 qr/Data page checksum version:.*0/,
+			 'checksums are disabled in control file');
+
 command_ok([ 'initdb', '-S', $datadir ], 'sync only');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
diff --git a/src/bin/pg_verify_checksums/.gitignore b/src/bin/pg_verify_checksums/.gitignore
index d1dcdaf0dd..0e5e569a54 100644
--- a/src/bin/pg_verify_checksums/.gitignore
+++ b/src/bin/pg_verify_checksums/.gitignore
@@ -1 +1,3 @@
 /pg_verify_checksums
+
+/tmp_check/
diff --git a/src/bin/pg_verify_checksums/Makefile b/src/bin/pg_verify_checksums/Makefile
index d16261571f..cfe4ab1b8b 100644
--- a/src/bin/pg_verify_checksums/Makefile
+++ b/src/bin/pg_verify_checksums/Makefile
@@ -34,3 +34,9 @@ uninstall:
 clean distclean maintainer-clean:
 	rm -f pg_verify_checksums$(X) $(OBJS)
 	rm -rf tmp_check
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/pg_verify_checksums/t/001_basic.pl b/src/bin/pg_verify_checksums/t/001_basic.pl
new file mode 100644
index 00..1fa2e12db2
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/001_basic.pl
@@ -0,0 +1,8 @@
+use strict;
+use warnings;
+use TestLib;
+use Test::More tests => 8;
+
+program_help_ok('pg_verify_checksums');
+program_version_ok('pg_verify_checksums');
+program_options_handling_ok('pg_verify_checksums');
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
new file mode 100644
index 00..7423595181
--- /dev/null
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -0,0 +1,52 @@
+# Do basic sanity checks supported by pg_verify_checksums using
+# an initialized cluster.
+
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 8;
+
+# Initialize node with checksums enabled.
+my $node = get_new_node('node_checksum');
+$node->init(extra => ['--data-checksums']);
+my $pgdata = $node->data_dir;
+
+# Control file should know that checksums are enabled.
+command_like(['pg_controldata', $pgdata],
+	 qr/Data page checksum version:.*1/,
+		 'checksums enabled in control file');
+
+# Checksums pass on a newly-created cluster
+command_ok(['pg_verify_checksums',  '-D', $pgdata],
+		   "checksum checks done and passing");
+
+# Checks cannot happen for an online cluster
+$node->start;
+command_fails(['pg_verify_checksums',  '-D', $pgdata],
+			  "checksum checks not done");
+
+# Create table to corrupt and get its relfilenode
+$node->safe_psql('postgres',
+	"SELECT a INTO corrupt1 FROM generate_series(1,1) AS a;
+	ALTER TABLE corrupt1 SET (autovacuum_enabled=false);");
+
+my $file_corrupted = $node->safe_psql('postgres',
+	"SELECT pg_relation_filepath('corrupt1')");
+
+# Set page header and block size
+my $pageheader_size = 24;
+my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
+$node->stop;
+
+# Time to create a corruption
+open my $file, '+<', "$pgdata/$file_corrupted";
+seek($file, $pageheader_size, 0);
+syswrite($file, '\0\0\0\0\0\0\0\0\0');
+close $file;
+
+$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata],
+		  1,
+		  [qr/Bad checksums:  1/s],
+		  [qr/checksum verification failed/s],
		  'pg_verify_checksums reports checksum mismatch');




Re: Tid scan improvements

2018-10-04 Thread Edmund Horner
On Wed, 3 Oct 2018 at 17:36, David Rowley  wrote:
> I know commit fest is over, but I made a pass of this to hopefully
> provide a bit of guidance so that it's closer for the November 'fest.

Hi David.  Thanks for the review.  It's fairly thorough and you must
have put some time into it -- I really appreciate it.


> I've only done light testing on the patch and it does seem to work,
> but there are a few things that I think should be changed. Most
> importantly #11 below I think needs to be done. That might overwrite
> some of the items that come before it in the list as you likely will
> have to pull some of code which I mention out out due to changing #11.
> I've kept them around anyway just in case some of it remains.

> 1. Could wrap for tables > 16TB. Please use double. See index_pages_fetched()
> 2. Should multiply by nseqpages, not add.
> 3. Should be double.

I agree with these three.


> 4. Comment needs updated to mention what the new code does in
> heapgettup() and heapgettup_pagemode()
>
> +
>   /* start from last page of the scan */
> - if (scan->rs_startblock > 0)
> - page = scan->rs_startblock - 1;
> + if (scan->rs_numblocks == InvalidBlockNumber)
> + {
> + if (scan->rs_startblock > 0)
> + page = scan->rs_startblock - 1;
> + else
> + page = scan->rs_nblocks - 1;
> + }
>   else
> - page = scan->rs_nblocks - 1;
> + page = scan->rs_startblock + scan->rs_numblocks - 1;
> +

I'm thinking that, as they don't depend on the others, the heapam.c
changes should be a separate preparatory patch?

The heap scan code has to support things like synchronised scans and
parallel scans, but as far as I know, its support for scanning
subranges is currently used only for building BRIN indexes.  I found
that although I could specify a subrange with heap_setscanlimits, I
could not scan backward over it, because the original version of the
above code would start at the end of the whole table.

I'm not especially comfortable with this understanding of heapam, so
close review would be appreciated.

I note that there's a lot of common code in heapgettup and
heapgettup_pagemode, which my changes add to.  It might be worth
trying to factor it out somehow.


> 5. Variables should be called "inclusive". We use "strict" to indicate
> an operator comparison cannot match NULL values.
>
> + bool strict; /* Indicates < rather than <=, or > rather */
> + bool strict2; /* than >= */
>
> Don't break the comment like that. If you need more space don't end
> the comment and use a new line and tab the next line out to match the
> * of the first line.

> 8. Many instances of the word "strict" are used to mean "inclusive".
> Can you please change all of them.

I don't mind renaming it.  I took "strict" from "strictly greater/less
than" but I knew it was confusable with the other usages of "strict".


> 9. Confusing comment:
>
> + * If the expression was non-strict (<=) and the offset is 0, then just
> + * pretend it was strict, because offset 0 doesn't exist and we may as
> + * well exclude that block.
>
> Shouldn't this be, "If the operator is non-inclusive, then since TID
> offsets are 1-based, for simplicity, we can just class the expression
> as inclusive.", or something along those lines.

Ok, I'll try to reword it along those lines.


> I think I'm going to stop here as changing this going to cause quite a
> bit of churn.
>
> but one more...

> 12. I think the changes to selfuncs.c to get the selectivity estimate
> is a fairly credible idea, but I think it also needs to account for
> offsets. You should be able to work out the average number of items
> per page with rel->tuples / rel->pages and factor that in to get a
> better estimate for cases like:
>
> postgres=# explain analyze select ctid,* from t1 where ctid <= '(0,200)';
>   QUERY PLAN
> ---
>  Tid Scan on t1  (cost=0.00..5.00 rows=1 width=10) (actual
> time=0.025..0.065 rows=200 loops=1)
>TID Cond: (ctid <= '(0,200)'::tid)
>  Planning Time: 0.081 ms
>  Execution Time: 0.088 ms
> (4 rows)
>
> You can likely add on "(offset / avg_tuples_per_page) / rel->pages" to
> the selectivity and get a fairly accurate estimate... at least when
> there are no dead tuples in the heap

I think the changes to selfuncs.c could also be a separate patch?

I'll try to include the offset in the selectivity too.
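
As a sanity check of that formula against David's example above --
assuming, purely for illustration, rel->pages = 10 and rel->tuples =
2000, i.e. 200 tuples per page -- the arithmetic for ctid <= '(0,200)'
comes out near the actual 200 rows:

-- hypothetical stats: 10 pages, 2000 tuples => 200 tuples per page
-- zero whole blocks, plus 200 of the 200 tuples on block 0:
SELECT (0 + 200.0 / (2000.0 / 10)) / 10 AS est_selectivity;  -- 0.10
-- 0.10 * 2000 tuples = 200 rows, matching the actual row count above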

Related -- what should the selectivity be on an empty table?  My code has:

/* If the relation's empty, we're going to read all of it. */
if (vardata->rel->pages == 0)
return 1.0;

(which needs rewording, since selectivity isn't always about reading).
Is 1.0 the right thing to return?


> 6. Why not pass the TidExpr into MakeTidOpExprState() and have it set
> the type instead of repeating code
> 7. It's not very obvious why the following Assert() can't fail. [...]
> I had to hunt around quite a bit to see that
> 

Re: pg_ls_tmpdir()

2018-10-04 Thread Bossart, Nathan
On 10/4/18, 7:31 PM, "Michael Paquier"  wrote:
> Committed this way with a catalog version bump.

Thanks!

Nathan



Re: snprintf assert is broken by plpgsql #option dump

2018-10-04 Thread Pavel Stehule
On Thu, 4 Oct 2018 at 23:57, Tom Lane wrote:

> I wrote:
> > Pavel Stehule  writes:
> >> I found two parts
>
> > Thanks for the report, will push something.
>
> On closer look, I'm not sure that these are the only places that are
> assuming that any PLpgSQL_variable struct has a refname.  What seems
> like a safer answer is to make sure they all do, more or less as
> attached.
>


+1

Pavel


> regards, tom lane
>
>


Re: Early WIP/PoC for inlining CTEs

2018-10-04 Thread Andrew Gierth
> "Andreas" == Andreas Karlsson  writes:

 > On 10/03/2018 05:57 PM, David Fetter wrote:
 >> Is there any meaningful distinction between "inlining," by which I
 >> mean converting to a subquery, and predicate pushdown, which
 >> would happen at least for a first cut, at the rewrite stage?

Yes.

 Andreas> Sorry, but I do not think I understand your question. The
 Andreas> ability to push down predicates is just one of the potential
 Andreas> benefits from inlining.

Consider the difference between (in the absence of CTE inlining):

-- inline subquery with no optimization barrier (qual may be pushed down)
select * from (select x from y) s where x=1;

-- inline subquery with optimization barrier (qual not pushed down)
select * from (select x from y offset 0) s where x=1;

-- CTE with materialization
with s as (select x from y) select * from s where x=1;

-- 
Andrew (irc:RhodiumToad)



Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 04:54:45PM -0700, Andres Freund wrote:
> Are you suggesting we fix after RC1, or delay RC1?  I'm not 100% sure
> I'm parsing your sentence correctly.

I am suggesting to fix the issue after RC1 is released, but before GA.
--
Michael




Re: pg_ls_tmpdir()

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 02:23:34PM +0200, Laurenz Albe wrote:
> Bossart, Nathan wrote:
>> Alright, here's an updated patch.
> 
> Looks, good; marking as "ready for committer".

Like Tom, I think it is less dirty to use the two-function approach.
Committed this way with a catalog version bump.
--
Michael




Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Andres Freund
Hi,

On 2018-10-05 08:29:29 +0900, Michael Paquier wrote:
> On Thu, Oct 04, 2018 at 06:04:49PM -0400, Tom Lane wrote:
> > Alvaro Herrera  writes:
> >> I'm tied up in something else at the moment so can't spend more time on
> >> it, but I hope to have time to give it a look over the weekend.
> > 
> > Keep in mind that RC1 is scheduled to wrap Monday afternoon ...
> 
> ...  Which is little time to work on a complete fix.  Surely we could
> wait for GA for a fix?  I don't find it nice to release v11 with this bug.

Are you suggesting we fix after RC1, or delay RC1?  I'm not 100% sure
I'm parsing your sentence correctly.

Greetings,

Andres Freund



Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2018-10-04 Thread Thomas Munro
On Thu, Aug 3, 2017 at 1:41 PM Robert Haas  wrote:
> On Wed, Aug 2, 2017 at 9:33 PM, Peter Eisentraut
>  wrote:
> > On 8/2/17 16:52, Robert Haas wrote:
> >> I actually don't think it's that unreasonable to get notified when
> >> system-wide processes like the autovacuum launcher or the logical
> >> replication launcher start or stop.
> >
> > But we got rid of those start messages recently due to complaints.
>
> Yeah, true.  I'm just talking about what *I* think.  :-)

I still think the current situation is non-ideal.  I don't have a
strong view on whether some or all system-wide processes should say
hello and goodbye explicitly in the log, but I do think we need a way
to make that not look like an error condition, and ideally without
special cases for known built-in processes.

I looked into this a bit today, while debugging an extension that runs
a background worker.  Here are some assorted approaches to shutdown:

0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

1.  Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2.  Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
default this looks like "FATAL:  terminating connection due to
administrator command".  This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop.  For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

3.  Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour.  This includes the logical replication launcher.

A couple of thoughts:

*  Extensions that need to use die() (or something else that would be
compatible with CFI() wait loops) should ideally have a way to control
how ProcessInterrupts() reports this to the user, since shutdown is an
expected condition for long-lived processes.

*  We really should get rid of that "exited with exit code 1".

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 06:04:49PM -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
>> I'm tied up in something else at the moment so can't spend more time on
>> it, but I hope to have time to give it a look over the weekend.
> 
> Keep in mind that RC1 is scheduled to wrap Monday afternoon ...

...  Which is little time to work on a complete fix.  Surely we could
wait for GA for a fix?  I don't find it nice to release v11 with this bug.
--
Michael




Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Tom Lane
Alvaro Herrera  writes:
> I'm tied up in something else at the moment so can't spend more time on
> it, but I hope to have time to give it a look over the weekend.

Keep in mind that RC1 is scheduled to wrap Monday afternoon ...

regards, tom lane



Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Alvaro Herrera
On 2018-Oct-03, Michael Paquier wrote:

> Okay.  I have spent more time on this issue, and I have been able to
> integrate a test in the existing event_trigger.sql which is able to
> reproduce the reported failure.  Attached is what I am finishing with.
> 
> I still want to do more testing on it, and the day is ending here.

I looked at this, and I think that this particular crash you're fixing
is just a symptom of a larger problem in the event trigger ALTER TABLE
handling -- I noticed this while playing a bit with
src/test/modules/test_ddl_deparse (modify sql/alter_table.sql to create
a partitioned table with nullable columns and a partition, then alter
the table to add a primary key).

I'm tied up in something else at the moment so can't spend more time on
it, but I hope to have time to give it a look over the weekend.

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: snprintf assert is broken by plpgsql #option dump

2018-10-04 Thread Tom Lane
I wrote:
> Pavel Stehule  writes:
>> I found two parts

> Thanks for the report, will push something.

On closer look, I'm not sure that these are the only places that are
assuming that any PLpgSQL_variable struct has a refname.  What seems
like a safer answer is to make sure they all do, more or less as
attached.

regards, tom lane

diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index 721234d..59460d2 100644
*** a/src/pl/plpgsql/src/pl_comp.c
--- b/src/pl/plpgsql/src/pl_comp.c
*** build_row_from_vars(PLpgSQL_variable **v
*** 1896,1901 
--- 1896,1903 
  
  	row = palloc0(sizeof(PLpgSQL_row));
  	row->dtype = PLPGSQL_DTYPE_ROW;
+ 	row->refname = "(unnamed row)";
+ 	row->lineno = -1;
  	row->rowtupdesc = CreateTemplateTupleDesc(numvars, false);
  	row->nfields = numvars;
  	row->fieldnames = palloc(numvars * sizeof(char *));
diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 574234d..4552638 100644
*** a/src/pl/plpgsql/src/pl_exec.c
--- b/src/pl/plpgsql/src/pl_exec.c
*** exec_stmt_call(PLpgSQL_execstate *estate
*** 2205,2210 
--- 2205,2211 
  
  			row = palloc0(sizeof(*row));
  			row->dtype = PLPGSQL_DTYPE_ROW;
+ 			row->refname = "(unnamed row)";
  			row->lineno = -1;
  			row->varnos = palloc(sizeof(int) * FUNC_MAX_ARGS);
  
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index b59869a..68e399f 100644
*** a/src/pl/plpgsql/src/pl_gram.y
--- b/src/pl/plpgsql/src/pl_gram.y
*** decl_cursor_args :
*** 613,618 
--- 613,619 
  
  		new = palloc0(sizeof(PLpgSQL_row));
  		new->dtype = PLPGSQL_DTYPE_ROW;
+ 		new->refname = "(unnamed row)";
  		new->lineno = plpgsql_location_to_lineno(@1);
  		new->rowtupdesc = NULL;
  		new->nfields = list_length($2);
*** read_into_scalar_list(char *initial_name
*** 3526,3531 
--- 3527,3533 
  
  	row = palloc0(sizeof(PLpgSQL_row));
  	row->dtype = PLPGSQL_DTYPE_ROW;
+ 	row->refname = "(unnamed row)";
  	row->lineno = plpgsql_location_to_lineno(initial_location);
  	row->rowtupdesc = NULL;
  	row->nfields = nfields;
*** make_scalar_list1(char *initial_name,
*** 3560,3565 
--- 3562,3568 
  
  	row = palloc0(sizeof(PLpgSQL_row));
  	row->dtype = PLPGSQL_DTYPE_ROW;
+ 	row->refname = "(unnamed row)";
  	row->lineno = lineno;
  	row->rowtupdesc = NULL;
  	row->nfields = 1;
diff --git a/src/pl/plpgsql/src/plpgsql.h b/src/pl/plpgsql/src/plpgsql.h
index 4a4c7cb..f6c35a5 100644
*** a/src/pl/plpgsql/src/plpgsql.h
--- b/src/pl/plpgsql/src/plpgsql.h
*** typedef struct PLpgSQL_var
*** 326,332 
   * Note that there's no way to name the row as such from PL/pgSQL code,
   * so many functions don't need to support these.
   *
!  * refname, isconst, notnull, and default_val are unsupported (and hence
   * always zero/null) for a row.  The member variables of a row should have
   * been checked to be writable at compile time, so isconst is correctly set
   * to false.  notnull and default_val aren't applicable.
--- 326,337 
   * Note that there's no way to name the row as such from PL/pgSQL code,
   * so many functions don't need to support these.
   *
!  * That also means that there's no real name for the row variable, so we
!  * conventionally set refname to "(unnamed row)".  We could leave it NULL,
!  * but it's too convenient to be able to assume that refname is valid in
!  * all variants of PLpgSQL_variable.
!  *
!  * isconst, notnull, and default_val are unsupported (and hence
   * always zero/null) for a row.  The member variables of a row should have
   * been checked to be writable at compile time, so isconst is correctly set
   * to false.  notnull and default_val aren't applicable.


Re: snprintf assert is broken by plpgsql #option dump

2018-10-04 Thread Tom Lane
Pavel Stehule  writes:
> There is a new assert
> Assert(strvalue != NULL);
> Probably all refname usages inside the plpgsql dump functions have a
> problem with it.

This isn't so much a "new assert" as a modeling of the fact that some
printf implementations dump core on a null string pointer, and have done
so since the dawn of time.

Now that we only use snprintf.c in HEAD, it'd be possible to consider
modeling glibc's behavior instead, ie instead of the Assert do

if (strvalue == NULL)
strvalue = "(null)";

I do not think this would be a good idea though, at least not till all
release branches that *don't* always use snprintf.c are out of support.
Every assert failure that we find here is a live bug in the back
branches, even if it happens not to occur on $your-favorite-platform.

Even once that window elapses, I'd not be especially on board with
snprintf.c papering over such cases.  They're bugs really.

> I found two parts

Thanks for the report, will push something.

regards, tom lane



Re: executor relation handling

2018-10-04 Thread Tom Lane
Amit Langote  writes:
> I've rebased the remaining patches.  I broke down one of the patches into
> 2 and re-ordered the patches as follows:

> 0001: introduces a function that opens range table relations and maintains
> them in an array indexed by RT index

> 0002: introduces a new field in EState that's an array of RangeTblEntry
> pointers and revises macros used in the executor that access RTEs to
> return them from the array (David Rowley co-authored this one)

I've pushed 0001 and 0002 with mostly cosmetic changes.  One thing I
wanted to point out explicitly, though, is that I found this bit of 0002
to be a seriously bad idea:

--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -20,6 +20,7 @@
 #include "executor/instrument.h"
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
+#include "nodes/parsenodes.h"
 #include "nodes/plannodes.h"
 #include "utils/hsearch.h"
 #include "utils/queryenvironment.h"

Please do not add #includes of fundamental headers to other fundamental
headers without clearing it with somebody.  There's little enough
structure to our header collection now.  I don't want to end up in a
situation where effectively the entire header set gets pulled into
every .c file, or worse that we have actual reference loops in the
headers.  (This is not an academic problem; somebody actually created
such a loop awhile back.  Cleaning it up, by the time we'd recognized
the problem, was really painful.)

> 0003: moves result relation and ExecRowMark initialization out of InitPlan
> and into ExecInit* routines of respective nodes

I am finding myself pretty unconvinced by this one; it seems like mostly
a random reallocation of responsibility with little advantage.  The
particular thing that brought me to a screeching halt was seeing that
the patch removed ExecFindRowMark, despite the fact that that's part
of our advertised FDW API (see fdwhandler.sgml), and it didn't provide
any alternative way for an FDW to find out at runtime whether it's
subject to a row locking requirement.

I thought for a minute about just leaving the function in place, but
that wouldn't work because both nodeLockRows and nodeModifyTable are
written so that they find^H^H^Hbuild their rowmarks only after recursing
to initialize their child plan nodes; so a child node that tried to use
ExecFindRowMark during ExecInitNode would get the wrong answer.  Of
course, we could consider changing the order of operations during
initialization of those node types, but I'm not really seeing a compelling
reason why we should whack things around that much.

So I'm inclined to just omit 0003.  AFAICS this would only mean that
we couldn't drop the global PlanRowMarks list from PlannedStmt, which
does not bother me much.

> 0005: teaches planner to remove PlanRowMarks corresponding to dummy relations

I'm not entirely sold on the value of that either?

regards, tom lane



snprintf assert is broken by plpgsql #option dump

2018-10-04 Thread Pavel Stehule
Hi

Today I found the plpgsql_check tests broken against PostgreSQL 12.  After the command

create or replace function ml_trg()
returns trigger as $$
#option dump
declare
begin
  if TG_OP = 'INSERT' then
if NEW.status_from IS NULL then
  begin
-- performance issue only
select status into NEW.status_from
   from pa
  where pa_id = NEW.pa_id;
-- nonexist target value
select status into NEW.status_from_xxx
   from pa
  where pa_id = NEW.pa_id;
  exception
when DATA_EXCEPTION then
  new.status_from := 'DE';
  end;
end if;
  end if;
  if TG_OP = 'DELETE' then return OLD; else return NEW; end if;
exception
  when OTHERS then
NULL;
if TG_OP = 'DELETE' then return OLD; else return NEW; end if;
end;
$$ language plpgsql;

I got this backtrace:
Program received signal SIGABRT, Aborted.
0x7f2c140e653f in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f2c140e653f in raise () from /lib64/libc.so.6
#1  0x7f2c140d0895 in abort () from /lib64/libc.so.6
#2  0x008b7903 in ExceptionalCondition
(conditionName=conditionName@entry=0xb00d3b "!(strvalue != ((void *)0))",
errorType=errorType@entry=0x906034 "FailedAssertion",
fileName=fileName@entry=0xb00d30 "snprintf.c",
lineNumber=lineNumber@entry=674) at assert.c:54
#3  0x008ff368 in dopr (target=target@entry=0x7fff4ff6aad0,
format=0x7f2bfe4b0de3 " fields",
format@entry=0x7f2bfe4b0dda "ROW %-16s fields",
args=args@entry=0x7fff4ff6ab18)
at snprintf.c:674
#4  0x008ff6b7 in pg_vfprintf (stream=,
fmt=fmt@entry=0x7f2bfe4b0dda
"ROW %-16s fields",
args=args@entry=0x7fff4ff6ab18) at snprintf.c:261
#5  0x008ff7ff in pg_printf (fmt=fmt@entry=0x7f2bfe4b0dda "ROW
%-16s fields") at snprintf.c:286
#6  0x7f2bfe4a7c67 in plpgsql_dumptree (func=func@entry=0x26811e8) at
pl_funcs.c:1676
#7  0x7f2bfe49a136 in do_compile (fcinfo=fcinfo@entry=0x7fff4ff6afa0,
procTup=procTup@entry=0x7f2bfe4ba100, function=0x26811e8,
function@entry=0x0, hashkey=hashkey@entry=0x7fff4ff6ad20,
forValidator=forValidator@entry=true) at pl_comp.c:794
#8  0x7f2bfe49a27a in plpgsql_compile (fcinfo=fcinfo@entry=0x7fff4ff6afa0,
forValidator=forValidator@entry=true) at pl_comp.c:224
#9  0x7f2bfe497126 in plpgsql_validator (fcinfo=) at
pl_handler.c:504
#10 0x008c03df in OidFunctionCall1Coll
(functionId=functionId@entry=13392,
collation=collation@entry=0, arg1=arg1@entry=40960)
at fmgr.c:1418
#11 0x00563444 in ProcedureCreate (procedureName=,
procNamespace=procNamespace@entry=2200,
replace=, returnsSet=,
returnType=, proowner=16384, languageObjectId=13393,
languageValidator=13392,
prosrc=0x264feb8 "\n#option

There is a new assert

Assert(strvalue != NULL);

Probably all refname usages inside the plpgsql dump functions have a
problem with it.

I found two parts:

diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c
index b93f866223..d97997c001 100644
--- a/src/pl/plpgsql/src/pl_funcs.c
+++ b/src/pl/plpgsql/src/pl_funcs.c
@@ -1519,7 +1519,7 @@ dump_execsql(PLpgSQL_stmt_execsql *stmt)
dump_ind();
printf("INTO%s target = %d %s\n",
   stmt->strict ? " STRICT" : "",
-  stmt->target->dno, stmt->target->refname);
+  stmt->target->dno, stmt->target->refname ? stmt->target->refname : "null");
}
dump_indent -= 2;
 }
@@ -1673,7 +1673,7 @@ plpgsql_dumptree(PLpgSQL_function *func)
PLpgSQL_row *row = (PLpgSQL_row *) d;
int i;

-   printf("ROW %-16s fields", row->refname);
+   printf("ROW %-16s fields", row->refname ? row->refname
: "null");
for (i = 0; i < row->nfields; i++)
{
printf(" %s=var %d", row->fieldnames[i],

Regards

Pavel


Re: executor relation handling

2018-10-04 Thread Tom Lane
Robert Haas  writes:
> On Thu, Oct 4, 2018 at 3:28 PM Tom Lane  wrote:
>> What we've determined so far in this thread is that workers *do* get
>> their own locks (or did before yesterday), but I'd been supposing that
>> that was accidental not intentional.

> Nope, that was intentional.

Fair enough --- in which case, the patch series we're working on here
was broken to suppose that it could get away with removing that.

As I said, I'll make sure the locking is back before I finish with this.

regards, tom lane



Re: Procedure calls are not tracked in pg_stat_user_functions / track_functions

2018-10-04 Thread Andres Freund
Hi,

On 2018-10-04 12:15:28 -0700, Lukas Fittl wrote:
> Hi all,
> 
> It seems that currently procedures do not get tracked when track_functions
> is enabled, which means one needs to resort to other workarounds in order
> to monitor procedure calls/runtime.
> 
> To illustrate:
> 
> =# SHOW track_functions;
> ┌─────────────────┐
> │ track_functions │
> ├─────────────────┤
> │ all             │
> └─────────────────┘
> (1 row)
> 
> =# CALL abc();
> CALL
> 
> =# SELECT def();
> ┌─────┐
> │ def │
> ├─────┤
> │     │
> └─────┘
> (1 row)
> 
> =# SELECT * FROM pg_stat_user_functions;
> ┌─[ RECORD 1 ]─┬────────┐
> │ funcid       │ 75223  │
> │ schemaname   │ public │
> │ funcname     │ def    │
> │ calls        │ 1      │
> │ total_time   │ 3.222  │
> │ self_time    │ 3.222  │
> └──────────────┴────────┘
> 
> Was this intentional, or an oversight?
> 
> If welcome, I would be happy to work on a patch. Whilst slightly confusing
> in terms of naming, we could just track this together with functions, since
> one can always join with pg_proc to determine whether something is a
> function or a procedure.

Yea, that sounds wrong / not ideal to me.  I think we should just fix
this; it should be easy enough.
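
For reference, the join mentioned in the quoted text could look
something like this (a sketch; pg_proc.prokind distinguishes functions
from procedures as of v11 -- 'p' marks a procedure):

SELECT s.schemaname, s.funcname, s.calls, s.total_time,
       CASE p.prokind WHEN 'p' THEN 'procedure' ELSE 'function' END AS kind
  FROM pg_stat_user_functions s
  JOIN pg_proc p ON p.oid = s.funcid;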

- Andres



Re: Poor plan when using EXISTS in the expression list

2018-10-04 Thread Pierre Ducroquet
On Thursday, October 4, 2018 4:46:26 PM CEST Geoff Winkless wrote:
> On Thu, 4 Oct 2018 at 13:11, Pierre Ducroquet  wrote:
> > Our developpers ORM (Django's) sadly can not use EXISTS in the where
> > clauses
> > without having it in the expression part of the SELECT statement.
> 
> I don't know if this will be helpful to you (and I appreciate there's still
> the underlying PG issue), but there's a suggestion here that you can work
> around this using .extra()
> 
> https://stackoverflow.com/a/38880144/321161

Sure, this helps when you know the trap and don't use the Exists support in
Django, but this still means any developer using Django may create a query that
is a bit slow on small volumes and blows up on big volumes. We sadly cannot
monitor every piece of code written by developers or imported in the
dependencies.





Re: executor relation handling

2018-10-04 Thread Robert Haas
On Thu, Oct 4, 2018 at 3:28 PM Tom Lane  wrote:
> I'm possibly confused, but I thought that the design of parallel query
> involved an expectation that workers didn't need to get their own locks.

You are, indeed, confused.  A heck of a lot of effort went into making
sure that the workers COULD take their own locks, and into trying to
make sure that didn't break anything.  That effort may or may not have
been entirely successful, but I'm pretty sure that having them NOT
take locks is going to be a lot worse.

> What we've determined so far in this thread is that workers *do* get
> their own locks (or did before yesterday), but I'd been supposing that
> that was accidental not intentional.

Nope, that was intentional.

> In any case, I definitely intend that they will be getting their own
> locks again after the dust has settled.  Panic not.

/me unloads metaphorical bazooka.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: executor relation handling

2018-10-04 Thread Andres Freund
On 2018-10-04 12:34:44 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-10-04 15:27:59 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > I've not really followed this thread, and just caught up to here.  It
> > > seems entirely unacceptable to not acquire locks on workers to me.
> > > Maybe I'm missing something, but why do/did the patches in this thread
> > > require that / introduce that? We didn't have that kind of concept
> > > before, no?  The group locking stuff should rely / require that kind of
> > > thing, no?
> >
> > I'm possibly confused, but I thought that the design of parallel query
> > involved an expectation that workers didn't need to get their own
> > locks.
>
> Not as far as I'm aware of - but I'm not exactly the expert
> there. There's an exception that some lock classes don't conflict
> between the leader and the workers - that's group locking
> (a1c1af2a1f60). But the locks still have to be acquired, and I think
> it's quite dangerous not to do so.  The group locking logic is required
> because otherwise it'd be trivial to get into deadlocks, and some of the
> restrictions around parallel query are required to make that safe.

Re-read docs + code just to make sure.  Here's the relevant readme parts:

src/backend/access/transam/README.parallel

To prevent unprincipled deadlocks when running in parallel mode, this code
also arranges for the leader and all workers to participate in group
locking.  See src/backend/storage/lmgr/README for more details.


src/backend/storage/lmgr/README:

Group Locking
-

As if all of that weren't already complicated enough, PostgreSQL now supports
parallelism (see src/backend/access/transam/README.parallel), which means that
we might need to resolve deadlocks that occur between gangs of related
processes rather than individual processes.  This doesn't change the basic
deadlock detection algorithm very much, but it makes the bookkeeping more
complicated.

We choose to regard locks held by processes in the same parallel group as
non-conflicting.  This means that two processes in a parallel group can hold a
self-exclusive lock on the same relation at the same time, or one process can
acquire an AccessShareLock while the other already holds AccessExclusiveLock.
This might seem dangerous and could be in some cases (more on that below), but
if we didn't do this then parallel query would be extremely prone to
self-deadlock.  For example, a parallel query against a relation on which the
leader already had AccessExclusiveLock would hang, because the workers would
try to lock the same relation and be blocked by the leader; yet the leader
can't finish until it receives completion indications from all workers.  An
undetected deadlock results.  This is far from the only scenario where such a
problem happens.  The same thing will occur if the leader holds only
AccessShareLock, the worker seeks AccessShareLock, but between the time the
leader attempts to acquire the lock and the time the worker attempts to
acquire it, some other process queues up waiting for an AccessExclusiveLock.
In this case, too, an indefinite hang results.

...


So yes, locks are expected to be acquired in workers.
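To see this from SQL, here is a minimal sketch (assuming a table named "big"
is being scanned by a parallel query in another session; the table name is
illustrative only):

-- Each parallel worker shows up with its own pid and its own lock entry;
-- leader and workers hold non-conflicting group locks on the same relation.
SELECT pid, locktype, relation::regclass AS rel, mode, granted
FROM pg_locks
WHERE relation = 'big'::regclass;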

Greetings,

Andres Freund



Re: executor relation handling

2018-10-04 Thread Andres Freund
Hi,

On 2018-10-04 15:27:59 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I've not really followed this thread, and just caught up to here.  It
> > seems entirely unacceptable to not acquire locks on workers to me.
> > Maybe I'm missing something, but why do/did the patches in this thread
> > require that / introduce that? We didn't have that kind of concept
> > before, no?  The group locking stuff should rely / require that kind of
> > thing, no?
> 
> I'm possibly confused, but I thought that the design of parallel query
> involved an expectation that workers didn't need to get their own
> locks.

Not as far as I'm aware of - but I'm not exactly the expert
there. There's an exception that some lock classes don't conflict
between the leader and the workers - that's group locking
(a1c1af2a1f60). But the locks still have to be acquired, and I think
it's quite dangerous not to do so.  The group locking logic is required
because otherwise it'd be trivial to get into deadlocks, and some of the
restrictions around parallel query are required to make that safe.


> What we've determined so far in this thread is that workers *do* get
> their own locks (or did before yesterday), but I'd been supposing that
> that was accidental not intentional.

I don't think it was accidental.

Greetings,

Andres Freund



Re: executor relation handling

2018-10-04 Thread Tom Lane
Andres Freund  writes:
> I've not really followed this thread, and just caught up to here.  It
> seems entirely unacceptable to not acquire locks on workers to me.
> Maybe I'm missing something, but why do/did the patches in this thread
> require that / introduce that? We didn't have that kind of concept
> before, no?  The group locking stuff should rely / require that kind of
> thing, no?

I'm possibly confused, but I thought that the design of parallel query
involved an expectation that workers didn't need to get their own locks.
What we've determined so far in this thread is that workers *do* get
their own locks (or did before yesterday), but I'd been supposing that
that was accidental not intentional.

In any case, I definitely intend that they will be getting their own
locks again after the dust has settled.  Panic not.

regards, tom lane



Procedure calls are not tracked in pg_stat_user_functions / track_functions

2018-10-04 Thread Lukas Fittl
Hi all,

It seems that currently procedures do not get tracked when track_functions
is enabled, which means one needs to resort to other workarounds in order
to monitor procedure calls/runtime.

To illustrate:

=# SHOW track_functions;
┌─────────────────┐
│ track_functions │
├─────────────────┤
│ all             │
└─────────────────┘
(1 row)

=# CALL abc();
CALL

=# SELECT def();
┌─────┐
│ def │
├─────┤
│     │
└─────┘
(1 row)

=# SELECT * FROM pg_stat_user_functions;
┌─[ RECORD 1 ]┬────────┐
│ funcid      │ 75223  │
│ schemaname  │ public │
│ funcname    │ def    │
│ calls       │ 1      │
│ total_time  │ 3.222  │
│ self_time   │ 3.222  │
└─────────────┴────────┘

Was this intentional, or an oversight?

If welcome, I would be happy to work on a patch. Whilst slightly confusing
in terms of naming, we could just track this together with functions, since
one can always join with pg_proc to determine whether something is a
function or a procedure.
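
For illustration, a minimal sketch of such a join, using pg_proc.prokind from
PostgreSQL 11 ('f' marks a normal function, 'p' a procedure):

SELECT s.schemaname, s.funcname, s.calls, s.total_time, s.self_time,
       CASE p.prokind WHEN 'p' THEN 'procedure' ELSE 'function' END AS kind
FROM pg_stat_user_functions s
JOIN pg_proc p ON p.oid = s.funcid;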

Thanks,
Lukas

-- 
Lukas Fittl


Re: executor relation handling

2018-10-04 Thread Andres Freund
Hi,

On 2018-10-03 16:16:11 -0400, Tom Lane wrote:
> I wrote:
> > Amit Langote  writes:
> >> Should this check that we're not in a parallel worker process?
> 
> > Hmm.  I've not seen any failures in the parallel parts of the regular
> > regression tests, but maybe I'd better do a force_parallel_mode
> > run before committing.
> > In general, I'm not on board with the idea that parallel workers don't
> > need to get their own locks, so I don't really want to exclude parallel
> > workers from this check.  But if it's not safe for that today, fixing it
> > is beyond the scope of this particular patch.
> 
> So the place where that came out in the wash is the commit I just made
> (9a3cebeaa) to change the executor from taking table locks to asserting
> that somebody else took them already.  To make that work, I had to make
> both ExecOpenScanRelation and relation_open skip checking for lock-held
> if IsParallelWorker().
> 
> This makes me entirely uncomfortable with the idea that parallel workers
> can be allowed to not take any locks of their own.  There is no basis
> for arguing that we have field proof that that's safe, because *up to
> now, parallel workers in fact did take their own locks*.  And it seems
> unsafe on its face, because there's nothing that really guarantees that
> the parent process won't go away while children are still running.
> (elog(FATAL) seems like a counterexample, for instance.)

> I think that we ought to adjust parallel query to insist that children
> do take locks, and then revert the IsParallelWorker() exceptions I made
> here.  I plan to leave that point in abeyance till we've got the rest
> of these changes in place, though.  The easiest way to do it will
> doubtless change once we've centralized the executor's table-opening
> logic, so trying to code it right now seems like a waste of effort.

I've not really followed this thread, and just caught up to here.  It
seems entirely unacceptable to not acquire locks on workers to me.
Maybe I'm missing something, but why do/did the patches in this thread
require that / introduce that? We didn't have that kind of concept
before, no?  The group locking stuff should rely / require that kind of
thing, no?

Greetings,

Andres Freund



Changes in Brazil DST's period

2018-10-04 Thread Emílio B . Pedrollo
Today Brazil's president announced that the DST for all our time zones
would start at November 18th instead of November 4th
There's already a thread at IANA about it
https://mm.icann.org/pipermail/tz/2018-October/026921.html


*Emílio B. Pedrollo*
Full Stack Developer

+55 (55) 99134-7922


Re: Early WIP/PoC for inlining CTEs

2018-10-04 Thread David Fetter
On Thu, Oct 04, 2018 at 11:22:32AM +0200, Andreas Karlsson wrote:
> On 10/03/2018 05:57 PM, David Fetter wrote:
> >Is there any meaningful distinction between "inlining," by which I
> >mean converting to a subquery, and predicate pushdown, which
> >would happen at least for a first cut, at the rewrite stage?
> 
> Sorry, but I do not think I understand your question. The ability to push
> down predicates is just one of the potential benefits from inlining.

Oracle appears to have such a distinction, and it appears we don't.
https://medium.com/@hakibenita/be-careful-with-cte-in-postgresql-fca5e24d2119
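
To make the distinction concrete, a small sketch (hypothetical table t; in
PostgreSQL 11 and earlier the CTE form acts as an optimization fence, while the
subquery form lets the planner push the predicate down):

-- CTE form: the CTE is evaluated as written, the filter is applied afterwards
WITH cte AS (SELECT * FROM t) SELECT * FROM cte WHERE id = 42;

-- Subquery form: the planner pushes id = 42 down into the scan of t
SELECT * FROM (SELECT * FROM t) sub WHERE id = 42;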

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: TupleTableSlot abstraction

2018-10-04 Thread Amit Khandekar
I have only done the two changes below so far. After doing that and
rebasing onto the latest master, I got crashes in the regression tests, and I
suspect the reason is that I used a virtual tuple slot for the
destination slot of execute_attr_map_slot(). I am analyzing it. I am
attaching the patches (v12) anyway to give you an idea of how I have
handled the two items below.

On Wed, 26 Sep 2018 at 05:09, Andres Freund  wrote:
>
> On 2018-09-04 18:35:34 +0530, Amit Khandekar wrote:
> > The attached v11 tar has the above set of changes.
>
> - I've pushed 0003 - although that commit seems to have included a lot
>   of things unrelated to the commit message. I guess two patches have
>   accidentally been merged?  Could you split out the part of the changes
>   that was mis-squashed?

Yeah, it indeed looks like it had unrelated things, mostly the changes
that moved the slot attribute functions into execTuples.c.  I have
included this change as the very first patch in the patch series.


>> From 280eac5c6758061d0a46b7ef9dd324e9a23226b5 Mon Sep 17 00:00:00 2001
>> From: Ashutosh Bapat 
>> Date: Fri, 31 Aug 2018 10:53:42 +0530
>> Subject: [PATCH 08/14] Change tuple table slot creation routines to suite
>>  tuple table slot abstraction
>>
>> This change requires ExecInitResultTupleSlotTL, ExecInitScanTupleSlot,
>> ExecCreateScanSlotFromOuterPlan, ExecInitNullTupleSlot,
>> ExecInitExtraTupleSlot, MakeTupleTableSlot, ExecAllocTableSlot to
>> accept TupleTableSlotType as a new parameter. Change all their calls.
>>
>> Ashutosh Bapat and Andres Freund
>>
>> This by itself won't compile. Neither the tuple table slot abstraction
>> patch would compile and work without this change. Both of those need
>> to be committed together.
>
> I don't like this kind of split - all commits should individually
> compile. I think we should instead introduce dummy / empty structs for
>  etc, and add the parameters necessary to pass them
> through. And then move this patch to *before* the "core" abstract slot
> patch.  That way every commit, but the super verbose stuff is still
> split out.
>

Done. Moved this patch before the core one. In this patch, just
declared the global variables of type TupleTableSlotOps, without
initializing them.

I have tried to make sure the individual patches compile successfully by
shuffling the changes around into the right patches, but there is one
particular patch that still gives an error.  I will fix that later.

I will handle the other review comments in the next patch series.


pg_abstract_tts_patches_v12.tar.gz
Description: GNU Zip compressed data


Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Alvaro Herrera
On 2018-Oct-04, Alvaro Herrera wrote:

> I admit I'm surprised that your patch fixes the bug.  sql_drop was added
> before the command-stashing for pg_event_trigger_ddl_commands was added,
> and sql_drop only processes objects from the list passed to
> performMultipleDeletions, so adding the EventTriggerAlterTableStart() /
> End() calls should not affect it ... evidently I must be missing
> something here.

I think the explanation for this is that the problem has nothing to do
with sql_drop per se -- it's only that having a sql_drop trigger causes
the event trigger stuff to get invoked, and the bogus code involving
ddl_command_end (the one that's affected by the
EventTriggerAlterTableStart dance) is what crashes.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Dmitry Dolgov
> On Wed, 3 Oct 2018 at 09:53, Michael Paquier  wrote:
>
> On Fri, Sep 28, 2018 at 12:17:00PM +0900, Michael Paquier wrote:
> > I think that Alvaro should definitely look at this patch to be sure, or
> > I could do it, but I would need to spend way more time on this and check
> > event trigger interactions.
> >
> > Anyway, I was struggling a bit regarding the location where adding a
> > regression test.  event_trigger.sql makes the most sense but in tests
> > for drops the objects are created before the event trigger is defined,
> > so that would need to move around so as the original problem is
> > reproducible.  Perhaps you have an idea for that?
>
> Okay.  I have spent more time on this issue, and I have been able to
> integrate a test in the existing event_trigger.sql which is able to
> reproduce the reported failure.  Attached is what I am finishing with.
>
> I still want to do more testing on it, and the day is ending here.
>
> Thoughts?

Sorry, I couldn't answer your previous message since I was away. So far I don't
see any problems with your proposed patch.

> On Thu, 4 Oct 2018 at 17:22, Alvaro Herrera  wrote:
>
> I admit I'm surprised that your patch fixes the bug.  sql_drop was added
> before the command-stashing for pg_event_trigger_ddl_commands was added,
> and sql_drop only processes objects from the list passed to
> performMultipleDeletions, so adding the EventTriggerAlterTableStart() /
> End() calls should not affect it ... evidently I must be missing
> something here.
>
> Still looking.

I also find another part of this problem strange, namely why we ended up doing
AlterTableInternal at all, since I assumed that after MergeAttributes all
attnotnull flags should be merged with OR. But, for example, these would work:

CREATE TABLE collections_list_1
PARTITION OF collections_list
FOR VALUES IN (1);

CREATE TABLE collections_list_1
PARTITION OF collections_list (key, ts, collection_id not null, value)
FOR VALUES IN (1);

Looks like in MergeAttributes at the branch:

if (is_partition && list_length(saved_schema) > 0)

we override not null property of ColumnDef:

if (coldef->is_from_parent)
{
coldef->is_not_null = restdef->is_not_null;

It looks a bit confusing, so I wonder whether that is how it should be.



Re: [HACKERS] Secondary index access optimizations

2018-10-04 Thread Konstantin Knizhnik



On 04.10.2018 12:19, David Rowley wrote:

On 4 October 2018 at 22:11, Konstantin Knizhnik
 wrote:

On 04.10.2018 06:19, David Rowley wrote:

Please, can you also add a test which tests this code which has a
partition with columns in a different order than it's parent. Having
an INT and a TEXT column is best as if the translations are done
incorrectly it's likely to result in a crash which will alert us to
the issue. It would be good to also verify the test causes a crash if
you temporarily put the code back to using the untranslated qual.

Thanks for working on this.


Thank you very much for detecting and fixing this problem.
I have checked that all plan changes caused by this fix are correct.
An updated version of the patch is attached.

Can you add the test that I mentioned above?


Will the following test be enough:

-- check that columns of the parent table are correctly mapped to the child
-- partition if their order doesn't match

create table paren (a int, b text) partition by range(a);
create table child_1 partition of paren for values from (0) to (10);
create table child_2 (b text, a int);
alter table paren attach partition child_2 for values from (10) to (20);
insert into paren values (generate_series(0,19), generate_series(100,119));

explain (costs off) select * from paren where a between 0 and 9;
explain (costs off) select * from paren where a between 10 and 20;
explain (costs off) select * from paren where a >= 5;
explain (costs off) select * from paren where a <= 15;

select count(*) from paren where a >= 5;
select count(*) from paren where a < 15;

drop table paren cascade;


--
If so, then the updated patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b..b628ac7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -37,6 +37,7 @@
 #include "optimizer/paths.h"
 #include "optimizer/plancat.h"
 #include "optimizer/planner.h"
+#include "optimizer/predtest.h"
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
@@ -1052,6 +1053,27 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 /* Restriction reduces to constant TRUE, so drop it */
 continue;
 			}
+
+			/*
+			 * For partitions, we may be able to eliminate some quals if
+			 * they're implied by the partition bound.
+			 */
+			if (childrel->partition_qual != NIL)
+			{
+Node	   *checkqual = copyObject(childqual);
+
+/*
+ * Since the partition_qual has all Vars stored as varno=1, we
+ * must convert all Vars of the childqual to have their varnos
+ * set to 1 so that predicate_implied_by can properly match
+ * implied quals.
+ */
+ChangeVarNodes(checkqual, childrel->relid, 1, 0);
+
+if (predicate_implied_by(list_make1(checkqual), childrel->partition_qual, false))
+	continue;
+			}
+
 			/* might have gotten an AND clause, if so flatten it */
 			foreach(lc2, make_ands_implicit((Expr *) childqual))
 			{
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3a..8cd9b06 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -450,7 +450,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	 */
 	if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		set_relation_partition_info(root, rel, relation);
-
+	else if (relation->rd_rel->relispartition)
+		rel->partition_qual = RelationGetPartitionQual(relation);
 	heap_close(relation, NoLock);
 
 	/*
diff --git a/src/common/zpq_stream.c b/src/common/zpq_stream.c
new file mode 100644
index 000..afd42e9
--- /dev/null
+++ b/src/common/zpq_stream.c
@@ -0,0 +1,386 @@
+#include "postgres_fe.h"
+#include "common/zpq_stream.h"
+#include "c.h"
+#include "pg_config.h"
+
+#if HAVE_LIBZSTD
+
+#include 
+#include 
+
+#define ZPQ_BUFFER_SIZE (8*1024)
+#define ZSTD_COMPRESSION_LEVEL 1
+
+struct ZpqStream
+{
+	ZSTD_CStream*  tx_stream;
+	ZSTD_DStream*  rx_stream;
+	ZSTD_outBuffer tx;
+	ZSTD_inBuffer  rx;
+	size_t tx_not_flushed; /* Amount of datas in internal zstd buffer */
+	size_t tx_buffered;/* Data which is consumed by zpq_read but not yet sent */
+	zpq_tx_functx_func;
+	zpq_rx_funcrx_func;
+	void*  arg;
+	char const*rx_error;/* Decompress error message */
+	size_t tx_total;
+	size_t tx_total_raw;
+	size_t rx_total;
+	size_t rx_total_raw;
+	char   tx_buf[ZPQ_BUFFER_SIZE];
+	char   rx_buf[ZPQ_BUFFER_SIZE];
+};
+
+ZpqStream*
+zpq_create(zpq_tx_func tx_func, zpq_rx_func rx_func, void *arg)
+{
+	ZpqStream* zs = (ZpqStream*)malloc(sizeof(ZpqStream));
+	zs->tx_stream = ZSTD_createCStream();
+	ZSTD_initCStream(zs->tx_stream, ZSTD_COMPRESSION_LEVEL);
+	

Re: Segfault when creating partition with a primary key and sql_drop trigger exists

2018-10-04 Thread Alvaro Herrera
I admit I'm surprised that your patch fixes the bug.  sql_drop was added
before the command-stashing for pg_event_trigger_ddl_commands was added,
and sql_drop only processes objects from the list passed to
performMultipleDeletions, so adding the EventTriggerAlterTableStart() /
End() calls should not affect it ... evidently I must be missing
something here.

Still looking.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Multiple primary key on partition table?

2018-10-04 Thread Alvaro Herrera
On 2018-Oct-01, Rajkumar Raghuwanshi wrote:

> On Tue, Sep 18, 2018 at 11:20 AM amul sul  wrote:
> 
> > Here is the complete patch proposes the aforesaid fix with regression test.
>
> Thanks, This worked for me.

Yeah, looks good to me, pushed.  I added one more regression test to
ensure that the PRIMARY KEY clause in the partition is still accepted if
the parent does not have one.

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: executor relation handling

2018-10-04 Thread Tom Lane
Amit Langote  writes:
> On 2018/10/04 5:16, Tom Lane wrote:
>> I think that we ought to adjust parallel query to insist that children
>> do take locks, and then revert the IsParallelWorker() exceptions I made
>> here.

> Maybe I'm missing something here, but isn't the necessary adjustment just
> that the relations are opened with locks if inside a parallel worker?

Yeah, that's one plausible way to fix it.  I hadn't wanted to prejudge
the best way before we finish the other changes, though.

> I've rebased the remaining patches.  I broke down one of the patches into
> 2 and re-ordered the patches as follows:

Thanks, will start looking at these today.

regards, tom lane



Re: SerializeParamList vs machines with strict alignment

2018-10-04 Thread Tom Lane
Amit Kapila  writes:
> All the affected members (gharial, chipmunk, anole) are happy.  It
> feels good to see chipmunk becoming green after so many days.

Yup.  I've marked this item fixed on the open-items list.

regards, tom lane



RE: [Proposal] Add accumulated statistics for wait event

2018-10-04 Thread Phil Florent
Hi,

It's the same logic as with any polling system. An integration calculation using
the Monte Carlo method with only a few points won't be accurate enough and can
even be completely wrong.
Polling is OK to troubleshoot a problem on the fly, but 2 points are not enough.
A few seconds are needed to obtain good enough data, e.g. 5-10 seconds of
polling with a 0.1 => 0.01 s interval between 2 queries of the activity.
Polling for a few seconds while the user is waiting is normally enough to say
whether a significant part of the waits are on the database. It's very important
to know that. With 1 hour of accumulated statistics, a DBA will always see
something to fix. But if the user waits 10 seconds on a particular screen and
1 second is spent on the database, it often won't directly help.
Polling gives great information with PostgreSQL 10, but it was already useful to
catch top queries etc. in older versions.
I always check whether activity is adequately reported by my tool using known
cases. I want to be sure it will report things adequately in real-world
troubleshooting sessions. Sometimes there are bugs in my tool; once there was
an issue with Postgres (pgstat_report_activity() was not called by workers in
parallel index creation).

Best regards
Phil

From: Michael Paquier
Sent: Thursday, October 4, 2018 12:58
To: Phil Florent
Cc: Yotsunaga, Naoki; Tomas Vondra; pgsql-hackers@lists.postgresql.org
Subject: Re: [Proposal] Add accumulated statistics for wait event

On Thu, Oct 04, 2018 at 09:32:37AM +, Phil Florent wrote:
> I am a DB beginner, so please tell me. It says that you can find
> events that are bottlenecks by sampling, but as you saw above, you cannot
> find events shorter than the sampling interval, right?

Yes, which is why it would be as simple as making the interval shorter,
still not so short that it bloats the amount of information fetched,
which needs to be stored and afterwards (perhaps) processed for analysis.
This gets rather close to signal processing.  A simple image: assume
that event A happens 100 times in an interval of 1s and event B only
once in the same interval of 1s; then if the snapshot interval is only
1s, in the worst case A would be treated as an equal of B, which would
be wrong.
--
Michael


Re: pg_ls_tmpdir()

2018-10-04 Thread Laurenz Albe
Bossart, Nathan wrote:
> >> AFAICT the cleanest way to do this in system_views.sql is to hard-code
> >> the pg_default tablespace OID in the DEFAULT expression.  So, it might
> >> be best to use the two function approach if we want pg_ls_tmpdir() to
> >> default to the pg_default tablespace.
> >
> > That would be pretty bletcherous, so +1 for just making two C functions.
> 
> Alright, here's an updated patch.

Looks good; marking as "ready for committer".

Yours,
Laurenz Albe
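
For reference, a usage sketch of the two proposed functions (the signatures are
an assumption based on the discussion above: a zero-argument form defaulting to
the pg_default tablespace, and a form taking a tablespace OID):

SELECT * FROM pg_ls_tmpdir();
SELECT * FROM pg_ls_tmpdir((SELECT oid FROM pg_tablespace
                            WHERE spcname = 'pg_default'));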




Poor plan when using EXISTS in the expression list

2018-10-04 Thread Pierre Ducroquet
Hello

Our developers' ORM (Django's) sadly cannot use EXISTS in the WHERE clause
without having it in the expression part of the SELECT statement.
I was expecting it to create queries performing a bit worse than queries
without this useless expression, but it turns out this triggers extremely
poor planning, with an additional Seq Scan of the table referenced in EXISTS.
Thus the query
    select a.*, exists (select * from b where a_id = a.id) from a
    where exists (select * from b where a_id = a.id);
can be orders of magnitude slower than
    select a.* from a where exists (select * from b where a_id = a.id);

This has been reproduced on PostgreSQL 9.6 and 11 beta4.

Example :

test=> create table a (id serial primary key, b text);   
CREATE TABLE

test=> create table b (id serial primary key, a_id integer not null references 
a(id), c text);  
CREATE TABLE

test=> explain select a.* from a where exists (select * from b  where a_id = 
a.id); 
                        QUERY PLAN
----------------------------------------------------------
 Hash Join  (cost=29.50..62.60 rows=635 width=36)
   Hash Cond: (a.id = b.a_id)
   ->  Seq Scan on a  (cost=0.00..22.70 rows=1270 width=36)
   ->  Hash  (cost=27.00..27.00 rows=200 width=4)
 ->  HashAggregate  (cost=25.00..27.00 rows=200 width=4)
   Group Key: b.a_id
   ->  Seq Scan on b  (cost=0.00..22.00 rows=1200 width=4)
(7 rows)

test=> explain select a.*, exists (select * from b where a_id = a.id) from a;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on a  (cost=0.00..5314.37 rows=1270 width=37)
   SubPlan 1
 ->  Seq Scan on b  (cost=0.00..25.00 rows=6 width=0)
   Filter: (a_id = a.id)
   SubPlan 2
 ->  Seq Scan on b b_1  (cost=0.00..22.00 rows=1200 width=4)
(6 rows)

test=> explain select a.*, exists (select * from b where a_id = a.id)  from a 
where exists (select * from b  where a_id = a.id); 
                        QUERY PLAN
----------------------------------------------------------
 Hash Join  (cost=29.50..2708.43 rows=635 width=37)
   Hash Cond: (a.id = b.a_id)
   ->  Seq Scan on a  (cost=0.00..22.70 rows=1270 width=36)
   ->  Hash  (cost=27.00..27.00 rows=200 width=4)
 ->  HashAggregate  (cost=25.00..27.00 rows=200 width=4)
   Group Key: b.a_id
   ->  Seq Scan on b  (cost=0.00..22.00 rows=1200 width=4)
   SubPlan 1
 ->  Seq Scan on b b_1  (cost=0.00..25.00 rows=6 width=0)
   Filter: (a_id = a.id)
   SubPlan 2
 ->  Seq Scan on b b_2  (cost=0.00..22.00 rows=1200 width=4)
(12 rows)


Thanks

 Pierre






Possible important data point on stats collection, wondering about possible improvement

2018-10-04 Thread Chris Travers
Hi;

System is PostgreSQL 10.5, all partitioning done the old way (via
inheritance).

Last month we had some performance issues caused by statistics being out of
date and the planner choosing the wrong index for a large number of
queries.  The proximal fix was to increase the stats target from 1000 to
1 and analyze, which prevents the problem from continuing to manifest.
Looking back at the graphs though I notice a number of things which make me
think that maybe there might be ways of improving the situation without
increasing stats targets.

I wanted to bring up the question here and see if there were opportunities
to work together on improving the situation.

What I measured was the difference between the maximum value in the
statistics histogram and the maximum value in the table.  The relevant
field is a timestamp field, so calculating lag is straightforward and
gives us a percentage of the table that is outside the stats.
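
As a sketch of this measurement (hypothetical table "events" with timestamp
column "created_at"; pg_stats.histogram_bounds is anyarray, so it has to be
cast through text to a concrete array type):

SELECT (SELECT max(created_at) FROM events)
       - bounds[array_length(bounds, 1)] AS stats_lag
FROM (SELECT histogram_bounds::text::timestamp[] AS bounds
      FROM pg_stats
      WHERE tablename = 'events' AND attname = 'created_at') hb;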

What I noticed was that the major analytics tables seem to fall into two
groups:
1.  One group has actual clear sampling issues as evidenced by the fact
that the difference in values swung wildly around.  One of these tables had
52 million rows spread between two partitions (1M and 51M respectively).
On this group I understand the need to set stats targets up.

2.  A second group saw a milder increase in lag between the max value
recorded in stats and the max value in the db.  However, what struck me about
this was that the lag seemed fairly linear.  In other words, the
fluctuations were within a relatively narrow range and the lag seemed to
grow linearly with time.  These tables were actually larger (one typical
one was 60M rows split between a partition of 51M rows and one of 9M rows).

The workload for the database in question is heavily update driven so I
would expect fewer sampling bias problems than might happen for insert-only
workloads.

The second case puzzles me.  I have been looking carefully into how the
stats collector works and I cannot find anything that could account for a
near-linear increase in statistics missing recent data.  What starts out in a
60M row table with default stats targets (1000) ends up going about 10% off
over time.

What I am wondering is whether it would make any sense whatsoever to expand
the stats to include min and max values found in a scan, or whether it
would make more sense to try to help the planner extrapolate from existing
stats in a more efficient way.

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


pg_upgrade failed with ERROR: null relpartbound for relation 18159 error.

2018-10-04 Thread Rajkumar Raghuwanshi
Hi,

I am getting ERROR:  null relpartbound for relation 18159 while doing
pg_upgrade from v11 to v11/master.

-- user-defined operator class in partition key
CREATE FUNCTION my_int4_sort(int4,int4) RETURNS int LANGUAGE sql
  AS $$ SELECT CASE WHEN $1 = $2 THEN 0 WHEN $1 > $2 THEN 1 ELSE -1 END; $$;
CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING btree AS
  OPERATOR 1 < (int4,int4), OPERATOR 2 <= (int4,int4),
  OPERATOR 3 = (int4,int4), OPERATOR 4 >= (int4,int4),
  OPERATOR 5 > (int4,int4), FUNCTION 1 my_int4_sort(int4,int4);
CREATE TABLE partkey_t (a int4) PARTITION BY RANGE (a test_int4_ops);
CREATE TABLE partkey_t_1 PARTITION OF partkey_t FOR VALUES FROM (0) TO
(1000);
INSERT INTO partkey_t VALUES (100);
INSERT INTO partkey_t VALUES (200);

-- running pg_upgrade failed with the error below.
pg_restore: creating TABLE "public.partitioned"
pg_restore: creating TABLE "public.partkey_t"
pg_restore: creating TABLE "public.partkey_t_1"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 485; 1259 18159 TABLE
partkey_t_1 edb
pg_restore: [archiver (db)] could not execute query: ERROR:  null
relpartbound for relation 18159
CONTEXT:  SQL function "my_int4_sort"
Command was:
-- For binary upgrade, must preserve pg_type oid
SELECT
pg_catalog.binary_upgrade_set_next_pg_type_oid('18161'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT
pg_catalog.binary_upgrade_set_next_array_pg_type_oid('18160'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT
pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('18159'::pg_catalog.oid);

CREATE TABLE "public"."partkey_t_1" (
"a" integer
);

-- For binary upgrade, recreate inherited column.
UPDATE pg_catalog.pg_attribute
SET attislocal = false
WHERE attname = 'a'
  AND attrelid = '"public"."partkey_t_1"'::pg_catalog.regclass;

-- For binary upgrade, set up inheritance and partitioning this way.
ALTER TABLE ONLY "public"."partkey_t" ATTACH PARTITION
"public"."partkey_t_1" FOR VALUES FROM (0) TO (1000);

-- For binary upgrade, set heap's relfrozenxid and relminmxid
UPDATE pg_catalog.pg_class
SET relfrozenxid = '1915', relminmxid = '1'
WHERE oid = '"public"."partkey_t_1"'::pg_catalog.regclass;

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Function to promote standby servers

2018-10-04 Thread Laurenz Albe
Michael Paquier wrote:
> > In that vein, I propose a function pg_promote() to promote
> > physical standby servers.
> 
> No fundamental issues from me regarding the concept of being able to
> trigger a promotion remotely, so +1.  Do we want this capability as well
> for fallback_promote?  My gut tells me no, and that we ought to drop
> this option at some point in the future, still that's worth mentioning.
> 
> You may want to move PROMOTE_SIGNAL_FILE in a position where xlogfuncs.c
> could use it.

Good idea; updated patch attached.

Yours,
Laurenz Albe
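
A minimal usage sketch, assuming the patch below is applied (must be run by a
superuser on a standby; returns true once promotion has been initiated):

SELECT pg_promote();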
From 63ba6c6f8088bea138e924b4180a08086d339eb5 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Thu, 4 Oct 2018 13:24:03 +0200
Subject: [PATCH] Add pg_promote() to promote standby servers

---
 doc/src/sgml/func.sgml | 20 +++
 doc/src/sgml/high-availability.sgml|  2 +-
 doc/src/sgml/recovery-config.sgml  |  3 +-
 src/backend/access/transam/xlog.c  |  2 --
 src/backend/access/transam/xlogfuncs.c | 48 ++
 src/include/access/xlog.h  |  4 +++
 src/include/catalog/pg_proc.dat|  4 +++
 7 files changed, 79 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4331bebc96..7beeaeacde 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -18596,6 +18596,9 @@ SELECT set_config('log_statement_stats', 'off', false);

 pg_terminate_backend

+   
+pg_promote
+   
 

 signal
@@ -18655,6 +18658,16 @@ SELECT set_config('log_statement_stats', 'off', false);
 however only superusers can terminate superuser backends.

   
+  
+   
+pg_promote()
+
+   boolean
+   Promote a physical standby server.  This function can only be
+called by superusers and will only have an effect when run on
+a standby server.
+   
+  
  
 

@@ -18692,6 +18705,13 @@ SELECT set_config('log_statement_stats', 'off', false);
 subprocess.

 
+   
+pg_promote() sends a signal to the server that causes it
+to leave standby mode.  Since sending signals is by nature asynchronous,
+successful execution of the function does not guarantee that the server has
+already been promoted.
+   
+
   
 
   
diff --git a/doc/src/sgml/high-availability.sgml b/doc/src/sgml/high-availability.sgml
index 8cb77f85ec..6014817d9e 100644
--- a/doc/src/sgml/high-availability.sgml
+++ b/doc/src/sgml/high-availability.sgml
@@ -1472,7 +1472,7 @@ synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
 

 To trigger failover of a log-shipping standby server,
-run pg_ctl promote or create a trigger
+run pg_ctl promote, call pg_promote(), or create a trigger
 file with the file name and path specified by the trigger_file
 setting in recovery.conf. If you're planning to use
 pg_ctl promote to fail over, trigger_file is
diff --git a/doc/src/sgml/recovery-config.sgml b/doc/src/sgml/recovery-config.sgml
index 92825fdf19..d06cd0b08e 100644
--- a/doc/src/sgml/recovery-config.sgml
+++ b/doc/src/sgml/recovery-config.sgml
@@ -439,7 +439,8 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"'  # Windows
  
   Specifies a trigger file whose presence ends recovery in the
   standby.  Even if this value is not set, you can still promote
-  the standby using pg_ctl promote.
+  the standby using pg_ctl promote or calling
+  pg_promote().
   This setting has no effect if standby_mode is off.
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3025d0badb..bb422e9a13 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -81,8 +81,6 @@ extern uint32 bootstrap_data_checksum_version;
 /* File path names (all relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
 #define RECOVERY_COMMAND_DONE	"recovery.done"
-#define PROMOTE_SIGNAL_FILE		"promote"
-#define FALLBACK_PROMOTE_SIGNAL_FILE "fallback_promote"
 
 
 /* User-settable parameters */
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 9731742978..ab37b03dee 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -35,6 +35,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 
+#include 
 
 /*
  * Store label file and tablespace map during non-exclusive backups.
@@ -697,3 +698,50 @@ pg_backup_start_time(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(xtime);
 }
+
+/*
+ * Promote a standby server.
+ *
+ * A result of "true" means that promotion has been initiated.
+ */
+Datum
+pg_promote(PG_FUNCTION_ARGS)
+{
+	FILE *promote_file;
+
+	if (!superuser())
+		ereport(ERROR,
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("must be superuser to promote standby servers")));
+
+	if (!RecoveryInProgress() || !StandbyMode)
+		PG_RETURN_BOOL(false);
+
+	/* 

Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 10:06:06AM +0200, Daniel Gustafsson wrote:
> Yes, you are correct, the signalfuncs.h includes in 0001 are a rebase error
> from when I renamed the file.  They are not present in the v15 patch but got
> introduced in v16 when I clearly wasn’t caffeinated enough to rebase.

No problem, thanks for confirming.  I have worked a bit more on the
thing, reducing the headers of the new file to a bare minimum, and I
have committed 0001.
--
Michael




Re: [Proposal] Add accumulated statistics for wait event

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 09:32:37AM +, Phil Florent wrote:
> I am a DB beginner, so please tell me. It says that you can find
> events that are bottlenecks by sampling, but as you saw above, you cannot
> find events shorter than the sampling interval, right?

Yes, which is why it would be as simple as making the interval shorter,
still not so short that it bloats the amount of information fetched,
which needs to be stored and afterwards (perhaps) processed for analysis.
This gets rather close to signal processing.  A simple image: assume
that event A happens 100 times in an interval of 1s and event B only
once in the same interval of 1s; then if the snapshot interval is only
1s, in the worst case A would be treated as an equal of B, which would
be wrong.
--
Michael
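
As a concrete sketch of the sampling discussed here, a query of this shape can
be run repeatedly (for example with psql's \watch 0.1) and the samples
aggregated afterwards:

SELECT wait_event_type, wait_event, count(*) AS active_backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event;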




Re: libpq compression

2018-10-04 Thread Konstantin Knizhnik



On 01.10.2018 09:49, Michael Paquier wrote:

On Mon, Aug 20, 2018 at 06:00:39PM +0300, Konstantin Knizhnik wrote:

New version of the patch is attached: I removed the -Z options from pgbench and
psql and added a check that server and client implement the same
compression algorithm.

The patch had no reviews, and does not apply anymore, so it is moved to
next CF with waiting on author as status.
--
Michael


Rebased version of the patch is attached.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 0448c6b..790ac2e 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -864,6 +865,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8377,6 +8379,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
+
+#
 # Elf
 #
 
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 06d909e..45bb061 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1135,6 +1135,17 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  compression
+  
+  
+Request compression of libpq traffic. If the server supports compression, then all libpq messages sent both from client to server and
+vice versa will be compressed. Right now the compression algorithm is hardcoded: it is either zlib (default) or zstd (if Postgres was
+configured with the --with-zstd option). In both cases streaming mode is used.
+  
+  
+ 
+
  
   client_encoding
   
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index f0b2145..2330e54 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -92,6 +92,15 @@
such as COPY.
   
 
+  
+It is possible to compress protocol data to reduce traffic and speed up client-server interaction.
+Compression is especially useful for importing/exporting data to/from the database using the COPY command
+and for replication (both physical and logical). Compression can also reduce server response time
+for queries requesting larger amounts of data (for example, returning JSON, BLOBs, text, ...).
+Right now the compression algorithm is hardcoded: it is either zlib (default) or zstd (if Postgres was
+configured with the --with-zstd option). In both cases streaming mode is used.
+  
+
  
   Messaging Overview
 
@@ -263,6 +272,18 @@
  
 
  
+  CompressionOk
+  
+   
+ Server acknowledges the use of compression for the client-server communication protocol.
+ Compression can be requested by the client by including the "compression" option in the connection string.
+ Right now the compression algorithm is hardcoded, but in the future client and server may negotiate to
+ choose a proper compression algorithm.
+   
+  
+ 
+
+ 
   AuthenticationOk
   

@@ -3398,6 +3419,43 @@ AuthenticationSASLFinal (B)
 
 
 
+
+
+CompressionOk (B)
+
+
+
+
+
+
+
+Byte1('z')
+
+
+
+  Acknowledges use of compression for protocol data. After receiving this message, both server and client switch to compression mode
+  and exchange compressed messages.
+
+
+

RE: [Proposal] Add accumulated statistics for wait event

2018-10-04 Thread Phil Florent
Hi,
I am a DB beginner, so please tell me. It says that you can find events that
are bottlenecks by sampling, but as you saw above, you cannot find events
shorter than the sampling interval, right?

If an event occurs frequently and if it is reported in pg_stat_activity, you
will catch it again and again while sampling, no matter its duration.
Hence you just need to

  *   Sample the sessions and consider the active ones. You need to know if 
they are waiting (PostgreSQL now provides detailed wait events) or if they are 
on the CPU
  *   Optionally collect information on the system context at the time of 
sampling (CPU, memory...), it can be provided by many tools like psutil python 
library for example

If the client application itself provides information, it's even more
interesting. With something like
program/module/action/client_info/sofar/totalwork in application_name you are
able to focus directly on different kinds of activity. It can give you
information like "I/O waits are meaningful for my batch activity but not for
my OLTP activity; if my goal is to improve response time for end users I have
to consider that."

Best regards
Phil


From: Yotsunaga, Naoki
Sent: Thursday, October 4, 2018 10:31
To: 'Michael Paquier'; Phil Florent
Cc: Tomas Vondra; pgsql-hackers@lists.postgresql.org
Subject: RE: [Proposal] Add accumulated statistics for wait event

On Thu, July 26, 2018 at 1:25 AM, Michael Paquier wrote:
> Even if you have spiky workloads, sampling may miss those, but even with 
> adding counters for each event
> you would need to query the table holding the counters at an insane frequency 
> to be able to perhaps get
> something out of it as you need to do sampling of the counters as well to 
> extract deltas.

Hi, I was wondering why PostgreSQL does not have the wait event counts and
wait times that other databases such as Oracle offer, and while looking for
related threads I got to this thread.

I am a DB beginner, so please tell me. It says that you can find events that
are bottlenecks by sampling, but as seen above, you cannot find events
shorter than the sampling interval, right?
If such a short event occurs quite a number of times and amounts to a
considerable total time, can you solve this problem with sampling?
# I have asked, but I could not understand much of the discussion above and I
do not know whether such a case can exist.
Also, I think it could be solved by raising the sampling frequency, but then
the load will be high, right? I think this method is not very practical.

Moreover, I think that it is not implemented because sampling is
good, as described above, and because it affects performance.
How about making the feature switchable on/off, with off as the default?
Do you care about performance degradation during a bottleneck investigation?
When investigating a bottleneck, I think that it is better to find the cause
even at the expense of performance.
# If you can detect it with high-frequency sampling, I think it depends on
whether the sampling or the feature (the number of wait events and wait time)
causes the higher load.

Since I am a DB beginner, I may be saying strange things.
I would be glad if you could tell me.

-
Naoki Yotsunaga



Re: Early WIP/PoC for inlining CTEs

2018-10-04 Thread Andreas Karlsson

On 10/03/2018 05:57 PM, David Fetter wrote:

Is there any meaningful distinction between "inlining," by which I
mean converting to a subquery, and predicate pushdown, which
would happen at least for a first cut, at the rewrite stage?


Sorry, but I do not think I understand your question. The ability to 
push down predicates is just one of the potential benefits from inlining.


Andreas



Re: [HACKERS] Secondary index access optimizations

2018-10-04 Thread David Rowley
On 4 October 2018 at 22:11, Konstantin Knizhnik
 wrote:
> On 04.10.2018 06:19, David Rowley wrote:
>> Please, can you also add a test which tests this code which has a
>> partition with columns in a different order than it's parent. Having
>> an INT and a TEXT column is best as if the translations are done
>> incorrectly it's likely to result in a crash which will alert us to
>> the issue. It would be good to also verify the test causes a crash if
>> you temporarily put the code back to using the untranslated qual.
>>
>> Thanks for working on this.
>>
> Thank you very much for detecting and fixing this problem.
> I have checked that all changes in plan caused by this fix are correct.
> Updated version of the patch is attached.

Can you add the test that I mentioned above?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: New vacuum option to do only freezing

2018-10-04 Thread Masahiko Sawada
On Mon, Oct 1, 2018 at 7:20 PM Masahiko Sawada  wrote:
>
> Hi,
>
> Attached patch adds a new option FREEZE_ONLY to the VACUUM command. This
> option is the same as the FREEZE option except that it disables reclaiming dead
> tuples. That is, with this option vacuum does HOT-chain pruning,
> freezes live tuples and maintains both the visibility map and the free space
> map, but does not collect dead tuples and invokes neither heap vacuum
> nor index vacuum. This option will be useful if a user wants to prevent
> XID wraparound of a table as quickly as possible, especially when the table is
> quite large and about to hit XID wraparound. I think this use case was
> mentioned in some threads but I couldn't find them.
>
> Currently this patch just adds the new option to VACUUM command but it
> might be good to make autovacuum use it when emergency vacuum is
> required.
>
> This is a performance-test result for FREEZE option and FREEZE_ONLY
> option. I've tested them on the table which is about 3.8GB table
> without indexes and randomly modified.
>
> * FREEZE
> INFO:  aggressively vacuuming "public.pgbench_accounts"
> INFO:  "pgbench_accounts": removed 5 row versions in 8 pages
> INFO:  "pgbench_accounts": found 5 removable, 3000 nonremovable
> row versions in 491804 out of 491804 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 722
> There were 0 unused item pointers.
> Skipped 0 pages due to buffer pins, 0 frozen pages.
> 0 pages are entirely empty.
> CPU: user: 4.20 s, system: 16.47 s, elapsed: 50.28 s.
> VACUUM
> Time: 50301.262 ms (00:50.301)
>
> * FREEZE_ONLY
> INFO:  aggressively vacuuming "public.pgbench_accounts"
> INFO:  "pgbench_accounts": found 4 removable, 3000 nonremovable
> row versions in 491804 out of 491804 pages
> DETAIL:  freeze 3000 rows
> There were 0 unused item pointers.
> Skipped 0 pages due to buffer pins, 0 frozen pages.
> 0 pages are entirely empty.
> CPU: user: 3.10 s, system: 14.85 s, elapsed: 44.56 s.
> VACUUM
> Time: 44589.794 ms (00:44.590)
>
> Feedback is very welcome.
>

Added to the next commit fest.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
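
For reference, a usage sketch of the proposed option (the parenthesized option
syntax is an assumption, by analogy with VACUUM's existing option list):

VACUUM (FREEZE_ONLY, VERBOSE) pgbench_accounts;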



Re: [HACKERS] Secondary index access optimizations

2018-10-04 Thread Konstantin Knizhnik



On 04.10.2018 06:19, David Rowley wrote:

On 12 September 2018 at 08:32, Konstantin Knizhnik
 wrote:

Also the patch proposed by you is much simpler and does mostly the same. Yes,
it does not cover CHECK constraints,

I started to look at this and found a problem in regards to varno
during the predicate_implied_by() test.  The problem is that the
partition bound is always stored as varno=1 (For example, see how
get_qual_for_list() calls makeVar()). This causes the patch to fail in
cases where the partitioned table is not varno=1. You're also
incorrectly using rinfo->clause to pass to predicate_implied_by().
This is a problem because the clauses stored here have not been translated to have
the child varattnos.  childqual is the correct thing to use as that's
just been translated. You may have not used it as the varnos will have
been converted to the child's varno, which will never be varno=1, so
you might have found that not to work due to the missing code to
change the varnos to 1.

I've attached the diff for allpaths.c (only) that I ended up with to
make it work. This causes the output of many other regression test to
change, so you'll need to go over these and verify everything is
correct again.

Please, can you also add a test which tests this code which has a
partition with columns in a different order than it's parent. Having
an INT and a TEXT column is best as if the translations are done
incorrectly it's likely to result in a crash which will alert us to
the issue. It would be good to also verify the test causes a crash if
you temporarily put the code back to using the untranslated qual.

Thanks for working on this.


Thank you very much for detecting and fixing this problem.
I have checked that all plan changes caused by this fix are correct.
An updated version of the patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5f74d3b..b628ac7 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -37,6 +37,7 @@
 #include "optimizer/paths.h"
 #include "optimizer/plancat.h"
 #include "optimizer/planner.h"
+#include "optimizer/predtest.h"
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
@@ -1052,6 +1053,27 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 /* Restriction reduces to constant TRUE, so drop it */
 continue;
 			}
+
+			/*
+			 * For partitions, we may be able to eliminate some quals if
+			 * they're implied by the partition bound.
+			 */
+			if (childrel->partition_qual != NIL)
+			{
+Node	   *checkqual = copyObject(childqual);
+
+/*
+ * Since the partition_qual has all Vars stored as varno=1, we
+ * must convert all Vars of the childqual to have their varnos
+ * set to 1 so that predicate_implied_by can properly match
+ * implied quals.
+ */
+ChangeVarNodes(checkqual, childrel->relid, 1, 0);
+
+if (predicate_implied_by(list_make1(checkqual), childrel->partition_qual, false))
+	continue;
+			}
+
 			/* might have gotten an AND clause, if so flatten it */
 			foreach(lc2, make_ands_implicit((Expr *) childqual))
 			{
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 8369e3a..8cd9b06 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -450,7 +450,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	 */
 	if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		set_relation_partition_info(root, rel, relation);
-
+	else if (relation->rd_rel->relispartition)
+		rel->partition_qual = RelationGetPartitionQual(relation);
 	heap_close(relation, NoLock);
 
 	/*
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 4f29d9f..67d7a41 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -1772,30 +1772,26 @@ explain (costs off) select * from list_parted where a is not null;
 -------------------------------
  Append
    ->  Seq Scan on part_ab_cd
-         Filter: (a IS NOT NULL)
    ->  Seq Scan on part_ef_gh
-         Filter: (a IS NOT NULL)
    ->  Seq Scan on part_null_xy
          Filter: (a IS NOT NULL)
-(7 rows)
+(5 rows)
 
 explain (costs off) select * from list_parted where a in ('ab', 'cd', 'ef');
                         QUERY PLAN                         
 ----------------------------------------------------------
  Append
    ->  Seq Scan on part_ab_cd
-         Filter: ((a)::text = ANY ('{ab,cd,ef}'::text[]))
    ->  Seq Scan on part_ef_gh
          Filter: ((a)::text = ANY ('{ab,cd,ef}'::text[]))
-(5 rows)
+(4 rows)
 
 explain (costs off) select * from list_parted where a = 'ab' or a in (null, 'cd');
-  QUERY PLAN   

Re: speeding up planning with partitions

2018-10-04 Thread Amit Langote
Imai-san,

On 2018/10/04 17:11, Imai, Yoshikazu wrote:
> Hi, Amit!
> 
> On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote:
>> Attached is what I have at the moment.
> 
> I have also done a code review of the patch.

Thanks a lot for your review.  I'm working on updating the patches as
mentioned upthread and should be able to post a new version sometime next
week.  I will consider your comments along with David's and Dilip's before
posting the new patch.

Regards,
Amit




RE: [Proposal] Add accumulated statistics for wait event

2018-10-04 Thread Yotsunaga, Naoki
On Thu, July 26, 2018 at 1:25 AM, Michael Paquier wrote:
> Even if you have spiky workloads, sampling may miss those, but even
> with adding counters for each event you would need to query the table
> holding the counters at an insane frequency to be able to perhaps get
> something out of it, as you need to do sampling of the counters as
> well to extract deltas.

Hi, I was wondering why PostgreSQL does not provide the number of wait events
and the accumulated wait time, as other databases such as Oracle do, and while
looking for related threads I arrived at this one.

I am a DB beginner, so please tell me.  It is said that bottleneck events can
be found by sampling, but as noted above, you cannot find events shorter than
the sampling interval, right?
If such a short event occurred quite a number of times and added up to a
considerable amount of time in total, could sampling catch that?
# I ask because I could not understand much of the discussion above, and I do
not know whether such a case can actually exist.
Also, I think this could be addressed by raising the sampling frequency, but
then the load would be high, right?  That method does not seem very practical.
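
For reference, the sampling approach under discussion amounts to polling
pg_stat_activity at some interval and aggregating what is seen; a rough
sketch (the polling interval and aggregation are my own choice):

-- Run this once per sampling interval (e.g. every second) and accumulate
-- the counts; any wait that starts and ends between two polls is never seen.
SELECT wait_event_type, wait_event, count(*) AS samples
  FROM pg_stat_activity
 WHERE wait_event IS NOT NULL
 GROUP BY wait_event_type, wait_event
 ORDER BY samples DESC;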
 
Moreover, I suppose this is not implemented because sampling is considered
good enough, as described above, and because counters would affect performance.
How about making the feature switchable on/off and implementing it with the
default off?
Do we really care about performance degradation during a bottleneck
investigation?  When investigating a bottleneck, I think it is better to find
the cause even at the expense of some performance.
# If high-frequency sampling can detect such events, then I think it comes
down to which is the heavier load: the sampling, or the feature (counting
wait events and wait time).
 
Since I am a DB beginner, I may be saying strange things.
I would be glad if you could tell me.

-
Naoki Yotsunaga




Re: partition tree inspection functions

2018-10-04 Thread Michael Paquier
On Thu, Oct 04, 2018 at 04:53:02PM +0900, Amit Langote wrote:
> As mentioned in my other reply, that might be considered as asking the
> user to know inner details like relkind.  Also, if a database has many
> partitioned tables with lots of partitions, the pg_class join might get
> expensive.  OTOH, computing and returning it with other fields of
> pg_partition_tree is essentially free.

So it seems that I am clearly outvoted here ;)

Okay, let's do as you folks propose.
--
Michael




RE: speeding up planning with partitions

2018-10-04 Thread Imai, Yoshikazu
Hi, Amit!

On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote:
> Attached is what I have at the moment.

I have also done a code review of the patch.
I could only see v3-0001.patch so far, so everything below is about v3-0001.patch.

I am new to the inheritance/partitioning code and to code review, so my review
below might contain some mistakes. If it does, please point them out kindly :)


v3-0001:
1. Is there any reason for inheritance_make_rel_from_joinlist to return the
"parent" that is passed to it? I think the caller can refer to parent directly
even if inheritance_make_rel_from_joinlist returns nothing.

+static RelOptInfo *
+inheritance_make_rel_from_joinlist(PlannerInfo *root,
+                                   RelOptInfo *parent,
+                                   List *joinlist)
+{
...
+   return parent;
+}

2. Is there a possible case where IS_DUMMY_REL(child_joinrel) is true in
inheritance_adjust_scanjoin_target()?

+inheritance_adjust_scanjoin_target(PlannerInfo *root,
...
+{
...
+   foreach(lc, root->inh_target_child_rels)
+   {
...
+       /*
+        * If the child was pruned, its corresponding joinrel will be
+        * empty, which we can ignore safely.
+        */
+       if (IS_DUMMY_REL(child_joinrel))
+           continue;

I think inheritance_make_rel_from_joinlist() doesn't make a dummy joinrel
for a child that was pruned.

+inheritance_make_rel_from_joinlist(PlannerInfo *root,
...
+{
...
+   foreach(lc, root->append_rel_list)
+   {
+       RelOptInfo *childrel;
...
+       /* Ignore excluded/pruned children. */
+       if (IS_DUMMY_REL(childrel))
+           continue;
...
+       /*
+        * Save child joinrel to be processed later in
+        * inheritance_adjust_scanjoin_target, which adjusts its paths to
+        * be able to emit expressions in query's top-level target list.
+        */
+       root->inh_target_child_rels = lappend(root->inh_target_child_rels,
+                                             childrel);
+   }
+}

3.
Is the "root->parse->commandType != CMD_INSERT" check required in:

@@ -2018,13 +1514,45 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
...
+   /*
+    * For UPDATE/DELETE on an inheritance parent, we must generate and
+    * apply scanjoin target based on targetlist computed using each
+    * of the child relations.
+    */
+   if (parse->commandType != CMD_INSERT &&
+       current_rel->reloptkind == RELOPT_BASEREL &&
+       current_rel->relid == root->parse->resultRelation &&
+       root->simple_rte_array[current_rel->relid]->inh)
...

@@ -2137,92 +1665,123 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
    final_rel->fdwroutine = current_rel->fdwroutine;
 
...
-   foreach(lc, current_rel->pathlist)
+   if (current_rel->reloptkind == RELOPT_BASEREL &&
+       current_rel->relid == root->parse->resultRelation &&
+       !IS_DUMMY_REL(current_rel) &&
+       root->simple_rte_array[current_rel->relid]->inh &&
+       parse->commandType != CMD_INSERT)


I think the condition in the above if clause could be just
"current_rel->relid == root->parse->resultRelation && parse->commandType !=
CMD_INSERT", because otherwise elog() is called in query_planner and the
planner never reaches the above if clause.
Of course, there is the case where query_planner returns a dummy joinrel and
elog() is not called, but in that case current_rel->relid may be 0
(current_rel is the dummy joinrel) while root->parse->resultRelation may be
non-zero (when the query is an INSERT).

4. Can't we use a defined value (IS_PARTITIONED, or something like
IS_INHERITANCE?) to identify inheritance and partitioned tables in the code
below? It was a little confusing to me which code relates to inheritance and
which to partitioning when reading the code. (A hypothetical sketch follows
the quoted code below.)

In make_one_rel():
+   if (root->parse->resultRelation &&
+       root->simple_rte_array[root->parse->resultRelation]->inh)
+   {
...
+       rel = inheritance_make_rel_from_joinlist(root, targetrel, joinlist);

In inheritance_make_rel_from_joinlist():
+   if (childrel->part_scheme != NULL)
+       childrel =
+           inheritance_make_rel_from_joinlist(root, childrel,
+                                              translated_joinlist);
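
Purely as an illustration of that suggestion, a hypothetical macro (the name
and exact form are my assumption, not part of the patch):

/* Hypothetical helper mirroring the test quoted above: is the given RT
 * index the query's inheritance (or partitioned-table) parent? */
#define IS_INHERITANCE_TARGET(root, rti) \
    ((rti) != 0 && (root)->simple_rte_array[(rti)]->inh)

make_one_rel() could then test
IS_INHERITANCE_TARGET(root, root->parse->resultRelation) instead of spelling
the condition out each time.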


I couldn't review inheritance_adjust_scanjoin_target() deeply, because it is
difficult for me to fully understand the code related to join processing.

--
Yoshikazu Imai



Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-10-04 Thread Daniel Gustafsson
> On 4 Oct 2018, at 09:59, Michael Paquier  wrote:
> 
> On Wed, Oct 03, 2018 at 12:09:54PM +1300, Thomas Munro wrote:
>> It looks like you missed another case that needs tolerance for late
>> signal delivery on Windows:
>> 
>> +select pg_cancel_backend(pg_backend_pid());
>> 
>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.15263
> 
> Looking at 0001, why are the declarations needed in patch 0002 part of
> 0001 (see signalfuncs.h)?  I think that something like the attached is
> instead enough for this part.  Daniel, could you confirm?

Yes, you are correct, the signalfuncs.h includes in 0001 are a rebase error
from when I renamed the file.  They are not present in the v15 patch but got
introduced in v16 when I clearly wasn’t caffeinated enough to rebase.

cheers ./daniel


Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-10-04 Thread Michael Paquier
On Wed, Oct 03, 2018 at 12:09:54PM +1300, Thomas Munro wrote:
> It looks like you missed another case that needs tolerance for late
> signal delivery on Windows:
> 
> +select pg_cancel_backend(pg_backend_pid());
> 
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.15263

Looking at 0001, why are the declarations needed in patch 0002 part of
0001 (see signalfuncs.h)?  I think that something like the attached is
instead enough for this part.  Daniel, could you confirm?
--
Michael
diff --git a/src/backend/storage/ipc/Makefile b/src/backend/storage/ipc/Makefile
index 9dbdc26c9b..bf4619d5fd 100644
--- a/src/backend/storage/ipc/Makefile
+++ b/src/backend/storage/ipc/Makefile
@@ -8,8 +8,8 @@ subdir = src/backend/storage/ipc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = barrier.o dsm_impl.o dsm.o ipc.o ipci.o latch.o pmsignal.o procarray.o \
-	procsignal.o  shmem.o shmqueue.o shm_mq.o shm_toc.o sinval.o \
-	sinvaladt.o standby.o
+OBJS = signalfuncs.o barrier.o dsm_impl.o dsm.o ipc.o ipci.o latch.o \
+	pmsignal.o procarray.o procsignal.o shmem.o shmqueue.o shm_mq.o \
+	shm_toc.o sinval.o sinvaladt.o standby.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/signalfuncs.c b/src/backend/storage/ipc/signalfuncs.c
new file mode 100644
index 00..c09a047127
--- /dev/null
+++ b/src/backend/storage/ipc/signalfuncs.c
@@ -0,0 +1,216 @@
+/*-
+ *
+ * signalfuncs.c
+ *	  Functions for signalling backends
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/ipc/signalfuncs.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include 
+
+#include "catalog/pg_authid.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "postmaster/syslogger.h"
+#include "storage/ipc.h"
+#include "storage/pmsignal.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "utils/acl.h"
+#include "utils/builtins.h"
+
+/*
+ * Send a signal to another backend.
+ *
+ * The signal is delivered if the user is either a superuser or the same
+ * role as the backend being signaled. For "dangerous" signals, an explicit
+ * check for superuser needs to be done prior to calling this function.
+ *
+ * Returns 0 on success, 1 on general failure, 2 on normal permission error
+ * and 3 if the caller needs to be a superuser.
+ *
+ * In the event of a general failure (return code 1), a warning message will
+ * be emitted. For permission errors, doing that is the responsibility of
+ * the caller.
+ */
+#define SIGNAL_BACKEND_SUCCESS 0
+#define SIGNAL_BACKEND_ERROR 1
+#define SIGNAL_BACKEND_NOPERMISSION 2
+#define SIGNAL_BACKEND_NOSUPERUSER 3
+static int
+pg_signal_backend(int pid, int sig)
+{
+	PGPROC	   *proc = BackendPidGetProc(pid);
+
+	/*
+	 * BackendPidGetProc returns NULL if the pid isn't valid; but by the time
+	 * we reach kill(), a process for which we get a valid proc here might
+	 * have terminated on its own.  There's no way to acquire a lock on an
+	 * arbitrary process to prevent that. But since so far all the callers of
+	 * this mechanism involve some request for ending the process anyway, that
+	 * it might end on its own first is not a problem.
+	 */
+	if (proc == NULL)
+	{
+		/*
+		 * This is just a warning so a loop-through-resultset will not abort
+		 * if one backend terminated on its own during the run.
+		 */
+		ereport(WARNING,
+(errmsg("PID %d is not a PostgreSQL server process", pid)));
+		return SIGNAL_BACKEND_ERROR;
+	}
+
+	/* Only allow superusers to signal superuser-owned backends. */
+	if (superuser_arg(proc->roleId) && !superuser())
+		return SIGNAL_BACKEND_NOSUPERUSER;
+
+	/* Users can signal backends they have role membership in. */
+	if (!has_privs_of_role(GetUserId(), proc->roleId) &&
+		!has_privs_of_role(GetUserId(), DEFAULT_ROLE_SIGNAL_BACKENDID))
+		return SIGNAL_BACKEND_NOPERMISSION;
+
+	/*
+	 * Can the process we just validated above end, followed by the pid being
+	 * recycled for a new process, before reaching here?  Then we'd be trying
+	 * to kill the wrong thing.  Seems near impossible when sequential pid
+	 * assignment and wraparound is used.  Perhaps it could happen on a system
+	 * where pid re-use is randomized.  That race condition possibility seems
+	 * too unlikely to worry about.
+	 */
+
+	/* If we have setsid(), signal the backend's whole process group */
+#ifdef HAVE_SETSID
+	if (kill(-pid, sig))
+#else
+	if (kill(pid, sig))
+#endif
+	{
+		/* Again, just a warning to allow loops */
+		ereport(WARNING,
+(errmsg("could not send signal to process %d: %m", pid)));
+		return SIGNAL_BACKEND_ERROR;
+	}
+	return SIGNAL_BACKEND_SUCCESS;
+}
+
+/*
+ * Signal to cancel a backend process.  This is allowed if you are 

Re: partition tree inspection functions

2018-10-04 Thread Amit Langote
On 2018/10/04 9:27, Michael Paquier wrote:
> On Wed, Oct 03, 2018 at 08:12:59AM -0400, Jesper Pedersen wrote:
>> Removing isleaf would require extra round trips to the server to get
>> that information. So, I think we should keep it.
> 
> I don't really get your point about extra round trips with the server,
> and getting the same level of information is as simple as a join between
> the result set of pg_partition_tree() and pg_class (better to add schema
> qualification and aliases to relations by the way):
> =# SELECT relid::regclass,
>           parentrelid::regclass, level,
>           relkind != 'p' AS isleaf
>      FROM pg_partition_tree('ptif_test'::regclass), pg_class
>      WHERE oid = relid;
>     relid    | parentrelid | level | isleaf
> -------------+-------------+-------+--------
>  ptif_test   | null        |     0 | f
>  ptif_test0  | ptif_test   |     1 | f
>  ptif_test1  | ptif_test   |     1 | f
>  ptif_test2  | ptif_test   |     1 | t
>  ptif_test01 | ptif_test0  |     2 | t
>  ptif_test11 | ptif_test1  |     2 | t
> (6 rows)

As mentioned in my other reply, that might be considered as asking the
user to know inner details like relkind.  Also, if a database has many
partitioned tables with lots of partitions, the pg_class join might get
expensive.  OTOH, computing and returning it with other fields of
pg_partition_tree is essentially free.

Thanks,
Amit




Re: partition tree inspection functions

2018-10-04 Thread Amit Langote
On 2018/10/03 12:37, Michael Paquier wrote:
> On Mon, Oct 01, 2018 at 04:27:57PM +0900, Amit Langote wrote:
>> Yeah, maybe there is no reason to delay proceeding with
>> pg_partition_children which provides a useful functionality.
> 
> So, I have been looking at your patch, and there are a couple of things
> which could be improved.

Thanks for reviewing and updating the patch.

> Putting the new function pg_partition_children() in partition.c is a
> bad idea I think.  So instead I think that we should put that in a
> different location, say utils/adt/partitionfuncs.c.

Okay, sounds like a good idea.

> +   TupleDescInitEntry(tupdesc, (AttrNumber) 1, "relid",
> +  REGCLASSOID, -1, 0);
> +   TupleDescInitEntry(tupdesc, (AttrNumber) 2, "parentid",
> +  REGCLASSOID, -1, 0);
> REGCLASSOID is used mainly for casting, so instead let's use OIDOID like
> any other system function.

Check.

> +   values[2] = psprintf("%d", level);
> +   values[3] = psprintf("%c", relkind == RELKIND_PARTITIONED_TABLE ?
> +   'f' :
> +   't');
> Using Datum objects is more elegant in this context.

Agreed.  I think I'd just copied the psprintf code from some other function.

> isleaf is not particularly useful as it can just be fetched with a join
> on pg_class/relkind.  So I think that we had better remove it.

That's imposing a bit on users, requiring them to know about relkind, but
maybe that's okay.

> I have cleaned up the tests a bit, removing duplicates and most of the
> things which touched the size of relations, to have something more
> portable.

Thanks for that.

> We could have a flavor using a relation name in input with qualified
> names handled properly (see pg_get_viewdef_name for example), not sure
> if that's really mandatory so I left that out.

Having to always use the typecast (::regclass) may be a bit annoying, but
we can always add that version later.
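
For instance (illustrative only; the text-taking flavor shown commented out
does not exist in the patch):

-- with the function as posted, a cast is needed:
SELECT relid::regclass, parentrelid::regclass, level, isleaf
  FROM pg_partition_tree('ptif_test'::regclass);
-- a possible later flavor accepting a qualified name as text:
-- SELECT * FROM pg_partition_tree('public.ptif_test');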

> I have also added some
> comments here and there.  The docs could be worded a bit better still.
> 
> My result is the patch attached.  What do you think?

I looked at the updated patch and couldn't resist making some changes, which
you can see in the attached diff file.  Also attached is the updated patch.

Thanks,
Amit
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d41c09b68b..6dfa3dc977 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20209,12 +20209,13 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());

pg_partition_tree(oid)
setof record

-List information about a partition tree for the given partitioned
-table, consisting of one row for each partition in a tree.  The
-information available is the OID of the partition, the OID of its
-immediate partitioned table, and its level in the hierarchy,
-beginning at 0 for the top-most parent, and
-incremented by 1 for each level up.
+List information about table in a partition tree for a given
+partitioned table, which consists of one row for each partition and
+table itself.  Information provided includes the OID of the partition,
+the OID of its immediate parent, and its level in the hierarchy.
+The value of level begins at 0 for the input table
+in its role as the root of the partition tree, 1 for
+its partitions, 2 for their partitions, and so on.

   
  
diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
index fc0a904967..41c57cac2d 100644
--- a/src/backend/utils/adt/partitionfuncs.c
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -1,7 +1,7 @@
 /*-
  *
  * partitionfuncs.c
- *   Functions for accessing partitioning data
+ *   Functions for accessing partitioning related metadata
  *
  * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
diff --git a/src/test/regress/sql/partition_info.sql b/src/test/regress/sql/partition_info.sql
index 2ad75882b0..d9d96a284c 100644
--- a/src/test/regress/sql/partition_info.sql
+++ b/src/test/regress/sql/partition_info.sql
@@ -17,11 +17,24 @@ SELECT relid::regclass, parentrelid::regclass, level,
 pg_relation_size(relid) = 0 AS is_empty
   FROM pg_partition_tree('ptif_test'::regclass);
 
--- children of the main tree
-SELECT relid::regclass, parentrelid::regclass, level
-  FROM pg_partition_tree('ptif_test0'::regclass);
-SELECT relid::regclass, parentrelid::regclass, level
-  FROM pg_partition_tree('ptif_test01'::regclass);
+-- check that it works correctly even if it's passed other tables in the tree
+
+-- passing an intermediate level partitioned table
+SELECT relid::regclass, parentrelid::regclass, level, relkind <> 'p' AS isleaf
+  FROM 

Re: Skylake-S warning

2018-10-04 Thread Ants Aasma
On Thu, Oct 4, 2018 at 9:50 AM Adrien Nayrat  wrote:
>
> On 10/3/18 11:29 PM, Daniel Wood wrote:
> > If you are running benchmarks or are a customer currently impacted by
> > GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
> >
> >
> > Performance differences of nearly 2X can be seen on select-only pgbench
> > due to nothing else but unlucky choices for max_connections.  Scale 1000,
> > 192 local clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @
> > 2.50 GHz) system.
> > pgbench -S
>
> Could it be related to :
> https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
> ?


Unlikely.  I understood from Daniel's email that profiling shows a
different hot-spot.  In the cited .NET issue the problem was mostly due
to issuing PAUSE in a loop without attempting to grab the lock; in
PostgreSQL, PAUSE is issued only once per retry attempt.
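
To make the distinction concrete, a minimal sketch assuming an x86 target
(illustrative only; PostgreSQL's actual spinlock code lives in s_lock.h and
s_lock.c):

#include <stdatomic.h>
#include <emmintrin.h>          /* _mm_pause() */

/* PostgreSQL-style spinning: re-check the lock after every single PAUSE. */
static void
spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        _mm_pause();            /* one PAUSE, then retry the test-and-set */
}

/* The problematic pattern from the cited .NET issue: many PAUSEs in a row
 * before re-attempting the lock; Skylake's much higher PAUSE latency makes
 * this loop far longer than intended. */
static void
spin_acquire_pause_loop(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        for (int i = 0; i < 64; i++)
            _mm_pause();
}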

Regards,
Ants Aasma
--
PostgreSQL Senior Consultant
www.cybertec-postgresql.com

Austria (HQ), Wiener Neustadt  |  Switzerland, Zürich  |  Estonia,
Tallinn  |  Uruguay, Montevideo
Facebook: www.fb.com/cybertec.postgresql
Twitter: www.twitter.com/PostgresSupport



Re: executor relation handling

2018-10-04 Thread Amit Langote
On 2018/10/04 5:16, Tom Lane wrote:
> I wrote:
>> Amit Langote  writes:
>>> Should this check that we're not in a parallel worker process?
> 
>> Hmm.  I've not seen any failures in the parallel parts of the regular
>> regression tests, but maybe I'd better do a force_parallel_mode
>> run before committing.
>> In general, I'm not on board with the idea that parallel workers don't
>> need to get their own locks, so I don't really want to exclude parallel
>> workers from this check.  But if it's not safe for that today, fixing it
>> is beyond the scope of this particular patch.
> 
> So the place where that came out in the wash is the commit I just made
> (9a3cebeaa) to change the executor from taking table locks to asserting
> that somebody else took them already.

Thanks for getting that done.

> To make that work, I had to make
> both ExecOpenScanRelation and relation_open skip checking for lock-held
> if IsParallelWorker().

Yeah, I had to do that too when rebasing the remaining patches.

> This makes me entirely uncomfortable with the idea that parallel workers
> can be allowed to not take any locks of their own.  There is no basis
> for arguing that we have field proof that that's safe, because *up to
> now, parallel workers in fact did take their own locks*.  And it seems
> unsafe on its face, because there's nothing that really guarantees that
> the parent process won't go away while children are still running.
> (elog(FATAL) seems like a counterexample, for instance.)
> 
> I think that we ought to adjust parallel query to insist that children
> do take locks, and then revert the IsParallelWorker() exceptions I made
> here.

Maybe I'm missing something here, but isn't the necessary adjustment just
that the relations are opened with locks if inside a parallel worker?

>  I plan to leave that point in abeyance till we've got the rest
> of these changes in place, though.  The easiest way to do it will
> doubtless change once we've centralized the executor's table-opening
> logic, so trying to code it right now seems like a waste of effort.

Okay.

I've rebased the remaining patches.  I broke down one of the patches into
2 and re-ordered the patches as follows:

0001: introduces a function that opens range table relations and maintains
them in an array indexed by RT index (a simplified sketch follows this list)

0002: introduces a new field in EState that's an array of RangeTblEntry
pointers and revises macros used in the executor that access RTEs to
return them from the array (David Rowley co-authored this one)

0003: moves result relation and ExecRowMark initialization out of InitPlan
and into ExecInit* routines of respective nodes

0004: removes useless fields from certain planner nodes whose only purpose
has been to assist the executor lock relations in proper order

0005: teaches planner to remove PlanRowMarks corresponding to dummy relations

Thanks,
Amit
From ebef0d923ea8a1d5e458c9a60845bad68904cb52 Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Thu, 27 Sep 2018 16:14:41 +1200
Subject: [PATCH v12 1/5] Revise executor range table relation locking and
 opening/closing

All requests to open range table relations in the executor now go
through ExecRangeTableRelation(), which consults an array of Relation
pointers indexed by RT index, an arrangement which allows the executor
to heap_open any given range table relation only once.  Relations are
closed by ExecEndPlan() instead of ExecEndNode.

This required revising the PartitionedRelPruneInfo node to contain the
partitioned table's RT index instead of its OID.  With that change,
ExecCreatePartitionPruneState can use the ExecRangeTableRelation
described above.
---
 contrib/postgres_fdw/postgres_fdw.c   |  4 --
 src/backend/executor/README   |  4 +-
 src/backend/executor/execMain.c   | 61 +--
 src/backend/executor/execPartition.c  | 34 ++---
 src/backend/executor/execUtils.c  | 60 +++---
 src/backend/executor/nodeAppend.c |  6 ---
 src/backend/executor/nodeBitmapHeapscan.c |  7 
 src/backend/executor/nodeCustom.c |  4 --
 src/backend/executor/nodeForeignscan.c|  4 --
 src/backend/executor/nodeIndexonlyscan.c  |  7 
 src/backend/executor/nodeIndexscan.c  |  7 
 src/backend/executor/nodeMergeAppend.c|  6 ---
 src/backend/executor/nodeModifyTable.c|  4 +-
 src/backend/executor/nodeSamplescan.c |  5 ---
 src/backend/executor/nodeSeqscan.c|  7 
 src/backend/executor/nodeTidscan.c|  5 ---
 src/backend/nodes/copyfuncs.c |  2 +-
 src/backend/nodes/outfuncs.c  |  2 +-
 src/backend/nodes/readfuncs.c |  2 +-
 src/backend/optimizer/plan/setrefs.c  | 30 +++
 src/backend/partitioning/partprune.c  |  5 +--
 src/include/executor/execPartition.h  |  1 -
 src/include/executor/executor.h   |  2 +-
 src/include/nodes/execnodes.h |  2 +
 src/include/nodes/plannodes.h   

Re: Skylake-S warning

2018-10-04 Thread Adrien Nayrat
On 10/3/18 11:29 PM, Daniel Wood wrote:
> If you are running benchmarks or are a customer currently impacted by
> GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
> 
> 
> Performance differences of nearly 2X can be seen on select-only pgbench
> due to nothing else but unlucky choices for max_connections.  Scale 1000,
> 192 local clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @
> 2.50 GHz) system.
> pgbench -S
> 

Hi,

Could it be related to :
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
?




