Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Masahiko Sawada
On Wed, Nov 22, 2017 at 2:56 PM, Craig Ringer  wrote:
> On 22 November 2017 at 12:15, Kyotaro HORIGUCHI
>  wrote:
>>
>> At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier
>>  wrote in
>> 
>> > On Wed, Nov 22, 2017 at 11:49 AM, Craig Ringer 
>> > wrote:
>> > > On 20 November 2017 at 18:35, atorikoshi
>> > >  wrote:
>> > >> I put many queries into one transaction and made ReorderBuffer spill
>> > >> data to disk, and sent SIGKILL to postgres before the end of the
>> > >> transaction.
>> > >>
>> > >> After starting up postgres again, I observed the files spilled to
>> > >> disk weren't deleted.
>> > >
>> > > Since this can only happen  on crash exits, and the reorderbuffer data
>> > > is
>> > > useless after a decoding backend exits, why don't we just recursively
>> > > delete
>> > > the tree contents on Pg startup?
>> >
>> > +1. You would just need an extra step after say
>> > DeleteAllExportedSnapshotFiles() in startup.c. Looks saner to me to do
>> > so as well.
>>
>> The old files are being removed at startup by
>> StartupReorderBuffer.
>
>
> That seems to contradict the statement above, that "after starting up
> postgres again, I observed the files spilled to disk weren't deleted".
>
>>
>> At the time of assertion failure, the files are not of the
>> previous run, but they are created after reconnection from the
>> subscriber.
>
>
> Are you saying the problem files do not exist when we start up, but are
> created and then leaked after logical decoding resumes after an unclean
> startup?
>
> ... Yes, that does appear to be the case, per the original report:
>
> "7. After a while, we can see the files remaining.
>   (Immediately after starting the publisher, we cannot see these files.)"
>
> I was confused by "remaining". They're not remaining, they've been
> re-created.
>
> But if they're re-created, why are they not recreated correctly after an
> unclean shutdown? What persistent state is causing that? We should be
> clobbering saved reorder buffer temp files, snapshots, etc at startup. The
> slot state is pretty simple, it'd just be a bit behind.
>
> The key difference seems to be that we hard-kill the server so it can't
> write anything to clog. The xact is implicitly aborted, we never wrote any
> xlog record for a commit or abort. The problem is presumably with decoding
> xacts that were implicitly aborted by server crash, after we restart the
> server and resume decoding.
>
> The assertion failure reported is in ReorderBufferRestoreCleanup, which
> makes sense.
>
> Because there's no xlog record of the crash, we never set the buffer's
> final_lsn in ReorderBufferCommit or ReorderBufferAbort .

Yeah, that's the same as my analysis.

> Note the comment there:
>
>  * NB: Transactions handled here have to have actively aborted (i.e. have
>  * produced an abort record). Implicitly aborted transactions are handled
> via
>  * ReorderBufferAbortOld(); transactions we're just not interested in, but
>  * which have committed are handled in ReorderBufferForget().
>
> That's only called from DecodeStandbyOp in response to an xl_running_xacts.
>
> Here's the backtrace.
>
> Core was generated by `postgres: wal sender process postgres [local'.
> Program terminated with signal SIGABRT, Aborted.
> ...
> #2  0x008537b7 in ExceptionalCondition
> (conditionName=conditionName@entry=0x9fcdf6 "!(txn->final_lsn != 0)",
> errorType=errorType@entry=0x89bcb4 "FailedAssertion",
> fileName=fileName@entry=0x9fcd04 "reorderbuffer.c",
> lineNumber=lineNumber@entry=2576) at assert.c:54
> #3  0x006fec02 in ReorderBufferRestoreCleanup
> (rb=rb@entry=0x1b4c370, txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:2576
> #4  0x00700693 in ReorderBufferCleanupTXN (rb=rb@entry=0x1b4c370,
> txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:1145
> #5  0x00701516 in ReorderBufferAbortOld (rb=0x1b4c370,
> oldestRunningXid=558) at reorderbuffer.c:1730
> #6  0x006f5a47 in DecodeStandbyOp (ctx=0x1af9ce0,
> buf=buf@entry=0x7ffd11761200) at decode.c:325
> #7  0x006f65bf in LogicalDecodingProcessRecord (ctx=,
> record=) at decode.c:117
> #8  0x007098ab in XLogSendLogical () at walsender.c:2766
> #9  0x0070a875 in WalSndLoop (send_data=send_data@entry=0x709857
> ) at walsender.c:2134
> #10 0x0070b011 in StartLogicalReplication (cmd=cmd@entry=0x1a9cd68)
> at walsender.c:1101
> #11 0x0070b46f in exec_replication_command
> (cmd_string=cmd_string@entry=0x1afec30 "START_REPLICATION SLOT \"sub\"
> LOGICAL 0/0 (proto_version '1', publication_names '\"pub\"')") at
> walsender.c:1527
> #12 0x00758809 in PostgresMain (argc=,
> argv=argv@entry=0x1aab870, dbname=, username=)
> at postgres.c:4086
> #13 0x006e178d in BackendRun (port=port@entry=0x1aa3430) at
> postmaster.c:4357
> #14 0x006e35e9 in BackendStartup (port=port@entry=0x1aa3430) at
> postmaster.c:4029
> #15 0x006e39e3 in ServerLoop () at postmaster.c:1753
> #16 0x006e4b36 in PostmasterMain (argc=argc@

Re: Logical Replication and triggers

2017-11-21 Thread Thomas Rosenstein
On 22 November 2017 at 08:35, Robert Haas  
wrote:
On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs  
wrote:

I would have acted on this myself a few days back if I thought the
patch was correct, but I see multiple command counter increments
there, so suspect an alternate fix is correct.


Oh, well, I'm glad you said something.  I was actually thinking about
committing the patch Peter posted in
http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
because it looks correct to me, but I'm certainly not an expert on
that code so I'll wait for your analysis.


OK, I've re-reviewed it and found it good, so have committed it.

--
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Thanks everyone for getting this done.



Re: [HACKERS] [WIP] Zipfian distribution in pgbench

2017-11-21 Thread Fabien COELHO


Note that the patch may interact with other patches which add functions to
pgbench, so it might need a rebase depending on the order in which the patches
are applied.


Attached a minor rebase after 16827d4424.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index f6e93c3..f119ae6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1093,6 +1093,14 @@ pgbench  options  d
an integer between 1 and 10
   
   
+   random_zipfian(lb, ub, parameter)
+   integer
+   Zipfian-distributed random integer in [lb, ub],
+  see below
+   random_zipfian(1, 10, 1.5)
+   an integer between 1 and 10
+  
+  
sqrt(x)
double
square root
@@ -1173,6 +1181,27 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
   of the Box-Muller transform.
  
 
+
+ 
+  random_zipfian generates an approximated bounded zipfian
+  distribution. For parameter in (0, 1), an
+  approximated algorithm is taken from
+  "Quickly Generating Billion-Record Synthetic Databases",
+  Jim Gray et al, SIGMOD 1994. For parameter
+  in (1, 1000), a rejection method is used, based on
+  "Non-Uniform Random Variate Generation", Luc Devroye, p. 550-551,
+  Springer 1986. The distribution is not defined when the parameter's
+  value is 1.0. The drawing performance is poor for parameter values
+  close and above 1.0 and on a small range.
+ 
+ 
+  parameter
+  defines how skewed the distribution is. The larger the parameter, the more
+  frequently values to the beginning of the interval are drawn.
+  The closer to 0 parameter is,
+  the flatter (more uniform) the access distribution.
+ 
+

 
   
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y
index b3a2d9b..25d5ad4 100644
--- a/src/bin/pgbench/exprparse.y
+++ b/src/bin/pgbench/exprparse.y
@@ -191,6 +191,9 @@ static const struct
 	{
 		"random_exponential", 3, PGBENCH_RANDOM_EXPONENTIAL
 	},
+	{
+		"random_zipfian", 3, PGBENCH_RANDOM_ZIPFIAN
+	},
 	/* keep as last array element */
 	{
 		NULL, 0, 0
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index bd96eae..7ce6f60 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -95,7 +95,10 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define ZIPF_CACHE_SIZE	15		/* cache cells number */
+
 #define MIN_GAUSSIAN_PARAM		2.0 /* minimum parameter for gauss */
+#define MAX_ZIPFIAN_PARAM		1000	/* maximum parameter for zipfian */
 
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
@@ -331,6 +334,35 @@ typedef struct
 } CState;
 
 /*
+ * Cache cell for zipfian_random call
+ */
+typedef struct
+{
+	/* cell keys */
+	double		s;/* s - parameter of zipfan_random function */
+	int64		n;/* number of elements in range (max - min + 1) */
+
+	double		harmonicn;		/* generalizedHarmonicNumber(n, s) */
+	double		alpha;
+	double		beta;
+	double		eta;
+
+	uint64		last_used;		/* last used logical time */
+} ZipfCell;
+
+/*
+ * Zipf cache for zeta values
+ */
+typedef struct
+{
+	uint64		current;		/* counter for LRU cache replacement algorithm */
+
+	int			nb_cells;		/* number of filled cells */
+	int			overflowCount;	/* number of cache overflows */
+	ZipfCell	cells[ZIPF_CACHE_SIZE];
+} ZipfCache;
+
+/*
  * Thread state
  */
 typedef struct
@@ -342,6 +374,8 @@ typedef struct
 	unsigned short random_state[3]; /* separate randomness for each thread */
 	int64		throttle_trigger;	/* previous/next throttling (us) */
 	FILE	   *logfile;		/* where to log, or NULL */
+	ZipfCache	zipf_cache;		/* for thread-safe  zipfian random number
+ * generation */
 
 	/* per thread collected stats */
 	instr_time	start_time;		/* thread start time */
@@ -746,6 +780,137 @@ getPoissonRand(TState *thread, int64 center)
 	return (int64) (-log(uniform) * ((double) center) + 0.5);
 }
 
+/* helper function for getZipfianRand */
+static double
+generalizedHarmonicNumber(int64 n, double s)
+{
+	int			i;
+	double		ans = 0.0;
+
+	for (i = n; i > 1; i--)
+		ans += pow(i, -s);
+	return ans + 1.0;
+}
+
+/* set harmonicn and other parameters to cache cell */
+static void
+zipfSetCacheCell(ZipfCell * cell, int64 n, double s)
+{
+	double		harmonic2;
+
+	cell->n = n;
+	cell->s = s;
+
+	harmonic2 = generalizedHarmonicNumber(2, s);
+	cell->harmonicn = generalizedHarmonicNumber(n, s);
+
+	cell->alpha = 1.0 / (1.0 - s);
+	cell->beta = pow(0.5, s);
+	cell->eta = (1.0 - pow(2.0 / n, 1.0 - s)) / (1.0 - harmonic2 / cell->harmonicn);
+}
+
+/*
+ * search for cache cell with keys (n, s)
+ * and create new cell if it does not exist
+ */
+static ZipfCell *
+zipfFindOrCreateCacheCell(ZipfCache * cache, int64 n, double s)
+{
+	int			i,
+	

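For readers skimming the patch, a minimal standalone sketch of the bounded
zipfian draw may help. This is not the patch code itself: it shows the shape
of the Gray et al. approximation whose per-(n, s) constants (harmonicn,
alpha, beta, eta) the ZipfCell above caches, recomputing them here on every
call for clarity; the function name and the caller-supplied uniform value are
illustrative assumptions.

#include <math.h>
#include <stdint.h>

/*
 * Sketch: bounded zipfian draw on [1, n] using the approximation from
 * "Quickly Generating Billion-Record Synthetic Databases" (Gray et al,
 * SIGMOD 1994), valid for parameter s in (0, 1).  "uniform" is a uniform
 * random double in [0, 1).
 */
static int64_t
zipfian_draw_sketch(int64_t n, double s, double uniform)
{
	double		harmonicn = 0.0;
	double		harmonic2;
	double		alpha;
	double		eta;
	double		uz;
	int64_t		i;

	/* generalized harmonic number H(n, s), as in generalizedHarmonicNumber() */
	for (i = n; i >= 1; i--)
		harmonicn += pow((double) i, -s);
	harmonic2 = 1.0 + pow(2.0, -s);

	alpha = 1.0 / (1.0 - s);
	eta = (1.0 - pow(2.0 / (double) n, 1.0 - s)) /
		(1.0 - harmonic2 / harmonicn);

	uz = uniform * harmonicn;
	if (uz < 1.0)
		return 1;
	if (uz < 1.0 + pow(0.5, s))		/* pow(0.5, s) is the cached "beta" */
		return 2;
	return 1 + (int64_t) ((double) n *
						  pow(eta * uniform - eta + 1.0, alpha));
}

The O(n) harmonic-number loop is exactly what the ZipfCache in the patch
exists to amortize across calls; pgbench presumably then only has to shift
the result onto [lb, ub].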
Re: With commit 4e5fe9ad19, range partition missing handling for the NULL partition key

2017-11-21 Thread Amit Langote
Hi Rushabh,

On 2017/11/22 13:45, Rushabh Lathia wrote:
> Consider the below test:
> 
> CREATE TABLE range_tab(a int, b int) PARTITION BY RANGE(a);
> CREATE TABLE range_tab_p1 PARTITION OF range_tab FOR VALUES FROM (minvalue)
> TO (10);
> CREATE TABLE range_tab_p2 PARTITION OF range_tab FOR VALUES FROM (10) TO
> (20);
> CREATE TABLE range_tab_p3 PARTITION OF range_tab FOR VALUES FROM (20) TO
> (maxvalue);
> 
> INSERT INTO range_tab VALUES(NULL, 10);
> 
> Above insert should fail with an error "no partition of relation found for
> row".
> 
> Looking further I found that this behaviour changed after the commit below:
> 
> commit 4e5fe9ad19e14af360de7970caa8b150436c9dec
> Author: Robert Haas 
> Date:   Wed Nov 15 10:23:28 2017 -0500
> 
> Centralize executor-related partitioning code.
> 
> Some code is moved from partition.c, which has grown very quickly
> lately;
> splitting the executor parts out might help to keep it from getting
> totally out of control.  Other code is moved from execMain.c.  All is
> moved to a new file execPartition.c.  get_partition_for_tuple now has
> a new interface that more clearly separates executor concerns from
> generic concerns.
> 
> Amit Langote.  A slight comment tweak by me.
> 
> Before the above commit, an insert with a NULL partition key into a range
> partition threw a proper error.

Oops, good catch.

> Attaching patch to fix as well as regression test.

Thanks for the patch.  About the code, how about doing it like the attached
instead?

Thanks,
Amit
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 9a44cceb22..a87dacc057 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2536,11 +2536,10 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 			 */
 			for (i = 0; i < key->partnatts; i++)
 			{
-				if (isnull[i] &&
-					partition_bound_has_default(partdesc->boundinfo))
+				if (isnull[i])
 				{
 					range_partkey_has_null = true;
-					part_index = partdesc->boundinfo->default_index;
+					break;
 				}
 			}
 
@@ -2560,6 +2559,8 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 					 */
 					part_index = partdesc->boundinfo->indexes[bound_offset + 1];
 				}
+				else
+					part_index = partdesc->boundinfo->default_index;
 			}
 			break;
 
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index 9d84ba4658..a0e3746e70 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -642,6 +642,10 @@ create table mcrparted2 partition of mcrparted for values from (10, 6, minvalue)
 create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20, 10, 10);
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+ERROR:  no partition of relation "mcrparted" found for row
+DETAIL:  Partition key of the failing row contains (a, abs(b), c) = (null, null, null).
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 791817ba50..1c4491a6a2 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -417,6 +417,9 @@ create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
 
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 12:15, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier <
> michael.paqu...@gmail.com> wrote in  tzwhp2oeervv7r...@mail.gmail.com>
> > On Wed, Nov 22, 2017 at 11:49 AM, Craig Ringer 
> wrote:
> > > On 20 November 2017 at 18:35, atorikoshi
> > >  wrote:
> > >> I put many queries into one transaction and made ReorderBuffer spill
> > >> data to disk, and sent SIGKILL to postgres before the end of the
> > >> transaction.
> > >>
> > >> After starting up postgres again, I observed the files spilled to
> > >> disk weren't deleted.
> > >
> > > Since this can only happen  on crash exits, and the reorderbuffer data
> is
> > > useless after a decoding backend exits, why don't we just recursively
> delete
> > > the tree contents on Pg startup?
> >
> > +1. You would just need an extra step after say
> > DeleteAllExportedSnapshotFiles() in startup.c. Looks saner to me to do
> > so as well.
>
> The old files are being removed at startup by
> StartupReorderBuffer.
>

That seems to contradict the statement above, that "after starting up
postgres again, I observed the files spilled to disk weren't deleted".


> At the time of assertion failure, the files are not of the
> previous run, but they are created after reconnection from the
> subscriber.
>

Are you saying the problem files do not exist when we start up, but are
created and then leaked after logical decoding resumes after an unclean
startup?

... Yes, that does appear to be the case, per the original report:

"7. After a while, we can see the files remaining.
  (Immediately after starting the publisher, we cannot see these files.)"

I was confused by "remaining". They're not remaining, they've been
re-created.

But if they're re-created, why are they not recreated correctly after an
unclean shutdown? What persistent state is causing that? We should be
clobbering saved reorder buffer temp files, snapshots, etc at startup. The
slot state is pretty simple, it'd just be a bit behind.

The key difference seems to be that we hard-kill the server so it can't
write anything to clog. The xact is implicitly aborted, we never wrote any
xlog record for a commit or abort. The problem is presumably with decoding
xacts that were implicitly aborted by server crash, after we restart the
server and resume decoding.

The assertion failure reported is in ReorderBufferRestoreCleanup, which
makes sense.

Because there's no xlog record of the crash, we never set the buffer's
final_lsn in ReorderBufferCommit or ReorderBufferAbort .

Note the comment there:

 * NB: Transactions handled here have to have actively aborted (i.e. have
 * produced an abort record). Implicitly aborted transactions are handled
via
 * ReorderBufferAbortOld(); transactions we're just not interested in, but
 * which have committed are handled in ReorderBufferForget().

That's only called from DecodeStandbyOp in response to an xl_running_xacts.

Here's the backtrace.

Core was generated by `postgres: wal sender process postgres [local'.
Program terminated with signal SIGABRT, Aborted.
...
#2  0x008537b7 in ExceptionalCondition
(conditionName=conditionName@entry=0x9fcdf6 "!(txn->final_lsn != 0)",
errorType=errorType@entry=0x89bcb4 "FailedAssertion",
fileName=fileName@entry=0x9fcd04 "reorderbuffer.c",
lineNumber=lineNumber@entry=2576) at assert.c:54
#3  0x006fec02 in ReorderBufferRestoreCleanup (rb=rb@entry=0x1b4c370,
txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:2576
#4  0x00700693 in ReorderBufferCleanupTXN (rb=rb@entry=0x1b4c370,
txn=txn@entry=0x1b5c3b8) at reorderbuffer.c:1145
#5  0x00701516 in ReorderBufferAbortOld (rb=0x1b4c370,
oldestRunningXid=558) at reorderbuffer.c:1730
#6  0x006f5a47 in DecodeStandbyOp (ctx=0x1af9ce0,
buf=buf@entry=0x7ffd11761200)
at decode.c:325
#7  0x006f65bf in LogicalDecodingProcessRecord (ctx=, record=) at decode.c:117
#8  0x007098ab in XLogSendLogical () at walsender.c:2766
#9  0x0070a875 in WalSndLoop (send_data=send_data@entry=0x709857
) at walsender.c:2134
#10 0x0070b011 in StartLogicalReplication (cmd=cmd@entry=0x1a9cd68)
at walsender.c:1101
#11 0x0070b46f in exec_replication_command
(cmd_string=cmd_string@entry=0x1afec30 "START_REPLICATION SLOT \"sub\"
LOGICAL 0/0 (proto_version '1', publication_names '\"pub\"')") at
walsender.c:1527
#12 0x00758809 in PostgresMain (argc=,
argv=argv@entry=0x1aab870, dbname=, username=) at postgres.c:4086
#13 0x006e178d in BackendRun (port=port@entry=0x1aa3430) at
postmaster.c:4357
#14 0x006e35e9 in BackendStartup (port=port@entry=0x1aa3430) at
postmaster.c:4029
#15 0x006e39e3 in ServerLoop () at postmaster.c:1753
#16 0x006e4b36 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x1a7c5d0)
at postmaster.c:1361
#17 0x0065d093 in main (argc=3, argv=0x1a7c5d0) at main.c:228

So it's clear why we can call ReorderBufferCleanupOld with no fina

Re: Logical Replication and triggers

2017-11-21 Thread Simon Riggs
On 22 November 2017 at 08:35, Robert Haas  wrote:
> On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs  wrote:
>> I would have acted on this myself a few days back if I thought the
>> patch was correct, but I see multiple command counter increments
>> there, so suspect an alternate fix is correct.
>
> Oh, well, I'm glad you said something.  I was actually thinking about
> committing the patch Peter posted in
> http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
> because it looks correct to me, but I'm certainly not an expert on
> that code so I'll wait for your analysis.

OK, I've re-reviewed it and found it good, so have committed it.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Issues with logical replication

2017-11-21 Thread Simon Riggs
On 4 October 2017 at 10:35, Petr Jelinek  wrote:
> On 02/10/17 18:59, Petr Jelinek wrote:
>>>
>>> Now fix the trigger function:
>>> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
>>> BEGIN
>>>   RETURN NEW;
>>> END $$ LANGUAGE plpgsql;
>>>
>>> And manually perform at master two updates inside one transaction:
>>>
>>> postgres=# begin;
>>> BEGIN
>>> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
>>> UPDATE 1
>>> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
>>> UPDATE 1
>>> postgres=# commit;
>>> 
>>>
>>> and in replica log we see:
>>> 2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply
>>> worker for subscription "sub" has started
>>> 2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible
>>> tuple
>>> 2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical
>>> replication worker for subscription 16403 (PID 2954) exited with exit
>>> code 1
>>>
>>> Error happens in trigger.c:
>>>
>>> #3  0x0069bddb in GetTupleForTrigger (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
>>> lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
>>> trigger.c:3103
>>> #4  0x0069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
>>> fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
>>> #5  0x006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
>>> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
>>> at execReplication.c:461
>>> #6  0x00820894 in apply_handle_update (s=0x7ffc442163b0) at
>>> worker.c:736
>>
>> We have locked the same tuple in RelationFindReplTupleByIndex() just
>> before this gets called and didn't get the same error. I guess we do
>> something wrong with snapshots. Will need to investigate more.
>>
>
> Okay, so it's not snapshot. It's the fact that we don't set the
> es_output_cid in replication worker which GetTupleForTrigger is using
> when locking the tuple. Attached one-liner fixes it.

Applied, thanks

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
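For anyone following along, the one-liner under discussion boils down to
giving the apply worker's executor state a real command id before triggers
run. A sketch of the idea (where exactly it lands in worker.c is my
assumption, not a quote of the committed patch):

	/*
	 * Sketch: when the logical replication apply worker builds the EState it
	 * uses to apply a remote change, assign it a proper CommandId.  Without
	 * this, trigger code such as GetTupleForTrigger() locks the tuple with an
	 * unset es_output_cid, which is what produced the "attempted to lock
	 * invisible tuple" error reported up-thread.
	 */
	estate->es_output_cid = GetCurrentCommandId(true);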



With commit 4e5fe9ad19, range partition missing handling for the NULL partition key

2017-11-21 Thread Rushabh Lathia
Consider the below test:

CREATE TABLE range_tab(a int, b int) PARTITION BY RANGE(a);
CREATE TABLE range_tab_p1 PARTITION OF range_tab FOR VALUES FROM (minvalue)
TO (10);
CREATE TABLE range_tab_p2 PARTITION OF range_tab FOR VALUES FROM (10) TO
(20);
CREATE TABLE range_tab_p3 PARTITION OF range_tab FOR VALUES FROM (20) TO
(maxvalue);

INSERT INTO range_tab VALUES(NULL, 10);

Above insert should fail with an error "no partition of relation found for
row".

Looking further I found that this behaviour changed after the commit below:

commit 4e5fe9ad19e14af360de7970caa8b150436c9dec
Author: Robert Haas 
Date:   Wed Nov 15 10:23:28 2017 -0500

Centralize executor-related partitioning code.

Some code is moved from partition.c, which has grown very quickly
lately;
splitting the executor parts out might help to keep it from getting
totally out of control.  Other code is moved from execMain.c.  All is
moved to a new file execPartition.c.  get_partition_for_tuple now has
a new interface that more clearly separates executor concerns from
generic concerns.

Amit Langote.  A slight comment tweak by me.

Before the above commit, an insert with a NULL partition key into a range
partition threw a proper error.

postgres@112171=#INSERT INTO range_tab VALUES(NULL, 10);
ERROR:  no partition of relation "range_tab" found for row
DETAIL:  Partition key of the failing row contains (a) = (null).

Looking at the code in partition_bound_cmp(), before commit 4e5fe9ad19 there
was a condition for the null values:

 /*
  * No range includes NULL, so this will be accepted by the
  * default partition if there is one, and otherwise
  * rejected.
  */
 for (i = 0; i < key->partnatts; i++)
 {
     if (isnull[i] &&
         partition_bound_has_default(partdesc->boundinfo))
     {
         range_partkey_has_null = true;
         break;
     }
     else if (isnull[i])
     {
         *failed_at = parent;
         *failed_slot = slot;
         result = -1;
         goto error_exit;
     }
 }

But after the commit, the condition for isnull is missing.  That doesn't look
intentional, does it?

Attaching patch to fix as well as regression test.

Thanks,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 67d4c2a..b62e8f5 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -2541,6 +2541,11 @@ get_partition_for_tuple(Relation relation, Datum *values, bool *isnull)
 		range_partkey_has_null = true;
 		part_index = partdesc->boundinfo->default_index;
 	}
+	else if(isnull[i])
+	{
+		range_partkey_has_null = true;
+		part_index = -1;
+	}
 }
 
 if (!range_partkey_has_null)
diff --git a/src/test/regress/expected/insert.out b/src/test/regress/expected/insert.out
index 9d84ba4..a0e3746 100644
--- a/src/test/regress/expected/insert.out
+++ b/src/test/regress/expected/insert.out
@@ -642,6 +642,10 @@ create table mcrparted2 partition of mcrparted for values from (10, 6, minvalue)
 create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20, 10, 10);
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+ERROR:  no partition of relation "mcrparted" found for row
+DETAIL:  Partition key of the failing row contains (a, abs(b), c) = (null, null, null).
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);
diff --git a/src/test/regress/sql/insert.sql b/src/test/regress/sql/insert.sql
index 791817b..1c4491a 100644
--- a/src/test/regress/sql/insert.sql
+++ b/src/test/regress/sql/insert.sql
@@ -417,6 +417,9 @@ create table mcrparted3 partition of mcrparted for values from (11, 1, 1) to (20
 create table mcrparted4 partition of mcrparted for values from (21, minvalue, minvalue) to (30, 20, maxvalue);
 create table mcrparted5 partition of mcrparted for values from (30, 21, 20) to (maxvalue, maxvalue, maxvalue);
 
+-- null not allowed in range partition
+insert into mcrparted values (null, null, null);
+
 -- routed to mcrparted0
 insert into mcrparted values (0, 1, 1);
 insert into mcrparted0 values (0, 1, 1);


Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE

2017-11-21 Thread Jing Wang
Hi Nathan,

Thanks for review comments.

Enclosed please find the patch which has been updated according to your
suggestion.

CURRENT_DATABASE can be used in the following SQL statements, and people can
find the information in the sgml files:
1. COMMENT ON DATABASE CURRENT_DATABASE is ...
2. ALTER DATABASE CURRENT_DATABASE OWNER to ...
3. ALTER DATABASE CURRENT_DATABASE SET parameter ...
4. ALTER DATABASE CURRENT_DATABASE RESET parameter ...
5. ALTER DATABASE CURRENT_DATABASE RESET ALL
6. SELECT CURRENT_DATABASE
7. SECURITY LABEL ON DATABASE CURRENT_DATABASE

As you mentioned, database_name is also present in GRANT/REVOKE/ALTER ROLE,
so a patch will be submitted later to support CURRENT_DATABASE in those SQL
statements.

Regards,
Jing Wang
Fujitsu Australia


comment_on_current_database_no_pgdump_v4.4.patch
Description: Binary data


comment_on_current_database_for_pgdump_v4.4.patch
Description: Binary data


Re: [HACKERS] Issues with logical replication

2017-11-21 Thread Craig Ringer
On 4 October 2017 at 07:35, Petr Jelinek 
wrote:

> On 02/10/17 18:59, Petr Jelinek wrote:
> >>
> >> Now fix the trigger function:
> >> CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER
> AS $$
> >> BEGIN
> >>   RETURN NEW;
> >> END $$ LANGUAGE plpgsql;
> >>
> >> And manually perform at master two updates inside one transaction:
> >>
> >> postgres=# begin;
> >> BEGIN
> >> postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
> >> UPDATE 1
> >> postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
> >> UPDATE 1
> >> postgres=# commit;
> >> 
> >>
> >> and in replica log we see:
> >> 2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply
> >> worker for subscription "sub" has started
> >> 2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible
> >> tuple
> >> 2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical
> >> replication worker for subscription 16403 (PID 2954) exited with exit
> >> code 1
> >>
> >> Error happens in trigger.c:
> >>
> >> #3  0x0069bddb in GetTupleForTrigger (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
> >> lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at
> >> trigger.c:3103
> >> #4  0x0069b259 in ExecBRUpdateTriggers (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,
> >> fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
> >> #5  0x006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50,
> >> epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)
> >> at execReplication.c:461
> >> #6  0x00820894 in apply_handle_update (s=0x7ffc442163b0) at
> >> worker.c:736
> >
> > We have locked the same tuple in RelationFindReplTupleByIndex() just
> > before this gets called and didn't get the same error. I guess we do
> > something wrong with snapshots. Will need to investigate more.
> >
>
> Okay, so it's not snapshot. It's the fact that we don't set the
> es_output_cid in replication worker which GetTupleForTrigger is using
> when locking the tuple. Attached one-liner fixes it.
>

This seems like a clear-cut bug with a simple fix.

Lets get this committed, so we don't lose it. The rest of the thread is
going off into the weeds a bit issues unrelated to the original problem.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
At Wed, 22 Nov 2017 12:57:34 +0900, Michael Paquier  
wrote in 
> On Wed, Nov 22, 2017 at 11:49 AM, Craig Ringer  wrote:
> > On 20 November 2017 at 18:35, atorikoshi
> >  wrote:
> >> I put many queries into one transaction and made ReorderBuffer spill
> >> data to disk, and sent SIGKILL to postgres before the end of the
> >> transaction.
> >>
> >> After starting up postgres again, I observed the files spilled to
> >> disk weren't deleted.
> >
> > Since this can only happen  on crash exits, and the reorderbuffer data is
> > useless after a decoding backend exits, why don't we just recursively delete
> > the tree contents on Pg startup?
> 
> +1. You would just need an extra step after say
> DeleteAllExportedSnapshotFiles() in startup.c. Looks saner to me to do
> so as well.

The old files are being removed at startup by
StartupReorderBuffer.

At the time of assertion failure, the files are not of the
previous run, but they are created after reconnection from the
subscriber.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] More stats about skipped vacuums

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 1:08 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier 
>  wrote in 
> 
>> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
>>  wrote:
>> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
>> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
>> > both VACUUM command and autovacuum, while last_vacuum and
>> > vacuum_count is mentioning only the command. Splitting it into
>> > vacuum/autovacuum seems nonsense but the name is confusing. Do
>> > you have any idea?
>>
>> Hm. I think that you should actually have two fields, one for manual
>> vacuum and one for autovacuum, because each is tied to respectively
>> maintenance_work_mem and autovacuum_work_mem. This way admins are able
>
> It's very convincing for me. Thanks for the suggestion.
>
>> to tune each one of those parameters depending on a look at
>> pg_stat_all_tables. So those should be named perhaps
>> last_vacuum_index_scans and last_autovacuum_index_scans?
>
> Agreed. I'll do so in the next version.

Thanks for considering the suggestion.

> # I forgot to add the version to the patch files...

Don't worry about that. That's not a problem for me; I'll just keep
track of the last entry. With the room I have, I'll keep focused on
0001, by the way. Others are of course welcome to look at 0002 and
onwards.
-- 
Michael



Re: [HACKERS] More stats about skipped vacuums

2017-11-21 Thread Kyotaro HORIGUCHI
Hello,

At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier  
wrote in 
> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
>  wrote:
> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
> > both VACUUM command and autovacuum, while last_vacuum and
> > vacuum_count is mentioning only the command. Splitting it into
> > vacuum/autovacuum seems nonsense but the name is confusing. Do
> > you have any idea?
> 
> Hm. I think that you should actually have two fields, one for manual
> vacuum and one for autovacuum, because each is tied to respectively
> maintenance_work_mem and autovacuum_work_mem. This way admins are able

It's very convincing for me. Thanks for the suggestion.

> to tune each one of those parameters depending on a look at
> pg_stat_all_tables. So those should be named perhaps
> last_vacuum_index_scans and last_autovacuum_index_scans?

Agreed. I'll do so in the next version.

# I forgot to add the version to the patch files...

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 11:49 AM, Craig Ringer  wrote:
> On 20 November 2017 at 18:35, atorikoshi
>  wrote:
>> I put many queries into one transaction and made ReorderBuffer spill
>> data to disk, and sent SIGKILL to postgres before the end of the
>> transaction.
>>
>> After starting up postgres again, I observed the files spilled to
>> disk weren't deleted.
>
> Since this can only happen  on crash exits, and the reorderbuffer data is
> useless after a decoding backend exits, why don't we just recursively delete
> the tree contents on Pg startup?

+1. You would just need an extra step after say
DeleteAllExportedSnapshotFiles() in startup.c. Looks saner to me to do
so as well.
-- 
Michael



generic-msvc.h(91): error C2664

2017-11-21 Thread Vicky Vergara
Hello all


I am trying to compile pgRouting on appveyor CI when compiling with pg >= 9.5


I am getting the following error:


C:\Program Files\PostgreSQL\9.5\include\server\port/atomics/generic-msvc.h(91): 
error C2664: '__int64 _InterlockedCompareExchange64(volatile __int64 
*,__int64,__int64)' : cannot convert argument 1 from 'volatile uint64 *' to 
'volatile __int64 *' [C:\build\pgrouting\build\pg95\x64\src\tsp\tsp.vcxproj]


This error happens when C++ code that needs palloc is being compiled.

There is no problem when compiling C code.
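If it helps to see the mismatch in isolation, here is a stripped-down
illustration (my own example, not the PostgreSQL or pgRouting sources) of
what the compiler is objecting to:

#include <intrin.h>

typedef unsigned __int64 uint64;	/* assuming the usual MSVC-side definition */

void
demo(void)
{
	volatile uint64 val = 0;

	/*
	 * _InterlockedCompareExchange64() takes a 'volatile __int64 *'.  Passing
	 * '&val' (a 'volatile uint64 *') directly is accepted by the C compiler
	 * but rejected by the C++ compiler with error C2664, because C++ does not
	 * allow the implicit signed/unsigned pointer conversion.  An explicit
	 * cast, as below, compiles in both.
	 */
	_InterlockedCompareExchange64((volatile __int64 *) &val, 1, 0);
}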


In the documentation of _InterlockedExchangeAdd64 [1] there is no uint64 *
type.

This code [2] uses [3], where uint64 is used.


Maybe I am doing something wrong when including some postgres files in these
[4] & [5].


A sample compilation with this error is in [6] and there is no problem at all 
with pg 9.4 [7]


Any feedback would be greatly appreciated

Vicky


[1] 
https://docs.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions

[2] 
https://github.com/postgres/postgres/blob/REL9_5_STABLE/src/include/port/atomics/generic-msvc.h#L101

[3] 
https://github.com/postgres/postgres/blob/REL9_5_STABLE/src/include/port/atomics/generic-msvc.h#L43

[4] 
https://github.com/cvvergara/pgrouting/blob/appveyor-pg95/include/cpp_common/pgr_alloc.hpp

[5] 
https://github.com/cvvergara/pgrouting/blob/appveyor-pg95/include/c_common/postgres_connection.h

[6] https://ci.appveyor.com/project/cvvergara/pgrouting/build/2.6.1936#L415

[7] https://ci.appveyor.com/project/cvvergara/pgrouting/build/2.6.1934




Re: Logical Replication and triggers

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 05:35, Robert Haas  wrote:

> On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs 
> wrote:
> > I would have acted on this myself a few days back if I thought the
> > patch was correct, but I see multiple command counter increments
> > there, so suspect an alternate fix is correct.
>
> Oh, well, I'm glad you said something.  I was actually thinking about
> committing the patch Peter posted in
> http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
> because it looks correct to me, but I'm certainly not an expert on
> that code so I'll wait for your analysis.
>

I'll read over the patch and respond on that thread.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Logical Replication and triggers

2017-11-21 Thread Craig Ringer
On 22 November 2017 at 02:27, Robert Haas  wrote:

> On Sat, Nov 18, 2017 at 7:30 PM, Craig Ringer 
> wrote:
> > On 15 November 2017 at 21:12, Thomas Rosenstein
> >  wrote:
> >> I would like somebody to consider Petr Jelineks patch for worker.c from
> >> here
> >> (https://www.postgresql.org/message-id/619c557d-93e6-1833-
> 1692-b010b176ff77%402ndquadrant.com)
> >>
> >> I'm was facing the same issue with 10.1 and BEFORE INSERT OR UPDATE
> >> triggers.
> >
> > Please:
> >
> > - Apply it to current HEAD
> > - Test its functionality
> > - Report back on the patch thread
> > - Update the commitfest app with your results and sign on as a reviewer
> > - If you're able, read over the patch and make any comments you can
> >
> > "Somebody" needs to be you, if you want this functionality.
>
> You realize we're talking about a bug fix, right?  And for a feature
> that was developed and committed by your colleagues?
>

I did not realise it was a bug fix, and I agree that changes things.

There was discussion around the same time about people wanting extra trigger
features, and I incorrectly assumed this was about that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Craig Ringer
On 20 November 2017 at 18:35, atorikoshi  wrote:

> Hi,
>
> I put many queries into one transaction and made ReorderBuffer spill
> data to disk, and sent SIGKILL to postgres before the end of the
> transaction.
>
> After starting up postgres again, I observed the files spilled to
> disk weren't deleted.


Since this can only happen  on crash exits, and the reorderbuffer data is
useless after a decoding backend exits, why don't we just recursively
delete the tree contents on Pg startup?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
Hi,

At Wed, 22 Nov 2017 10:10:27 +0900, Masahiko Sawada  
wrote in 
> >> Using last changing LSN might work but I'm afraid that that fails
> >> to remove the last snap file if the crash happens at the very
> >> start of a segment.
> 
> I think it works even in that case because we can compute the
> largest WAL segment number that we need to remove by using the lsn of
> the last change in the old transaction. Am I missing something?

I was concerned about the window that could leave an empty file in
ReorderBufferSerializeTXN. But I had a closer look and found that
the snap files for old transactions are not from the previous run,
but are rebuilt from WAL records after restart. So there cannot be an
empty file there.

I'm convinced that it is the proper way to deal with this problem.

> >> Anyway all files of the transaction is no longer useless at the
> >> time, but it seems that the last_lsn is required to avoid
> >> directory scanning at every transaction end.
> >>
> >> Letting ReorderBufferAbortOld scan the directory and determine
> >> the first and last LSN then set to the txn would work but it
> >> might be an overkill. Using the beginning LSN of the next segment
> >> of the last_change->lsn could surely work... really?
> >> (ReorderBufferRestoreCleanup doesn't complain on ENOENT.)
> >
> > Somehow I deleted exessively while editing. One more possible
> > solution is making ReorderBufferAbortOld take final LSN and
> > DecodeStandbyOp passes the LSN of XLOG_RUNNING_XACTS record to
> > it.
> >
> 
> Setting final_lsn in ReorderBufferAbortOld seems good to me but I'm
> not sure we can use the lsn of XLOG_RUNNING_XACTS record. Doesn't
> ReorderBufferRestoreCleanup() raise an error due to ENOENT if the wal

It no longer matters but the function does *not* raise an error
on ENOENT.

> segment having XLOG_RUNNING_XACTS records doesn't have any changes of
> the old transaction?

Since the transaction never meets an abort record, any larger LSN can
work as final_lsn, and that record's LSN is guaranteed to be larger. But
anyway, I agree that the last_change->lsn is more appropriate.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
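To make that concrete, here is a minimal sketch of the fallback being
discussed (not the exact posted patch; where it runs, and whether the
in-memory change list is still populated at that point, are assumptions):

	/*
	 * Sketch: a transaction implicitly aborted by a crash never sees a commit
	 * or abort record, so its final_lsn is still invalid when we clean it up.
	 * Fall back to the LSN of its last change so that
	 * ReorderBufferRestoreCleanup() knows up to which segment spill files may
	 * exist.
	 */
	if (txn->final_lsn == InvalidXLogRecPtr && !dlist_is_empty(&txn->changes))
	{
		ReorderBufferChange *last =
			dlist_tail_element(ReorderBufferChange, node, &txn->changes);

		txn->final_lsn = last->lsn;
	}

For a transaction whose changes have all been spilled, the in-memory list may
already be empty, which is why the alternative of passing the
XLOG_RUNNING_XACTS record's LSN down from DecodeStandbyOp() was floated above.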




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-11-21 Thread Masahiko Sawada
On Wed, Nov 22, 2017 at 5:25 AM, Robert Haas  wrote:
> On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada  
> wrote:
>> Attached updated version patch. I've moved only relation extension
>> locks out of heavy-weight lock as per discussion so far.
>>
>> I've done a write-heavy benchmark on my laptop: loading 24kB of data into
>> one table using COPY with 1 client, for 10 seconds. The throughput of the
>> patched version is 10% better than current HEAD. The results of 5 runs are
>> the following.
>>
>> - PATCHED -
>> tps = 178.791515 (excluding connections establishing)
>> tps = 176.522693 (excluding connections establishing)
>> tps = 168.705442 (excluding connections establishing)
>> tps = 158.158009 (excluding connections establishing)
>> tps = 161.145709 (excluding connections establishing)
>>
>> - HEAD -
>> tps = 147.079803 (excluding connections establishing)
>> tps = 149.079540 (excluding connections establishing)
>> tps = 149.082275 (excluding connections establishing)
>> tps = 148.255376 (excluding connections establishing)
>> tps = 145.542552 (excluding connections establishing)
>>
>> Also I've done a micro-benchmark; calling LockRelationForExtension and
>> UnlockRelationForExtension tightly in order to measure the number of
>> lock/unlock cycles per second. The result is,
>> PATCHED = 3.95892e+06 (cycles/sec)
>> HEAD = 1.15284e+06 (cycles/sec)
>> The patched is 3 times faster than current HEAD.
>>
>> Attached updated patch and the function I used for micro-benchmark.
>> Please review it.
>
> That's a nice speed-up.
>
> How about a preliminary patch that asserts that we never take another
> heavyweight lock while holding a relation extension lock?
>

Agreed. Also, since we disallow holding relation extension locks on more
than one relation at once, I'll add an assertion for that as well.

I think we no longer need to pass the lock level to
UnlockRelationForExtension(). Now that the relation extension lock will be
simple, we can release it in the mode in which it was acquired, as with
LWLocks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
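For reference, the micro-benchmark described up-thread (tight
LockRelationForExtension()/UnlockRelationForExtension() cycles) can be
sketched as a small C function. This is only an illustration of the shape of
such a test, not the function that was actually attached; the name
lock_unlock_bench and its SQL wrapper are made up:

#include "postgres.h"

#include "access/heapam.h"
#include "fmgr.h"
#include "storage/lmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(lock_unlock_bench);

/* acquire and release the relation extension lock in a tight loop so the
 * caller can time lock/unlock cycles per second */
Datum
lock_unlock_bench(PG_FUNCTION_ARGS)
{
	Oid			relid = PG_GETARG_OID(0);
	int64		niters = PG_GETARG_INT64(1);
	Relation	rel;
	int64		i;

	rel = heap_open(relid, AccessShareLock);

	for (i = 0; i < niters; i++)
	{
		LockRelationForExtension(rel, ExclusiveLock);
		UnlockRelationForExtension(rel, ExclusiveLock);
	}

	heap_close(rel, AccessShareLock);

	PG_RETURN_VOID();
}

A matching SQL declaration would be something like
CREATE FUNCTION lock_unlock_bench(regclass, bigint) RETURNS void
AS 'MODULE_PATHNAME' LANGUAGE C STRICT; timed from psql with \timing.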



Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-21 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:18 AM, Andres Freund  wrote:
> On 2017-11-13 19:03:41 -0800, Andres Freund wrote:
>> On 2017-11-03 07:53:30 -0700, Andres Freund wrote:
>> > Here's that patch.  I've stared at this some, and Robert did too. Robert
>> > mentioned that the commit message might need some polish and I'm not
>> > 100% sure about the error message texts yet.
>> >
>> > I'm not yet convinced that the new elog in vacuumlazy can never trigger
>> > - but I also don't think we want to actually freeze the tuple in that
>> > case.
>>
>> I'm fairly sure it could be triggered, therefore I've rewritten that.
>>
>> I've played around quite some with the attached patch. So far, after
>> applying the second patch, neither VACUUM nor VACUUM FULL / CLUSTER make
>> the situation worse for already existing corruption. HOT pruning can
>> change the exact appearance of existing corruption a bit, but I don't
>> think it can make the corruption meaningfully worse.  It's a bit
>> annoying and scary to add so many checks to backbranches but it kinda
>> seems required.  The error message texts aren't perfect, but these are
>> "should never be hit" type elog()s so I'm not too worried about that.
>>
>> Please review!
>
> Ping? Alvaro, it'd be good to get some input here.

Note that I will be able to jump on the ship after being released from
commit fest duties. This is likely a multi-day task for testing and
looking at it, and I am not the most knowledgeable human being with
this code.
-- 
Michael



Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Masahiko Sawada
On Wed, Nov 22, 2017 at 9:03 AM, Kyotaro HORIGUCHI
 wrote:
> At Tue, 21 Nov 2017 20:53:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20171121.205304.90315453.horiguchi.kyot...@lab.ntt.co.jp>
>> At Tue, 21 Nov 2017 20:27:25 +0900, atorikoshi 
>>  wrote in 
>> 
>> > Thanks for reviewing!
>> >
>> >
>> > On 2017/11/21 18:12, Masahiko Sawada wrote:
>> > > On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada
>> > >  wrote:
>> > >> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
>> > >>  wrote:
>> > >>> Hi,
>> > >>>
>> > >>> I put many queries into one transaction and made ReorderBuffer spill
>> > >>> data to disk, and sent SIGKILL to postgres before the end of the
>> > >>> transaction.
>> > >>>
>> > >>> After starting up postgres again, I observed the files spilled to
>> > >>> disk weren't deleted.
>> > >>>
>> > >>> I think these files should be deleted because its transaction was no
>> > >>> more valid, so no one can use these files.
>> > >>>
>> > >>>
>> > >>> Below is a reproduction instructions.
>> > >>>
>> > >>> 
>> > >>> 1. Create table and publication at publiser.
>> > >>>
>> > >>>   @pub =# CREATE TABLE t1(
>> > >>>   id INT PRIMARY KEY,
>> > >>>   name TEXT);
>> > >>>
>> > >>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
>> > >>>
>> > >>> 2. Create table and subscription at subscriber.
>> > >>>
>> > >>>   @sub =# CREATE TABLE t1(
>> > >>>   id INT PRIMARY KEY,
>> > >>>   name TEXT
>> > >>>   );
>> > >>>
>> > >>>   @sub =# CREATE SUBSCRIPTION sub
>> > >>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
>> > >>>   PUBLICATION pub;
>> > >>>
>> > >>> 3. Put many queries into one transaction.
>> > >>>
>> > >>>   @pub =# BEGIN;
>> > >>> INSERT INTO t1
>> > >>> SELECT
>> > >>>   i,
>> > >>>   'aa'
>> > >>> FROM
>> > >>> generate_series(1, 100) as i;
>> > >>>
>> > >>> 4. Then we can see spilled files.
>> > >>>
>> > >>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
>> > >>> state
>> > >>> xid-561-lsn-0-100.snap
>> > >>> xid-561-lsn-0-200.snap
>> > >>> xid-561-lsn-0-300.snap
>> > >>> xid-561-lsn-0-400.snap
>> > >>> xid-561-lsn-0-500.snap
>> > >>> xid-561-lsn-0-600.snap
>> > >>> xid-561-lsn-0-700.snap
>> > >>> xid-561-lsn-0-800.snap
>> > >>> xid-561-lsn-0-900.snap
>> > >>>
>> > >>> 5. Kill publisher's postgres process before COMMIT.
>> > >>>
>> > >>>   @pub $ kill -s SIGKILL [pid of postgres]
>> > >>>
>> > >>> 6. Start publisher's postgres process.
>> > >>>
>> > >>>   @pub $ pg_ctl start -D ${PGDATA}
>> > >>>
>> > >>> 7. After a while, we can see the files remaining.
>> > >>>   (Immediately after starting the publisher, we cannot see these files.)
>> > >>>
>> > >>>   @pub $ pg_ctl start -D ${PGDATA}
>> > >>>
>> > >>>   When I configured with '--enable-cassert', below assertion error
>> > >>>   was appeared.
>> > >>>
>> > >>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File:
>> > >>> "reorderbuffer.c",
>> > >>> Line: 2576)
>> > >>> 
>> > >>>
>> > >>> Attached patch sets final_lsn to the last ReorderBufferChange if
>> > >>> final_lsn == 0.
>> > >>
>> > >> Thank you for the report. I could reproduce this issue with the above
>> > >> step. My analysis is that the cause of a serialized reorder buffer
>> > >> not being cleaned up is that an aborted transaction without an abort WAL
>> > >> record has no chance to set ReorderBufferTXN->final_lsn. So if there
>> > >> are such serialized transactions, ReorderBufferRestoreCleanup cleans up
>> > >> no files, which is the cause of the assertion failure (or a file being
>> > >> orphaned). What do you think?
>> > >>
>> > >> On detail of your patch, I'm not sure it's safe if we set the lsn of
>> > >> other than commit record or abort record to final_lsn. The comment in
>> > >> reorderbuffer.h says,
>> > >>
>> > >> typedef struct ReorderBufferTXN
>> > >> {
>> > >> (snip)
>> > >>
>> > >> /* 
>> > >>  * LSN of the record that lead to this xact to be committed or
>> > >>  * aborted. This can be a
>> > >>  * * plain commit record
>> > >>  * * plain commit record, of a parent transaction
>> > >>  * * prepared transaction commit
>> > >>  * * plain abort record
>> > >>  * * prepared transaction abort
>> > >>  * * error during decoding
>> > >>  * 
>> > >>  */
>> > >> XLogRecPtr  final_lsn;
>> > >>
>> > >> But with your patch, we could set the lsn of a record other than
>> > >> those listed above as final_lsn. One way I came up with is to make
>> >
>> > I added some comments on final_lsn.
>> >
>> > >> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
>> > >> regard it as an aborted transaction that doesn't have an abort WAL
>> > >> record. So we can clean up all serialized files if the final_lsn of a
>> > >> transaction is invalid.
>> > >
>> > >

Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Thomas Munro
On Wed, Nov 22, 2017 at 1:53 PM, Michael Paquier
 wrote:
> On Wed, Nov 22, 2017 at 5:55 AM, Robert Haas  wrote:
>> I have been of the opinion all along that progress monitoring needs to
>> report facts, not theories.  The number of tuples read thus far is a
>> fact, and is fine to report for whatever value it may have to someone.
>> The number of tuples that will be read in the future is a theory, and
>> as you say, progress monitoring is most likely to be used in cases
>> where theory and practice ended up being very different.
>
> +1. We should also never get into things like trying to estimate
> the amount of time remaining to finish a task [1].
>
> [1]: https://www.xkcd.com/612/

+1

That is one reason I made pg_stat_replication.XXX_lag report the lag
of WAL that has been processed, not (say) the time until we catch up.
In some information-poor scenarios it interpolates, which isn't perfect,
but the general idea is that it shows you measurements of the past
(facts), not predictions about the future (theories).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 5:55 AM, Robert Haas  wrote:
> I have been of the opinion all along that progress monitoring needs to
> report facts, not theories.  The number of tuples read thus far is a
> fact, and is fine to report for whatever value it may have to someone.
> The number of tuples that will be read in the future is a theory, and
> as you say, progress monitoring is most likely to be used in cases
> where theory and practice ended up being very different.

+1. We should also never get into things like trying to estimate
the amount of time remaining to finish a task [1].

[1]: https://www.xkcd.com/612/
-- 
Michael



Re: Anybody care about having the verbose form of the tzdata files?

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 9:34 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Tue, Nov 21, 2017 at 6:28 PM, Daniel Gustafsson  wrote:
>>>> On 20 Nov 2017, at 21:38, Tom Lane  wrote:
>>>> Anybody here actually care about reading the zone data files?
>> Perhaps I do. If this set of files gets removed and replaced by the zi
>> file, is it possible to still know easily which files are being
>> removed during a minor upgrade? When doing minor upgrades of a MSI
>> installer (Windows, yeah!), I need to keep track of files that get
>> deleted or a minor upgrade would simply fail. The tweak that I have is
>> to list them and recreate them as empty. The thing is ugly as hell,
>> but I need to be able to track which files are being removed easily.
>> And as far as I am checking, for example taking the rather recent
>> example of Riyadh87 in commit e04641f4, src/timezone/data allows to
>> keep easily track of files removed. If this gets removed, I am pretty
>> convinced that this tracking gets more complicated.
>
> I'm a bit confused.  The files under src/timezone/data/ don't correspond
> to individual installed zone data files; most of them describe a lot of
> zones.  (Riyadh87 and friends were outliers.)  Seems to me that if you
> care about the installed file list, much the easiest way is to run
> "make install" and then look to see what's under share/timezones/.
> That wouldn't change if we use the abbreviated form of the zic input
> data.

Yeah. That's basically what I do when I have a doubt, after seeing an
automated minor upgrade fail or getting a complaint. But the
process is a hassle, and I can get things basically right if I have an
easy reference of what was removed.

> Now, personally, I've long diff'd the old and new timezone/data/ files
> in the process of writing the commit message for a tzdata update.
> I'd have to change that process --- but it was always a pretty tedious and
> obsessive-compulsive way to do it anyway, because most of the diffs are
> comments.  I'd probably just start relying more fully on the IANA
> announcement emails, like this one:
>
> http://mm.icann.org/pipermail/tz-announce/2017-October/47.html
>
> As far as I've seen, they are reliably good about summarizing everything
> you need to know about an update.  They definitely always mention
> additions and removals of zones.

If you add a reference to those upstream announcements in your commit
message, that would be fine for me as well. I don't tend to follow
those folks closely (I really should, I guess).
-- 
Michael



Re: [HACKERS] Issues with logical replication

2017-11-21 Thread Michael Paquier
On Fri, Nov 17, 2017 at 5:12 AM, Robert Haas  wrote:
> On Thu, Nov 16, 2017 at 2:41 PM, Andres Freund  wrote:
>>> To me, it seems like SnapBuildWaitSnapshot() is fundamentally
>>> misdesigned
>>
>> Maybe I'm confused, but why is it fundamentally misdesigned? It's not
>> such an absurd idea to wait for an xid in a WAL record.  I get that
>> there's a race condition here, which obviously bad, but I don't really
>> see as evidence of the above claim.
>>
>> I actually think this code used to be safe because ProcArrayLock used to
>> be held while generating and logging the running snapshots record.  That
>> was removed when fixing some other bug, but perhaps that shouldn't have
>> been done...
>
> OK.  Well, I might be overstating the case.  My comment about
> fundamental misdesign was really just based on the assumption that
> XactLockTableWait() could be used to wait for an XID the instant it
> was generated.  That was never gonna work and there's no obvious clean
> workaround for the problem.  Getting snapshot building to work
> properly seems to be Hard (TM).

The patches discussed here deserve tracking, so please note that I
have added an entry in the CF app:
https://commitfest.postgresql.org/16/1381/
-- 
Michael



Re: Anybody care about having the verbose form of the tzdata files?

2017-11-21 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Nov 21, 2017 at 6:28 PM, Daniel Gustafsson  wrote:
>>> On 20 Nov 2017, at 21:38, Tom Lane  wrote:
>>> Anybody here actually care about reading the zone data files?

>> I doubt there is anyone who cares about that who isn’t already consuming the
>> upstream data.

> Perhaps I do. If this set of files gets removed and replaced by the zi
> file, is it possible to still know easily which files are being
> removed during a minor upgrade? When doing minor upgrades of an MSI
> installer (Windows, yeah!), I need to keep track of files that get
> deleted, or a minor upgrade would simply fail. The tweak that I have is
> to list them and recreate them as empty. The thing is ugly as hell,
> but it means I need an easy way to track which files are being removed.
> And as far as I can see, for example taking the rather recent
> example of Riyadh87 in commit e04641f4, src/timezone/data makes it
> easy to keep track of removed files. If this gets removed, I am pretty
> convinced that this tracking gets more complicated.

I'm a bit confused.  The files under src/timezone/data/ don't correspond
to individual installed zone data files; most of them describe a lot of
zones.  (Riyadh87 and friends were outliers.)  Seems to me that if you
care about the installed file list, much the easiest way is to run
"make install" and then look to see what's under share/timezones/.
That wouldn't change if we use the abbreviated form of the zic input
data.

Now, personally, I've long diff'd the old and new timezone/data/ files
in the process of writing the commit message for a tzdata update.
I'd have to change that process --- but it was always a pretty tedious and
obsessive-compulsive way to do it anyway, because most of the diffs are
comments.  I'd probably just start relying more fully on the IANA
announcement emails, like this one:

http://mm.icann.org/pipermail/tz-announce/2017-October/47.html

As far as I've seen, they are reliably good about summarizing everything
you need to know about an update.  They definitely always mention
additions and removals of zones.

regards, tom lane



Re: Logical Replication and triggers

2017-11-21 Thread Michael Paquier
On Wed, Nov 22, 2017 at 6:35 AM, Robert Haas  wrote:
> On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs  wrote:
>> I would have acted on this myself a few days back if I thought the
>> patch was correct, but I see multiple command counter increments
>> there, so suspect an alternate fix is correct.
>
> Oh, well, I'm glad you said something.  I was actually thinking about
> committing the patch Peter posted in
> http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
> because it looks correct to me, but I'm certainly not an expert on
> that code so I'll wait for your analysis.

There is still some time until the next round of releases. I am seeing
no tracking of this patch in the CF app. It would be good to not lose
track of it. I'll add an entry and reply on the original thread.
-- 
Michael



Re: Anybody care about having the verbose form of the tzdata files?

2017-11-21 Thread Michael Paquier
On Tue, Nov 21, 2017 at 6:28 PM, Daniel Gustafsson  wrote:
>> On 20 Nov 2017, at 21:38, Tom Lane  wrote:
>> Anybody here actually care about reading the zone data files?
>
> I doubt there is anyone who cares about that who isn’t already consuming the
> upstream data.

Perhaps I do. If this set of files gets removed and replaced by the zi
file, is it possible to still know easily which files are being
removed during a minor upgrade? When doing minor upgrades of an MSI
installer (Windows, yeah!), I need to keep track of files that get
deleted, or a minor upgrade would simply fail. The tweak that I have is
to list them and recreate them as empty. The thing is ugly as hell,
but it means I need an easy way to track which files are being removed.
And as far as I can see, for example taking the rather recent
example of Riyadh87 in commit e04641f4, src/timezone/data makes it
easy to keep track of removed files. If this gets removed, I am pretty
convinced that this tracking gets more complicated.
-- 
Michael



Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
At Tue, 21 Nov 2017 20:53:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20171121.205304.90315453.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 21 Nov 2017 20:27:25 +0900, atorikoshi 
>  wrote in 
> 
> > Thanks for reviewing!
> > 
> > 
> > On 2017/11/21 18:12, Masahiko Sawada wrote:
> > > On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada
> > >  wrote:
> > >> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
> > >>  wrote:
> > >>> Hi,
> > >>>
> > >>> I put many queries into one transaction and made ReorderBuffer spill
> > >>> data to disk, and sent SIGKILL to postgres before the end of the
> > >>> transaction.
> > >>>
> > >>> After starting up postgres again, I observed that the files spilled
> > >>> to disk weren't deleted.
> > >>>
> > >>> I think these files should be deleted because their transaction is no
> > >>> longer valid, so no one can use these files.
> > >>>
> > >>>
> > >>> Below are reproduction instructions.
> > >>>
> > >>> 
> > >>> 1. Create table and publication at publisher.
> > >>>
> > >>>   @pub =# CREATE TABLE t1(
> > >>>   id INT PRIMARY KEY,
> > >>>   name TEXT);
> > >>>
> > >>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
> > >>>
> > >>> 2. Create table and subscription at subscriber.
> > >>>
> > >>>   @sub =# CREATE TABLE t1(
> > >>>   id INT PRIMARY KEY,
> > >>>   name TEXT
> > >>>   );
> > >>>
> > >>>   @sub =# CREATE SUBSCRIPTION sub
> > >>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
> > >>>   PUBLICATION pub;
> > >>>
> > >>> 3. Put many queries into one transaction.
> > >>>
> > >>>   @pub =# BEGIN;
> > >>> INSERT INTO t1
> > >>> SELECT
> > >>>   i,
> > >>>   'aa'
> > >>> FROM
> > >>> generate_series(1, 100) as i;
> > >>>
> > >>> 4. Then we can see spilled files.
> > >>>
> > >>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
> > >>> state
> > >>> xid-561-lsn-0-100.snap
> > >>> xid-561-lsn-0-200.snap
> > >>> xid-561-lsn-0-300.snap
> > >>> xid-561-lsn-0-400.snap
> > >>> xid-561-lsn-0-500.snap
> > >>> xid-561-lsn-0-600.snap
> > >>> xid-561-lsn-0-700.snap
> > >>> xid-561-lsn-0-800.snap
> > >>> xid-561-lsn-0-900.snap
> > >>>
> > >>> 5. Kill publisher's postgres process before COMMIT.
> > >>>
> > >>>   @pub $ kill -s SIGKILL [pid of postgres]
> > >>>
> > >>> 6. Start publisher's postgres process.
> > >>>
> > >>>   @pub $ pg_ctl start -D ${PGDATA}
> > >>>
> > >>> 7. After a while, we can see the files remaining.
> > >>>   (Immediately after starting the publisher, we cannot see these files.)
> > >>>
> > >>>   @pub $ pg_ctl start -D ${PGDATA}
> > >>>
> > >>>   When I configured with '--enable-cassert', below assertion error
> > >>>   was appeared.
> > >>>
> > >>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File:
> > >>> "reorderbuffer.c",
> > >>> Line: 2576)
> > >>> 
> > >>>
> > >>> Attached patch sets final_lsn to the last ReorderBufferChange if
> > >>> final_lsn == 0.
> > >>
> > >> Thank you for the report. I could reproduce this issue with the above
> > >> steps. My analysis is that the serialized reorder buffer isn't cleaned
> > >> up because a transaction aborted without an abort WAL record never gets
> > >> a chance to set ReorderBufferTXN->final_lsn. So if there are such
> > >> serialized transactions, ReorderBufferRestoreCleanup cleans up no
> > >> files, which is the cause of the assertion failure (or a file being
> > >> orphaned). What do you think?
> > >>
> > >> On the details of your patch, I'm not sure it's safe to set final_lsn
> > >> to the LSN of anything other than a commit record or abort record. The
> > >> comment in reorderbuffer.h says,
> > >>
> > >> typedef struct ReorderBufferTXN
> > >> {
> > >> (snip)
> > >>
> > >> /* 
> > >>  * LSN of the record that lead to this xact to be committed or
> > >>  * aborted. This can be a
> > >>  * * plain commit record
> > >>  * * plain commit record, of a parent transaction
> > >>  * * prepared transaction commit
> > >>  * * plain abort record
> > >>  * * prepared transaction abort
> > >>  * * error during decoding
> > >>  * 
> > >>  */
> > >> XLogRecPtr  final_lsn;
> > >>
> > >> But with your patch, we could set final_lsn to the LSN of a record
> > >> other than those listed above. One way I came up with is to make
> > 
> > I added some comments on final_lsn.
> > 
> > >> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> > >> regard it as an aborted transaction that doesn't have an abort WAL
> > >> record. So we can clean up all serialized files if the final_lsn of a
> > >> transaction is invalid.
> > >
> > > After more thought, since there is some code cosmetically setting
> > > final_lsn when the fate of the transaction is determined, possibly we
> > > should not accept an invalid value of final_lsn ev
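
For illustration, here is a minimal C sketch of the idea originally proposed
in this thread (set final_lsn from the last in-memory change when it was
never set). The helper name is made up, and note the caveat discussed above:
a transaction whose changes were fully spilled to disk may have an empty
in-memory change list, so this alone does not cover every case.

#include "postgres.h"
#include "replication/reorderbuffer.h"

/*
 * Sketch only: an implicitly aborted transaction never produced a commit
 * or abort record, so its final_lsn is still invalid.  Fall back to the
 * LSN of the last change still tracked in memory, giving
 * ReorderBufferRestoreCleanup an end point to work with.
 */
static void
ReorderBufferSetFinalLSNFromLastChange(ReorderBufferTXN *txn)
{
    if (txn->final_lsn == InvalidXLogRecPtr && !dlist_is_empty(&txn->changes))
    {
        ReorderBufferChange *last;

        last = dlist_tail_element(ReorderBufferChange, node, &txn->changes);
        txn->final_lsn = last->lsn;
    }
}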

Re: [HACKERS] Avoiding OOM in a hash join with many duplicate inner keys

2017-11-21 Thread Thomas Munro
On Wed, Mar 8, 2017 at 1:29 PM, Tom Lane  wrote:
> Thomas Munro  writes:
>> I have been wondering about a couple of different worst case execution
>> strategies that would be better than throwing our hands up and
>> potentially exploding memory once we detect that further partitioning
>> is not going to help, if we still manage to reach that case despite
>> adding stats-based defences like this due to statistics being missing,
>> bad or confused by joins below us.
>
> Yeah, it would definitely be nice if we could constrain the runtime
> space consumption better.
>
>> 1.  We could probe the fraction of the hash table that we have managed
>> to load into work_mem so far and then rewind the outer batch and do it
>> again for the next work_mem-sized fraction of the inner batch and so
>> on.  For outer joins we'd need to scan for unmatched tuples after each
>> hash table refill.
>
> I do not understand how that works for a left join?  You'd need to track
> whether a given outer tuple has been matched in any one of the fractions
> of the inner batch, so that when you're done with the batch you could know
> which outer tuples need to be emitted null-extended.  Right now we only
> need to track that state for the current outer tuple, but in a rescan
> situation we'd have to remember it for each outer tuple in the batch.
>
> Perhaps it could be done by treating the outer batch file as read/write
> and incorporating a state flag in each tuple; or to reduce write volume,
> maintaining a separate outer batch file parallel to the main one with just
> a bool or even just a bit per outer tuple.  Seems messy though.

Right.  Messy.  I think what I described may fall under the category
of "block nested loop".  It looks doable but not very appealing for
left joins, and performance seems not great, multiplying the probing
scans by the number of fragments.  Whether we actually care about
performance at all when we've reached this emergency state and are
primarily concerned with completing the query I'm not entirely sure.

Another idea would be to identify the offending bucket (how?) and
spill it to disk in its own file, and track it by pushing a special
control object with a distinguishing header flag into the hash table
(or a new overflow table, or extend the duties of the skew table,
or...).  We'd have to deal with the matched flags of spilled inner
tuples for right/full joins.  Matching is really per-key, not
per-tuple, so if there is a control object in memory for each of these
"overflow" buckets then perhaps it could hold the matched flag that
covers all tuples with each distinct key.  What I like about this is
that it doesn't change the join algorithm at all, it just bolts on a
per-bucket escape valve.  The changes might be quite localised, though
I know someone who probably wouldn't like an extra branch in
ExecScanHashBucket().

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] More stats about skipped vacuums

2017-11-21 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
 wrote:
> By the way, I'm uneasy that 'last_vacuum_index_scans' (and
> vacuum_fail_count in 0002 and others in 0003, 0004) covers both the
> VACUUM command and autovacuum, while last_vacuum and vacuum_count
> cover only the command. Splitting it into vacuum/autovacuum seems
> like nonsense, but the name is confusing. Do you have any ideas?

Hm. I think that you should actually have two fields, one for manual
vacuum and one for autovacuum, because each is tied to respectively
maintenance_work_mem and autovacuum_work_mem. This way admins are able
to tune each one of those parameters depending on a look at
pg_stat_all_tables. So those should be named perhaps
last_vacuum_index_scans and last_autovacuum_index_scans?
-- 
Michael



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-21 Thread Peter Geoghegan
On Tue, Nov 21, 2017 at 7:29 AM, Robert Haas  wrote:
> Hash joins are a place where we could have a smoother cost function
> than we do.

Yes, it definitely is.

> When we run out of memory, instead of switching from
> (say) a single batch to two batches, switch to 64 batches, but
> initially keep 63 of them in memory and only write the very last one
> to disk.  Every time we again run out of memory, dump another batch to
> disk.  If we end up dumping more than half or so of the batches to
> disk, switch to an even larger number of batches to make it even more
> fine-grained.

That could work.
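
As a toy, self-contained illustration of that gradual spilling scheme (plain
C, not PostgreSQL code; all names are made up): start with only the last
batch on disk, and on each out-of-memory event evict exactly one more batch,
so memory use shrinks in small steps instead of halving all at once.

#include <stdbool.h>
#include <stdio.h>

#define NBATCH 64

static bool in_memory[NBATCH];  /* which batches are currently kept in memory */

/* On memory pressure, evict the highest-numbered batch still in memory. */
static int
evict_one_batch(void)
{
    for (int i = NBATCH - 1; i > 0; i--)
    {
        if (in_memory[i])
        {
            in_memory[i] = false;
            return i;           /* batch i now spills to disk */
        }
    }
    return -1;                  /* only batch 0 left; nothing more to evict */
}

int
main(void)
{
    for (int i = 0; i < NBATCH; i++)
        in_memory[i] = true;
    in_memory[NBATCH - 1] = false;      /* initially only the last batch is on disk */

    /* simulate a few successive "ran out of memory" events */
    for (int pressure = 0; pressure < 5; pressure++)
        printf("spilling batch %d\n", evict_one_batch());
    return 0;
}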

> That having been said, I think the place where our plans most commonly
> go wrong is where we incorrectly estimate the number of tuples by
> multiple orders of magnitude - 100x is common, 1000x is common, a
> million x is not uncommon, even a billion x is not unheard-of.  And I
> don't think there's any way to make a hash join happy if it thinks
> it's going to need 1 batch and it ends up needing a million batches.

What about dynamic role reversal? That could make a big difference.

> At that, even if the cost function is very smooth, you've moved so far
> along the curve that you're probably not in a good place.  So, while I
> think that smoothing out the cost functions is a good idea, I think we
> also need to consider what more can be done to improve the estimates -
> and especially to avoid estimates that are off by huge multiples.

I agree that it would be enormously valuable if we could make
estimates much better, so I think that I understand why you emphasize
it. But, I don't think that there are any good ideas for improving
join selectivity that don't involve expert DBA knowledge, or
novel/risky techniques for feedback to the system about column
redundancy/correlation, etc. These do not seem like scalable
approaches, and so they don't particularly appeal to me as projects.
I'd be happy to be shown to be wrong about this.

OTOH, techniques like dynamic role reversal, for when there are many
batches and it's faster to flip the outer and inner side do seem
promising. It's probably possible to come up with a more or less
unambiguous improvement, without layering complexity. I suspect that
this technique is widely implemented, and will cut down on cases
leading to terrible performance to a significant degree. I should try
to talk Thomas into working on it.

-- 
Peter Geoghegan



Re: [HACKERS] GnuTLS support

2017-11-21 Thread Michael Paquier
On Tue, Nov 21, 2017 at 11:45 PM, Andreas Karlsson  wrote:
> There is a function called gnutls_session_channel_binding() which seems to
> do something very similar to SSL_get*_finished() which has been in GnuTLS
> since 2.12.
>
> https://www.gnutls.org/manual/gnutls.html#Channel-Bindings

Nice. This way you can get tls-unique to work properly.
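
For reference, a minimal sketch of using that call to fetch the tls-unique
bytes; the wrapper name is made up, the GnuTLS API is the one linked above.

#include <gnutls/gnutls.h>

/*
 * Sketch: fetch the tls-unique channel binding from an established
 * session.  On success, cb->data is allocated by GnuTLS and should be
 * released by the caller with gnutls_free().
 */
static int
get_tls_unique_binding(gnutls_session_t session, gnutls_datum_t *cb)
{
    return gnutls_session_channel_binding(session, GNUTLS_CB_TLS_UNIQUE, cb);
}
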
-- 
Michael



Re: Logical Replication and triggers

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 4:28 PM, Simon Riggs  wrote:
> I would have acted on this myself a few days back if I thought the
> patch was correct, but I see multiple command counter increments
> there, so suspect an alternate fix is correct.

Oh, well, I'm glad you said something.  I was actually thinking about
committing the patch Peter posted in
http://postgr.es/m/619c557d-93e6-1833-1692-b010b176f...@2ndquadrant.com
because it looks correct to me, but I'm certainly not an expert on
that code so I'll wait for your analysis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: default range partition and constraint exclusion

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 4:36 AM, Amit Langote
 wrote:
>>> The attached will make the constraint to look like:
>>
>> Uh, if the constraint exclusion logic we're using is drawing false
>> conclusions, we need to fix it so it doesn't, not change the
>> constraint so that the wrong logic gives the right answer.
>
> I did actually consider it, but ended up concluding that constraint
> exclusion is doing all it can using the provided list partition constraint
> clauses.
>
> If all predicate_refuted_by() receives is the expression tree (AND/OR)
> with individual nodes being strict clauses involving partition keys (and
> nothing about the nullness of the keys), the downstream code is just
> playing by the rules as explained in the header comment of
> predicate_refuted_by_recurse() in concluding that query's restriction
> clause a = 2 refutes it.

Oh, wait a minute.  Actually, I think predicate_refuted_by() is doing
the right thing here.  Isn't the problem that mc2p2 shouldn't be
accepting a (2, null) tuple at all?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Simon Riggs
On 21 November 2017 at 16:13, Thomas Rosenstein
 wrote:

> To weigh in here, I actually find it's a big hurdle
>
> I'm a postgres user and not a postgres dev, so I definitely have the feeling
> I'm not qualified to answer whether this really does what it's intended to do.
> Furthermore, not being in the process, it will take me probably 2 - 3
> hours (if I have time) to figure out everything I should do and how I should
> do it;
> somebody doing this regularly might take 5 minutes.
>
> Yes it fixes my replication issues, yes it seems to work on the first look,
> but what does it really do - no idea!

Fair enough, but that's not what Craig asked is it?

Seems like he gave a helpful, long explanation of how you can help
expedite the process of fixing the bug, to which he received no reply.

We're trying to ensure that bugs are fixed, but also take care not to
introduce further bugs or false fixes that don't move us forwards.

I would have acted on this myself a few days back if I thought the
patch was correct, but I see multiple command counter increments
there, so suspect an alternate fix is correct.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Peter Geoghegan
On Tue, Nov 21, 2017 at 12:55 PM, Robert Haas  wrote:
> I agree.
>
> I have been of the opinion all along that progress monitoring needs to
> report facts, not theories.  The number of tuples read thus far is a
> fact, and is fine to report for whatever value it may have to someone.

That makes a lot of sense to me. I sometimes think that we're too
hesitant to expose internal information due to concerns about it being
hard to interpret. I see wait events as bucking this trend, which I
welcome. We see similar trends in the Linux kernel, with tools like
perf and BCC/eBPF now being regularly used to debug production issues.

> The number of tuples that will be read in the future is a theory, and
> as you say, progress monitoring is most likely to be used in cases
> where theory and practice ended up being very different.

You hit the nail on the head here.

It's not that these things are not difficult to interpret - the
concern itself is justified. It just needs to be weighed against the
benefit of having some instrumentation to start with. People are much
more likely to complain about obscure debug information, which makes
them feel dumb, than they are to complain about the absence of any
instrumentation, but I still think that the latter is the bigger
problem.

Besides, you don't necessarily have to understand something to act on
it. The internals of Oracle are trade secrets, but they were the first
to have wait events, I think. At least having something that you can
Google can make all the difference.

-- 
Peter Geoghegan



Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 21, 2017 at 3:28 PM, Tom Lane  wrote:
>> or just
>> ... latency limit: 33/33 (100.000 %)

> Oh, yeah.  That last one sounds good; no reason to print the same
> value more than once.

Sold; I'll go make it so.

regards, tom lane



Re: Logical Replication and triggers

2017-11-21 Thread Thomas Rosenstein
> On Tue, Nov 21, 2017 at 3:29 PM, Simon Riggs  wrote:
>>> You realize we're talking about a bug fix, right?  And for a feature
>>> that was developed and committed by your colleagues?
>>
>> Craig is asking Thomas to confirm the proposed bug fix works. How is
>> this not normal?
>
> That's not exactly how I read Craig's email, but it's of course
> difficult to know what tone somebody intended from an email.  Suffice
> it to say that I think "please pay attention to this proposed bug-fix
> patch" is a pretty legitimate request.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


To weigh in here, I actually find it's a big hurdle

I'm a postgres user and not a postgres dev, so I definitely have the
feeling I'm not qualified to answer whether this really does what it's
intended to do.
Furthermore, not being in the process, it will take me probably 2 - 3
hours (if I have time) to figure out everything I should do and how I
should do it;

somebody doing this regularly might take 5 minutes.

Yes it fixes my replication issues, yes it seems to work on the first 
look, but what does it really do - no idea!





Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Peter Geoghegan
On Mon, Oct 2, 2017 at 6:04 AM, Robert Haas  wrote:
> Progress reporting on sorts seems like a tricky problem to me, as I
> said before.  In most cases, a sort is going to involve an initial
> stage where it reads all the input tuples and writes out quicksorted
> runs, and then a merge phase where it merges all the output tapes into
> a sorted result.  There are some complexities; for example, if the
> number of tapes is really large, then we might need multiple merge
> phases, only the last of which will produce tuples.

This would ordinarily be the point at which I'd say "but you're very
unlikely to require multiple passes for an external sort these days".
But I won't say that on this thread, because CLUSTER generally has
unusually wide tuples, and so is much more likely to be I/O bound, to
require multiple passes, etc. (I bet the v10 enhancements
disproportionately improved CLUSTER performance.)

-- 
Peter Geoghegan



Re: [HACKERS] Replication status in logical replication

2017-11-21 Thread Masahiko Sawada
On Tue, Nov 14, 2017 at 6:46 AM, Thomas Munro
 wrote:
> On Tue, Sep 26, 2017 at 3:45 PM, Masahiko Sawada  
> wrote:
>> On Tue, Sep 26, 2017 at 10:36 AM, Vaishnavi Prabakaran
>>  wrote:
>>> On Wed, Sep 13, 2017 at 9:59 AM, Daniel Gustafsson  wrote:
 I’m not entirely sure why this was flagged as “Waiting for Author” by the
 automatic run, the patch applies for me and builds so resetting back to
 “Needs review”.

>>>
>>> This patch applies and build cleanly and I did a testing with one publisher
>>> and one subscriber, and confirm that the replication state after restarting
>>> the server now is "streaming" and not "Catchup".
>>>
>>> And, I don't find any issues with code and patch to me is ready for
>>> committer, marked the same in cf entry.
>
> Hi Sawada-san,
>
> My patch-testing robot doesn't like this patch[1].  I just tried it on
> my laptop to double-check and get some more details, and saw the same
> failures:
>
> (1) "make check" under src/test/recovery fails like this:
>
> t/006_logical_decoding.pl  2/16 # Looks like your test
> exited with 29 just after 4.
> t/006_logical_decoding.pl  Dubious, test returned 29
> (wstat 7424, 0x1d00)
> Failed 12/16 subtests
>
> regress_log_006_logical_decoding says:
>
> ok 4 - got same expected output from pg_recvlogical decoding session
> pg_recvlogical timed out at
> /opt/local/lib/perl5/vendor_perl/5.24/IPC/Run.pm line 2918.
>  waiting for endpos 0/1609B60 with stdout '', stderr '' at
> /Users/munro/projects/postgres/src/test/recovery/../../../src/test/perl/PostgresNode.pm
> line 1700.
> ### Stopping node "master" using mode immediate
> # Running: pg_ctl -D
> /Users/munro/projects/postgres/src/test/recovery/tmp_check/t_006_logical_decoding_master_data/pgdata
> -m immediate stop
> waiting for server to shut down done
> server stopped
> # No postmaster PID for node "master"
> # Looks like your test exited with 29 just after 4.
>
> (2) "make check" under src/test/subscription says:
>
> t/001_rep_changes.pl .. ok
> t/002_types.pl  #
> # Looks like your test exited with 60 before it could output anything.
> t/002_types.pl  Dubious, test returned 60 (wstat 15360, 0x3c00)
> Failed 3/3 subtests
> t/003_constraints.pl ..
>
> Each of those took several minutes, and I stopped it there.  It may
> be going to say some more things but is taking a very long time
> (presumably timing out, but the 001 took ages and then succeeded...
> hmm).  In fact I had to run this on my laptop to see that because on
> Travis CI the whole test job just gets killed after 10 minutes of
> non-output and the above output was never logged because of the way
> concurrent test jobs' output is buffered.
>
> I didn't try to figure out what is going wrong.
>

Thank you for the notification!

After investigation, I found out that my previous patch went in the wrong
direction. I should have changed XLogSendLogical() so that we check the
read LSN and set WalSndCaughtUp = true even after reading a record,
without waiting. The attached updated patch passes 'make check-world'.
Please review it.
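
Roughly, and purely as a sketch of the idea rather than of the attached
patch, the change in XLogSendLogical() might look like the fragment below
(surrounding walsender.c code elided; comparing against GetFlushRecPtr() is
an assumption about how the caught-up check could be done):

record = XLogReadRecord(logical_decoding_ctx->reader, logical_startptr, &errm);
logical_startptr = InvalidXLogRecPtr;

if (errm != NULL)
    elog(ERROR, "%s", errm);

if (record != NULL)
{
    LogicalDecodingProcessRecord(logical_decoding_ctx,
                                 logical_decoding_ctx->reader);
    sentPtr = logical_decoding_ctx->reader->EndRecPtr;
}

/*
 * Sketch: consider ourselves caught up as soon as everything flushed so
 * far has been read, instead of only when a read returns no record.
 */
if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
    WalSndCaughtUp = true;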

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


logical_repl_caught_up_v2.patch
Description: Binary data


Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Merlin Moncure
On Tue, Nov 21, 2017 at 2:16 PM, Merlin Moncure  wrote:
> On Tue, Nov 21, 2017 at 12:20 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
 I am very much looking at the new stored procedure functionality and
 imaging a loop like this:

 LOOP
   FOR r IN SELECT * FROM pg_get_notifications(30)
   LOOP
 PERFORM do_stuff(r);
   END LOOP;
   COMMIT;  -- advance xmin etc
 END LOOP;
>>
>>> Yeah, if you keep the timeout fairly short, it would probably work OK
>>> (with Peter's stuff).
>>
>> Traditionally, NOTIFY messages are delivered to the client only between
>> transactions, so that there is no question about whether the
>> message-delivery should roll back if the surrounding transaction aborts.
>> It's not very clear to me what the behavior of pg_get_notifications()
>> inside a transaction ought to be.  Is it OK if it's a volatile function
>> and the messages are just gone once the function has returned them,
>> even if you fail to do anything about them because your transaction
>> fails later?
>
> I think destroying upon consumption is OK.  There are a lot of
> mitigation strategies to deal with that issue and NOTIFY is for
> signalling, not queuing.
>
>> (I'd be against having a function that returns more than one at a time,
>> in any case, as that just complicates matters even more.)

Hm, a less controversial approach might be to only allow consumption
of notifications that were delivered at the start of transaction.
Procedures could then issue an intervening 'COMMIT' statement to pick
up new notifications.  There'd be no reason for a timeout argument in
that case obviously, so the end user would have to poll in order to
pick up the notification, which I don't like.   This would be an
alternative approach to the way I do it today, which is to poll for a
set table flag in a non-serializable transaction, maybe with enough
differentiation in use to merit introduction.

merlin



Re: [HACKERS] [PATCH] Incremental sort

2017-11-21 Thread Thomas Munro
On Tue, Nov 21, 2017 at 1:00 PM, Alexander Korotkov
 wrote:
> On Mon, Nov 20, 2017 at 12:24 AM, Thomas Munro
>  wrote:
>> Is there some reason not to use ApplySortComparator for this?  I think
>> you're missing out on lower-overhead comparators, and in any case it's
>> probably good code reuse, no?
>
> However, for the incremental sort case we don't need to know here whether
> A > B or B > A.  It's enough for us to know whether A = B or A != B.  In
> some cases that's way cheaper.  For instance, for text an equality check is
> basically memcmp, while a comparison may use the collation.

Ah, right, of course.

>> I gather that you have
>> determined empirically that it's better to be able to sort groups of
>> at least MIN_GROUP_SIZE than to be able to skip the comparisons on the
>> leading attributes, but why is that the case?
>
> Right.  The issue is that not only the case of one tuple per group causes
> overhead; a few tuples per group (like 2 or 3) is also a case of overhead.
> Also, the overhead is related not only to sorting.  While investigating the
> regression case provided by Heikki [1], I've seen extra time spent mostly in
> extra copying of the sample tuple and comparison with it.  In order to cope
> with this overhead I've introduced MIN_GROUP_SIZE, which allows us to skip
> copying sample tuples too frequently.

I see.  I wonder if there could ever be a function like
ExecMoveTuple(dst, src).  Given the polymorphism involved it'd be
slightly complicated and you'd probably have a general case that just
copies the tuple to dst and clears src, but there might be a bunch of
cases where you can do something more efficient like moving a pointer
and pin ownership.  I haven't really thought that through and
there may be fundamental problems with it...

If you're going to push the tuples into the sorter every time, then I
guess there are some special cases that could allow future
optimisations: (1) if you noticed that every prefix was different, you
can skip the sort operation (that is, you can use the sorter as a dumb
tuplestore and just get the tuples out in the same order you put them
in; not sure if Tuplesort supports that but it presumably could), (2)
if you noticed that every prefix was the same (that is, you have only
one prefix/group in the sorter) then you could sort only on the suffix
(that is, you could somehow tell Tuplesort to ignore the leading
columns), (3) as a more complicated optimisation for intermediate
group sizes 1 < n < MIN_GROUP_SIZE, you could somehow number the
groups with an integer that increments whenever you see the prefix
change, and somehow tell tuplesort.c to use that instead of the
leading columns.  Ok, that last one is probably hard but the first two
might be easier...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 12:05 PM, Antonin Houska  wrote:
> Robert Haas  wrote:
>> On Wed, Aug 30, 2017 at 10:12 PM, Tatsuro Yamada
>>  wrote:
>> >   1. scanning heap
>> >   2. sort tuples
>>
>> These two phases overlap, though. I believe progress reporting for
>> sorts is really hard.  In the simple case where the data fits in
>> work_mem, none of the work of the sort gets done until all the data is
>> read.  Once you switch to an external sort, you're writing batch
>> files, so a lot of the work is now being done during data loading.
>> But as the number of batch files grows, the final merge at the end
>> becomes an increasingly noticeable part of the cost, and eventually
>> you end up needing multiple merge passes.  I think we need some smart
>> way to report on sorts so that we can tell how much of the work has
>> really been done, but I don't know how to do it.
>
> Whatever complexity is hidden in the sort, cost_sort() should have taken it
> into consideration when called via plan_cluster_use_sort(). Thus I think that
> once we have both startup and total cost, the current progress of the sort
> stage can be estimated from the current number of input and output
> rows. Please remind me if my proposal appears to be too simplistic.

I think it is far too simplistic.  If the sort is being fed by a
sequential scan, reporting the number of blocks scanned so far as
compared to the total number that will be scanned would be a fine way
of reporting on the progress of the sequential scan -- and it's better
to use blocks, which we know for sure about, than rows, at which we
can only guess.  But that's the *scan* progress, not the *sort*
progress.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 12:25 PM, Tom Lane  wrote:
> Antonin Houska  writes:
>> Robert Haas  wrote:
>>> These two phases overlap, though. I believe progress reporting for
>>> sorts is really hard.
>
>> Whatever complexity is hidden in the sort, cost_sort() should have taken it
>> into consideration when called via plan_cluster_use_sort(). Thus I think that
>> once we have both startup and total cost, the current progress of the sort
>> stage can be estimated from the current number of input and output
>> rows. Please remind me if my proposal appears to be too simplistic.
>
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted.  And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so.  I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

I agree.

I have been of the opinion all along that progress monitoring needs to
report facts, not theories.  The number of tuples read thus far is a
fact, and is fine to report for whatever value it may have to someone.
The number of tuples that will be read in the future is a theory, and
as you say, progress monitoring is most likely to be used in cases
where theory and practice ended up being very different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Combine function returning NULL unhandled?

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 10:36 PM, Andres Freund  wrote:
> The plain transition case contains:
> if (pergroupstate->transValueIsNull)
> {
> /*
>  * Don't call a strict function with NULL inputs.  Note it is
>  * possible to get here despite the above tests, if the transfn is
>  * strict *and* returned a NULL on a prior cycle. If that happens
>  * we will propagate the NULL all the way to the end.
>  */
> return;
> }
>
> how come similar logic is not present for combine functions? I don't see
> any checks preventing a combinefunc from returning NULL, nor do I see
> https://www.postgresql.org/docs/devel/static/sql-createaggregate.html
> spell out a requirement that that not be the case.

I don't know of a reason why that logic shouldn't be present for the
combine-function case as well.  It seems like it should be pretty
straightforward to write a test that hits that case and watch it blow
up ... assuming it does, then I guess we should back-patch the
addition of that logic.
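
For what it's worth, a sketch of what the missing guard might look like,
simply mirroring the transition-case snippet quoted above; its exact
placement in the combine path of nodeAgg.c and the surrounding variable
names are assumptions.

if (pergroupstate->transValueIsNull)
{
    /*
     * Don't call a strict combine function with NULL inputs.  As in the
     * plain transition case, a strict combinefn that returned NULL on a
     * prior cycle should just propagate the NULL all the way to the end.
     */
    return;
}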

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Tomas Vondra

On 11/21/2017 09:28 PM, Ildus K wrote:
>> Hmmm, it still doesn't work for me. See this:
>>
>> test=# create extension pg_lz4 ;
>> CREATE EXTENSION
>> test=# create table t_lz4 (v text compressed lz4);
>> CREATE TABLE
>> test=# create table t_pglz (v text);
>> CREATE TABLE
>> test=# insert into t_lz4 select repeat(md5(1::text),300);
>> INSERT 0 1
>> test=# insert into t_pglz select * from t_lz4;
>> INSERT 0 1
>> test=# drop extension pg_lz4 cascade;
>> NOTICE:  drop cascades to 2 other objects
>> DETAIL:  drop cascades to compression options for lz4
>> drop cascades to table t_lz4 column v
>> DROP EXTENSION
>> test=# \c test
>> You are now connected to database "test" as user "user".
>> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
>> test=# select * from t_pglz ;
>> ERROR:  cache lookup failed for compression options 16419
>>
>> That suggests no recompression happened.
> 
> I will check that. Is your extension published somewhere?
> 

No, it was just an experiment, so I've only attached it to the initial
review. Attached is an updated version, with a fix or two.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pg_lz4.tgz
Description: application/compressed-tar


Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 3:28 PM, Tom Lane  wrote:
> Seems like a good idea, but the way you've written it is inconsistent
> with the "n/m" notation used just above.  I'd suggest
>
> ... latency limit: 33 (33/33, 100.000 %)
>
> or just
>
> ... latency limit: 33/33 (100.000 %)

Oh, yeah.  That last one sounds good; no reason to print the same
value more than once.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 3:29 PM, Simon Riggs  wrote:
>> You realize we're talking about a bug fix, right?  And for a feature
>> that was developed and committed by your colleagues?
>
> Craig is asking Thomas to confirm the proposed bug fix works. How is
> this not normal?

That's not exactly how I read Craig's email, but it's of course
difficult to know what tone somebody intended from an email.  Suffice
it to say that I think "please pay attention to this proposed bug-fix
patch" is a pretty legitimate request.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Logical Replication and triggers

2017-11-21 Thread Simon Riggs
On 21 November 2017 at 13:27, Robert Haas  wrote:
> On Sat, Nov 18, 2017 at 7:30 PM, Craig Ringer  wrote:
>> On 15 November 2017 at 21:12, Thomas Rosenstein
>>  wrote:
>>> I would like somebody to consider Petr Jelineks patch for worker.c from
>>> here
>>> (https://www.postgresql.org/message-id/619c557d-93e6-1833-1692-b010b176ff77%402ndquadrant.com)
>>>
>>> I'm was facing the same issue with 10.1 and BEFORE INSERT OR UPDATE
>>> triggers.
>>
>> Please:
>>
>> - Apply it to current HEAD
>> - Test its functionality
>> - Report back on the patch thread
>> - Update the commitfest app with your results and sign on as a reviewer
>> - If you're able, read over the patch and make any comments you can
>>
>> "Somebody" needs to be you, if you want this functionality.
>
> You realize we're talking about a bug fix, right?  And for a feature
> that was developed and committed by your colleagues?

Craig is asking Thomas to confirm the proposed bug fix works. How is
this not normal?

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Ildus K
On Tue, 21 Nov 2017 18:47:49 +0100
Tomas Vondra  wrote:

 
> 
> I propose to use either
> 
>CompressionMethodOptions (and CompressionMethodRoutine)
> 
> or
> 
>CompressionOptions (and CompressionRoutine)

Sounds good, thanks.

> 
> OK. But then I don't understand why tsvector.c does things like
> 
> VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
> VARRAWSIZE_4B_C(data) - arrsize
> 
> instead of
> 
> VARSIZE_ANY_EXHDR(data) - arrsize
> VARSIZE_ANY(data) - arrsize
> 
> Seems somewhat confusing.
> 

VARRAWSIZE_4B_C returns the original size of the data, before compression
(from va_rawsize in current postgres, and from va_info in my patch), not
the size of the already compressed data, so you can't use VARSIZE_ANY here.

VARSIZE_ANY_EXHDR in current postgres returns VARSIZE - VARHDRSZ whether
the varlena is compressed or not, so I just kept this behavior for
custom compressed varlenas too. If you look into tuptoaster.c you will
also see lines like 'VARSIZE(attr) - TOAST_COMPRESS_HDRSZ'. So I think
that if VARSIZE_ANY_EXHDR is to subtract different header sizes, then it
should subtract them for ordinary compressed varlenas too.
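
As a quick reference for the macros being compared here (these are existing
macros from postgres.h; the function and variable names below are only for
illustration, assuming a 4-byte-header compressed varlena):

#include "postgres.h"

static void
show_compressed_varlena_sizes(struct varlena *data)
{
    Size total_stored  = VARSIZE(data);           /* length word + rawsize word + compressed bytes */
    Size after_header  = VARSIZE_ANY_EXHDR(data); /* VARSIZE(data) - VARHDRSZ; works for any varlena */
    Size original_size = VARRAWSIZE_4B_C(data);   /* va_rawsize: size of the uncompressed data */

    elog(DEBUG1, "stored=%zu exhdr=%zu raw=%zu",
         total_stored, after_header, original_size);
}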

> >   
> 
> Hmmm, it still doesn't work for me. See this:
> 
> test=# create extension pg_lz4 ;
> CREATE EXTENSION
> test=# create table t_lz4 (v text compressed lz4);
> CREATE TABLE
> test=# create table t_pglz (v text);
> CREATE TABLE
> test=# insert into t_lz4 select repeat(md5(1::text),300);
> INSERT 0 1
> test=# insert into t_pglz select * from t_lz4;
> INSERT 0 1
> test=# drop extension pg_lz4 cascade;
> NOTICE:  drop cascades to 2 other objects
> DETAIL:  drop cascades to compression options for lz4
> drop cascades to table t_lz4 column v
> DROP EXTENSION
> test=# \c test
> You are now connected to database "test" as user "user".
> test=# insert into t_lz4 select repeat(md5(1::text),300);^C
> test=# select * from t_pglz ;
> ERROR:  cache lookup failed for compression options 16419
> 
> That suggests no recompression happened.

I will check that. Is your extension published somewhere?




Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 20, 2017 at 1:40 PM, Tom Lane  wrote:
>> I dunno, it just looks odd to me that when we've set up a test case in
>> which every one of the transactions is guaranteed to exceed the latency
>> limit, that it doesn't say that they all did.  I don't particularly buy
>> your assumption that the percentages should sum.  Anybody else have an
>> opinion there?

> I agree with you, but I don't think either approach is free from
> possible confusion.  I think it would help to show the numerator and
> the denominator explicitly, e.g.:

> number of clients: 1
> number of threads: 1
> number of transactions per client: 100
> number of transactions actually processed: 33/100
> number of transactions skipped: 67 (67.000 %)
> number of transactions above the 1.0 ms latency limit: 33 (33 of 33, 100.000 %)

> (My proposed change is in the last line.)

Seems like a good idea, but the way you've written it is inconsistent
with the "n/m" notation used just above.  I'd suggest

... latency limit: 33 (33/33, 100.000 %)

or just

... latency limit: 33/33 (100.000 %)

regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada  wrote:
> Attached is an updated version of the patch. I've moved only relation
> extension locks out of the heavyweight lock manager, as per the
> discussion so far.
>
> I've done a write-heavy benchmark on my laptop: loading 24kB of data into
> one table using COPY with 1 client, for 10 seconds. The throughput of the
> patched version is 10% better than current HEAD. The results of 5 runs are
> the following.
>
> - PATCHED -
> tps = 178.791515 (excluding connections establishing)
> tps = 176.522693 (excluding connections establishing)
> tps = 168.705442 (excluding connections establishing)
> tps = 158.158009 (excluding connections establishing)
> tps = 161.145709 (excluding connections establishing)
>
> - HEAD -
> tps = 147.079803 (excluding connections establishing)
> tps = 149.079540 (excluding connections establishing)
> tps = 149.082275 (excluding connections establishing)
> tps = 148.255376 (excluding connections establishing)
> tps = 145.542552 (excluding connections establishing)
>
> Also I've done a micro-benchmark; calling LockRelationForExtension and
> UnlockRelationForExtension tightly in order to measure the number of
> lock/unlock cycles per second. The result is,
> PATCHED = 3.95892e+06 (cycles/sec)
> HEAD = 1.15284e+06 (cycles/sec)
> The patched version is about 3 times faster than current HEAD.
>
> Attached updated patch and the function I used for micro-benchmark.
> Please review it.

That's a nice speed-up.

How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?
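
As a sketch of what such an assertion could look like: the flag and helper
below are hypothetical and would need to be maintained by
LockRelationForExtension()/UnlockRelationForExtension(), while LOCKTAG,
LOCKTAG_RELATION_EXTEND and Assert() are existing symbols; the helper would
be called at the top of LockAcquire().

#include "postgres.h"
#include "storage/lock.h"

/* Hypothetical flag, set while a relation extension lock is held. */
static bool holdingExtensionLock = false;

/*
 * Hypothetical helper: while extending a relation, acquiring any other
 * heavyweight lock would trip this assertion in assert-enabled builds.
 */
static inline void
AssertNoHeavyweightLockWhileExtending(const LOCKTAG *locktag)
{
    Assert(!holdingExtensionLock ||
           locktag->locktag_type == LOCKTAG_RELATION_EXTEND);
}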

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] pgbench regression test failure

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 1:40 PM, Tom Lane  wrote:
> I dunno, it just looks odd to me that when we've set up a test case in
> which every one of the transactions is guaranteed to exceed the latency
> limit, that it doesn't say that they all did.  I don't particularly buy
> your assumption that the percentages should sum.  Anybody else have an
> opinion there?

I agree with you, but I don't think either approach is free from
possible confusion.  I think it would help to show the numerator and
the denominator explicitly, e.g.:

number of clients: 1
number of threads: 1
number of transactions per client: 100
number of transactions actually processed: 33/100
number of transactions skipped: 67 (67.000 %)
number of transactions above the 1.0 ms latency limit: 33 (33 of 33, 100.000 %)

(My proposed change is in the last line.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Merlin Moncure
On Tue, Nov 21, 2017 at 12:20 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>>> I am very much looking at the new stored procedure functionality and
>>> imaging a loop like this:
>>>
>>> LOOP
>>>   FOR r IN SELECT * FROM pg_get_notifications(30)
>>>   LOOP
>>> PERFORM do_stuff(r);
>>>   END LOOP;
>>>   COMMIT;  -- advance xmin etc
>>> END LOOP;
>
>> Yeah, if you keep the timeout fairly short, it would probably work OK
>> (with Peter's stuff).
>
> Traditionally, NOTIFY messages are delivered to the client only between
> transactions, so that there is no question about whether the
> message-delivery should roll back if the surrounding transaction aborts.
> It's not very clear to me what the behavior of pg_get_notifications()
> inside a transaction ought to be.  Is it OK if it's a volatile function
> and the messages are just gone once the function has returned them,
> even if you fail to do anything about them because your transaction
> fails later?

I think destroying upon consumption is OK.  There are a lot of
mitigation strategies to deal with that issue and NOTIFY is for
signalling, not queuing.

> (I'd be against having a function that returns more than one at a time,
> in any case, as that just complicates matters even more.)

ok.

merlin



Re: [HACKERS] Parallel Append implementation

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 6:57 AM, amul sul  wrote:
> Doing the following change on the v19 patch fixes it for me:
>
> --- a/src/backend/executor/nodeAppend.c
> +++ b/src/backend/executor/nodeAppend.c
> @@ -489,11 +489,9 @@ choose_next_subplan_for_worker(AppendState *node)
> }
>
> /* Pick the plan we found, and advance pa_next_plan one more time. */
> -   node->as_whichplan = pstate->pa_next_plan;
> +   node->as_whichplan = pstate->pa_next_plan++;
> if (pstate->pa_next_plan == node->as_nplans)
> pstate->pa_next_plan = append->first_partial_plan;
> -   else
> -   pstate->pa_next_plan++;
>
> /* If non-partial, immediately mark as finished. */
> if (node->as_whichplan < append->first_partial_plan)
>
> The attached patch applies the same change to Amit's ParallelAppend_v19_rebased.patch.

Yes, that looks like a correct fix.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: protocol version negotiation (Re: Libpq PGRES_COPY_BOTH - version compatibility)

2017-11-21 Thread Robert Haas
On Sun, Nov 19, 2017 at 7:49 PM, Badrul Chowdhury  wrote:
>>> I spent a little more time looking at this patch today.  I think that the
>>> patch should actually send NegotiateProtocolVersion when *either* the
>>> requested version differs from the latest one we support *or* an
>>> unsupported protocol option is present.  Otherwise, you only find out
>>> about unsupported protocol options if you also request a newer minor
>>> version, which isn't good, because it makes it hard to add new protocol
>>> options *without* bumping the protocol version.
>
> It makes sense from a maintainability point of view.
>
>>> Here's an updated version with that change and a proposed commit message.
>
> I have tested the new patch and it works great. The comments look good as 
> well.

Committed and back-patched to all supported branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] SQL procedures

2017-11-21 Thread Andrew Dunstan


On 11/20/2017 04:25 PM, I wrote:
> I've been through this fairly closely, and I think it's pretty much
> committable. The only big item that stands out for me is the issue of
> OUT parameters.
>
> While returning multiple result sets will be a useful feature, it's also
> very common in my experience for stored procedures to have scalar out
> params as well. I'm not sure how we should go about providing for it,
> but I think we need to be sure we're not closing any doors.
>
> Here, for example, is how the MySQL stored procedure feature works with
> JDBC:
> 
> I think it will be OK if we use cursors to return multiple result sets,
> along the lines of Peter's next patch, but we shouldn't regard that as
> the end of the story. Even if we don't provide for it in this go round
> we should aim at eventually providing for stored procedure OUT params.
>
>
>



Of course it's true that we could replace a scalar OUT parameter with a
one-row resultset if we can return multiple resultsets from SPs. But
it's different from the common use pattern and a darn sight more
cumbersome to use.

cheers

andrew

-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Logical Replication and triggers

2017-11-21 Thread Robert Haas
On Sat, Nov 18, 2017 at 7:30 PM, Craig Ringer  wrote:
> On 15 November 2017 at 21:12, Thomas Rosenstein
>  wrote:
>> I would like somebody to consider Petr Jelineks patch for worker.c from
>> here
>> (https://www.postgresql.org/message-id/619c557d-93e6-1833-1692-b010b176ff77%402ndquadrant.com)
>>
>> I'm was facing the same issue with 10.1 and BEFORE INSERT OR UPDATE
>> triggers.
>
> Please:
>
> - Apply it to current HEAD
> - Test its functionality
> - Report back on the patch thread
> - Update the commitfest app with your results and sign on as a reviewer
> - If you're able, read over the patch and make any comments you can
>
> "Somebody" needs to be you, if you want this functionality.

You realize we're talking about a bug fix, right?  And for a feature
that was developed and committed by your colleagues?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Chapman Flack
On 11/21/2017 11:49 AM, Tom Lane wrote:

> AFAICS, XMLSERIALIZE in our current implementation boils down to
> being a binary-compatible coercion from XML (which is stored as
> a string) to text.  So the interesting question here is where are
> you getting the XML values from?  The stability of the results is
> going to be whatever the next level down does.

Well, constructed using xmlelement and xmlattributes in a big query.
The structure of the query does not change from one run to the next.

So as long as the internal XML form is essentially already serialized,
I guess it comes down to what xmlattributes(...) inside xmlelement
produces. If that is stable, say in the order of the attribute
arguments, then that probably fits the bill.

I don't see that clearly addressed in the doc for xmlattributes
either. Should something be added to the docs, it's probably worth
mentioning at XMLSERIALIZE anyway, keeping the fact that the internal
form is already serialized as more of an implementation detail.

-Chap


select
xmlserialize(document xmlroot(
  xmlelement(name top,
xmlattributes(
  foo,
  bar,
  baz
),
xmlelement(name things,
  (
  select
xmlagg(
  xmlelement(name thing,
xmlattributes(
  name,
  importance,
  awesomeness,
  fluidity
),
case when comment is not null then
  xmlelement(name comment, comment)
end,
(
select
  xmlagg(
xmlelement(name property,
  xmlattributes(setting)
)
  )
from
  unnest(properties) as q(setting)
),
(
select
  xmlagg(
xmlelement(name referenced,
  xmlattributes(
linksrc as source
  )
)
  )
from
  (
  select distinct
...



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>> I am very much looking at the new stored procedure functionality and
>> imaging a loop like this:
>> 
>> LOOP
>>   FOR r IN SELECT * FROM pg_get_notifications(30)
>>   LOOP
>> PERFORM do_stuff(r);
>>   END LOOP;
>>   COMMIT;  -- advance xmin etc
>> END LOOP;

> Yeah, if you keep the timeout fairly short, it would probably work OK
> (with Peter's stuff).

Traditionally, NOTIFY messages are delivered to the client only between
transactions, so that there is no question about whether the
message-delivery should roll back if the surrounding transaction aborts.
It's not very clear to me what the behavior of pg_get_notifications()
inside a transaction ought to be.  Is it OK if it's a volatile function
and the messages are just gone once the function has returned them,
even if you fail to do anything about them because your transaction
fails later?

(I'd be against having a function that returns more than one at a time,
in any case, as that just complicates matters even more.)

regards, tom lane



Re: View with duplicate GROUP BY entries

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 1:11 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 21, 2017 at 12:05 PM, Tom Lane  wrote:
>>> Yeah, we probably ought to make more of an effort to regenerate the
>>> original query wording.  I do not think that forcing positional notation
>>> is a suitable answer in this case, because it would result in converting
>>> SQL-standard queries to nonstandard ones.
>
>> Who cares?  The other end is presumptively PostgresSQL, because this
>> is postgres_fdw.
>
> No, you missed the context.  Yes, the original problem is in postgres_fdw,
> and there indeed it seems fine to emit GROUP BY 1,2.  What Ashutosh is
> pointing out is that ruleutils.c can emit a representation of a view
> that fails to preserve its original semantics, thus causing dump/reload
> problems that have nothing at all to do with FDWs.

Oh, woops.  Sorry.

> And what I'm pointing
> out is that we don't like pg_dump to emit nonstandard representations
> of objects that were created with perfectly standard-compliant queries;
> therefore emitting GROUP BY 1,2 isn't good if the query wasn't spelled
> like that to begin with.

That's not an unreasonable goal, but I'm not sure how much effort I'd
personally be willing to expend on it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: View with duplicate GROUP BY entries

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Tue, Nov 21, 2017 at 12:05 PM, Tom Lane  wrote:
>> Yeah, we probably ought to make more of an effort to regenerate the
>> original query wording.  I do not think that forcing positional notation
>> is a suitable answer in this case, because it would result in converting
>> SQL-standard queries to nonstandard ones.

> Who cares?  The other end is presumptively PostgresSQL, because this
> is postgres_fdw.

No, you missed the context.  Yes, the original problem is in postgres_fdw,
and there indeed it seems fine to emit GROUP BY 1,2.  What Ashutosh is
pointing out is that ruleutils.c can emit a representation of a view
that fails to preserve its original semantics, thus causing dump/reload
problems that have nothing at all to do with FDWs.  And what I'm pointing
out is that we don't like pg_dump to emit nonstandard representations
of objects that were created with perfectly standard-compliant queries;
therefore emitting GROUP BY 1,2 isn't good if the query wasn't spelled
like that to begin with.

regards, tom lane



Re: [COMMITTERS] pgsql: Add hash partitioning.

2017-11-21 Thread Robert Haas
On Mon, Nov 20, 2017 at 7:46 AM, amul sul  wrote:
> Thanks for fixing this function.  Patch looks good to me, except that the
> column number in the following error message should be 2.
>
> 354 +SELECT satisfies_hash_partition('mchash'::regclass, 2, 1,
> NULL::int, NULL::int);
> 355 +ERROR:  column 1 of the partition key has type "text", but
> supplied value is of type "integer"
>
> Same at lines 374 and 401 in the patch.

Oops.  Looks like the indexing should be 1-based rather than 0-based.
Committed with that change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 11:32 AM, Merlin Moncure  wrote:
>> I think that wouldn't work very well, because I think we must have a
>> snapshot open in order to run pg_get_notifications(), and that means
>> we're holding back the system-wide xmin.
>
> I am very much looking at the new stored procedure functionality and
> imagining a loop like this:
>
> LOOP
>   FOR r IN SELECT * FROM pg_get_notifications(30)
>
>   LOOP
> PERFORM do_stuff(r);
>   END LOOP;
>
>   COMMIT;  -- advance xmin etc
> END LOOP;
>
> ...I'm obviously speculatively thinking ahead to Peter's stored
> procedure work seeing the light of day (which, based on the utility vs
> the simplicity of the patch and how it works in testing I'm very
> optimistic about).  The above would provide real time response to
> certain actions I do now with polling, typically in bash.  Without
> stored procedures, I agree that this would be a foot gun.

Yeah, if you keep the timeout fairly short, it would probably work OK
(with Peter's stuff).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: View with duplicate GROUP BY entries

2017-11-21 Thread Robert Haas
On Tue, Nov 21, 2017 at 12:05 PM, Tom Lane  wrote:
> Ashutosh Bapat  writes:
>> While reviewing patch for similar problem in postgres_fdw [1], I
>> noticed that we don't use positional notation while creating the view.
>> This might introduce anomalies when GROUP BY entries are
>> non-immutable.
>
> Yeah, we probably ought to make more of an effort to regenerate the
> original query wording.  I do not think that forcing positional notation
> is a suitable answer in this case, because it would result in converting
> SQL-standard queries to nonstandard ones.

Who cares?  The other end is presumptively PostgreSQL, because this
is postgres_fdw.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Custom compression methods

2017-11-21 Thread Tomas Vondra
Hi,

On 11/21/2017 03:47 PM, Ildus Kurbangaliev wrote:
> On Mon, 20 Nov 2017 00:04:53 +0100
> Tomas Vondra  wrote:
> 
> ...
>
>> 6) I'm rather confused by AttributeCompression vs.
>> ColumnCompression. I mean, attribute==column, right? Of course, one
>> is for data from parser, the other one is for internal info. But
>> can we make the naming clearer?
> 
> For now I have renamed AttributeCompression to CompressionOptions,
> not sure that's a good name but at least it gives less confusion.
> 

I propose to use either

   CompressionMethodOptions (and CompressionMethodRoutine)

or

   CompressionOptions (and CompressionRoutine)

>>
>> 7) The docs in general are somewhat unsatisfactory, TBH. For example
>> the ColumnCompression has no comments, unlike everything else in
>> parsenodes. Similarly for the SGML docs - I suggest to expand them to
>> resemble FDW docs
>> (https://www.postgresql.org/docs/10/static/fdwhandler.html) which
>> also follows the handler/routines pattern.
> 
> I've added more comments. I think I'll add more documentation if the
> committers approve the current syntax.
> 

OK. Haven't reviewed this yet.

>>
>> 8) One of the unclear things if why we even need 'drop' routing. It
>> seems that if it's defined DropAttributeCompression does something.
>> But what should it do? I suppose dropping the options should be done
>> using dependencies (just like we drop columns in this case).
>>
>> BTW why does DropAttributeCompression mess with att->attisdropped in
>> this way? That seems a bit odd.
> 
> The 'drop' routine could be useful. An extension could do something
> related to the attribute, like remove extra tables or something
> else. The compression options will not be removed after unlinking the
> compression method from a column, because compressed data may still be
> stored in that column.
> 

OK. So something like a "global" dictionary used for the column, or
something like that? Sure, seems useful and I've been thinking about
that, but I think we badly need some extension using that, even if in a
very simple way. Firstly, we need a "how to" example, secondly we need
some way to test it.

>>
>> 13) When writing the experimental extension, I was extremely
>> confused about the regular varlena headers, custom compression
>> headers, etc. In the end I stole the code from tsvector.c and
>> whacked it a bit until it worked, but I wouldn't dare to claim I
>> understand how it works.
>>
>> This needs to be documented somewhere. For example postgres.h has
>> a bunch of paragraphs about varlena headers, so perhaps it should
>> be there? I see the patch tweaks some of the constants, but does
>> not update the comment at all.
> 
> This point is good; I'm not sure how this documentation should look.
> I've just assumed that people should have a deep understanding of
> varlenas if they're going to compress them. But right now it's easy to
> make a mistake there. Maybe I should add some functions that help to
> construct varlenas with different headers. I like the way jsonb is
> constructed. It uses StringInfo and there are a few helper functions
> (reserveFromBuffer, appendToBuffer and others). Maybe they should not
> be static.
> 

Not sure. My main problem was not understanding how this affects the
varlena header, etc. And I had no idea where to look.

>>
>> Perhaps it would be useful to provide some additional macros
>> making access to custom-compressed varlena values easier. Or
>> perhaps the VARSIZE_ANY / VARSIZE_ANY_EXHDR / VARDATA_ANY already
>> support that? This part is not very clear to me.
> 
> These macros will work, custom compressed varlenas behave like old
> compressed varlenas.
> 

OK. But then I don't understand why tsvector.c does things like

VARSIZE(data) - VARHDRSZ_CUSTOM_COMPRESSED - arrsize
VARRAWSIZE_4B_C(data) - arrsize

instead of

VARSIZE_ANY_EXHDR(data) - arrsize
VARSIZE_ANY(data) - arrsize

Seems somewhat confusing.
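
For the record, here is a minimal sketch (mine, not from the patch) of how I'd
expect extension code to use the generic macros, assuming a plain text argument;
the function name is made up:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(payload_length);

/*
 * Return the data length of a text value, header excluded.  The *_ANY
 * macros work whether the detoasted value carries a 1-byte or a 4-byte
 * header, which is why they're generally preferred over plain VARSIZE().
 */
Datum
payload_length(PG_FUNCTION_ARGS)
{
	/* PG_GETARG_TEXT_PP() detoasts but may leave a short header in place */
	text	   *t = PG_GETARG_TEXT_PP(0);

	/* data pointer and length that are valid for any header layout */
	const char *data = VARDATA_ANY(t);
	int32		len = VARSIZE_ANY_EXHDR(t);

	(void) data;				/* not used further in this sketch */

	PG_RETURN_INT32(len);
}

If that pattern is supposed to keep working unchanged for custom-compressed
values, then saying so explicitly in the postgres.h comments would already
help a lot.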

>>> Still it's a problem if the user used for example `SELECT 
>>>  INTO * FROM *` because postgres will copy 
>>> compressed tuples, and there will not be any dependencies
>>> between destination and the options.
>>>   
>>
>> This seems like a rather fatal design flaw, though. I'd say we need
>> to force recompression of the data, in such cases. Otherwise all
>> the dependency tracking is rather pointless.
> 
> Fixed this problem too. I've added recompression for datum that use 
> custom compression.
> 

Hmmm, it still doesn't work for me. See this:

test=# create extension pg_lz4 ;
CREATE EXTENSION
test=# create table t_lz4 (v text compressed lz4);
CREATE TABLE
test=# create table t_pglz (v text);
CREATE TABLE
test=# insert into t_lz4 select repeat(md5(1::text),300);
INSERT 0 1
test=# insert into t_pglz select * from t_lz4;
INSERT 0 1
test=# drop extension pg_lz4 cascade;
NOTICE:  drop cascades to 2 other objects
DETAIL:  drop cascades to compression options for lz4
drop cascades to tab

Re: to_typemod(type_name) information function

2017-11-21 Thread Stephen Frost
Greetings, Sophie!

* Sophie Herold (sophi...@hemio.de) wrote:
> I did not find any advice on how to choose a new OID for pg_proc.

(Haven't looked at the patch itself yet really, but wanted to answer
this.)

The main thing is to not duplicate the OID, which you can avoid by
calling 'unused_oids' in src/include/catalog.  That will then return a
list of OIDs that haven't been used yet.  Generally speaking, for a case
where you only need one OID, grabbing one from any of the blocks listed
is fine, though it doesn't hurt to check and see what the nearby used
OIDs were for and if there might be some reason to keep a particular OID
free for future use (just for grouping convenience with other related
things).

Generally though, it's not something you have to worry about too much,
just try to avoid duplicating them.  Even then, if you do, most likely
the committer who picks the patch up will realize it and adjust
accordingly.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: to_typemod(type_name) information function

2017-11-21 Thread Sophie Herold
Hi,

the following patch allows retrieving the typemod. Without this patch,
it does not seem to be possible to generate the first column.

  SELECT format_type(to_regtype(t), pg_to_typemod(t)),
 format_type(to_regtype(t), NULL)
  FROM (VALUES
('INTERVAL SECOND (5)'),
('Varchar(17)'),
('timestamptz (2)')) AS x(t);
            format_type          |        format_type
  -------------------------------+---------------------------
   interval second(5)            | interval
   character varying(17)         | character varying
   timestamp(2) with time zone   | timestamp with time zone

I did not find any advice on how to choose a new OID for pg_proc.

Best,
Sophie
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 698daf69ea..b0211fcad2 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17126,6 +17126,11 @@ SELECT pg_type_is_visible('myschema.widget'::regtype);
get the path in the file system that this tablespace is located in
   
   
+   pg_to_typemod(type_name)
+   integer
+   get the typemod of the named type
+  
+  
pg_typeof(any)
regtype
get the data type of any value
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 1980ff5ac7..ac7407325a 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -30,6 +30,7 @@
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "pgstat.h"
+#include "parser/parse_type.h"
 #include "parser/scansup.h"
 #include "postmaster/syslogger.h"
 #include "rewrite/rewriteHandler.h"
@@ -1002,3 +1003,24 @@ pg_get_replica_identity_index(PG_FUNCTION_ARGS)
 	else
 		PG_RETURN_NULL();
 }
+
+/*
+ * pg_to_typemod - extracts typemod from "typename"
+ *
+ * This function supplements to_regtype to obtain the required arguments for
+ * format_type(type_oid, typemod).
+ */
+Datum
+pg_to_typemod(PG_FUNCTION_ARGS)
+{
+	char	   *typ_name = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	Oid			oid;
+	int32		result;
+
+	parseTypeString(typ_name, &oid, &result, true);
+
+  if (OidIsValid(oid))
+	  PG_RETURN_INT32(result);
+	else
+	  PG_RETURN_NULL();
+}
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c969375981..c40603770b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -1202,6 +1202,8 @@ DATA(insert OID = 972  (  hashbpcharextended PGNSP PGUID 12 1 0 0 0 f f f f t f
 DESCR("hash");
 DATA(insert OID = 1081 (  format_type	   PGNSP PGUID 12 1 0 0 0 f f f f f f s s 2 0 25 "26 23" _null_ _null_ _null_ _null_ _null_ format_type _null_ _null_ _null_ ));
 DESCR("format a type oid and atttypmod to canonical SQL");
+DATA(insert OID = 4569 (  pg_to_typemod	   PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 23 "25" _null_ _null_ _null_ _null_ _null_ pg_to_typemod _null_ _null_ _null_ ));
+DESCR("get the typemod of the named type");
 DATA(insert OID = 1084 (  date_in		   PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 1082 "2275" _null_ _null_ _null_ _null_ _null_ date_in _null_ _null_ _null_ ));
 DESCR("I/O");
 DATA(insert OID = 1085 (  date_out		   PGNSP PGUID 12 1 0 0 0 f f f f t f s s 1 0 2275 "1082" _null_ _null_ _null_ _null_ _null_ date_out _null_ _null_ _null_ ));
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index 130a0e4be3..0584b22155 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -133,3 +133,26 @@ ERROR:  function num_nulls() does not exist
 LINE 1: SELECT num_nulls();
^
 HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
+---
+--- pg_to_typemod()
+---
+SELECT format_type(to_regtype(t), pg_to_typemod(t)) FROM (VALUES ('INTERVAL SECOND (5)'), ('Varchar(17)'), ('timestamptz (2)')) x(t);
+ format_type 
+-
+ interval second(5)
+ character varying(17)
+ timestamp(2) with time zone
+(3 rows)
+
+SELECT pg_to_typemod('int');
+ pg_to_typemod 
+---
+-1
+(1 row)
+
+SELECT pg_to_typemod('"Unknown Type"') IS NULL;
+ ?column? 
+--
+ t
+(1 row)
+
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 1a20c1f765..24d89df1e5 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -29,3 +29,11 @@ SELECT num_nulls(VARIADIC '{}'::int[]);
 -- should fail, one or more arguments is required
 SELECT num_nonnulls();
 SELECT num_nulls();
+
+---
+--- pg_to_typemod()
+---
+
+SELECT format_type(to_regtype(t), pg_to_typemod(t)) FROM (VALUES ('INTERVAL SECOND (5)'), ('Varchar(17)'), ('timestamptz (2)')) x(t);
+SELECT pg_to_typemod('int');
+SELECT pg_to_typemod('"Unknown Type"') IS NULL;


Re: View with duplicate GROUP BY entries

2017-11-21 Thread Tom Lane
Ashutosh Bapat  writes:
> While reviewing patch for similar problem in postgres_fdw [1], I
> noticed that we don't use positional notation while creating the view.
> This might introduce anomalies when GROUP BY entries are
> non-immutable.

Yeah, we probably ought to make more of an effort to regenerate the
original query wording.  I do not think that forcing positional notation
is a suitable answer in this case, because it would result in converting
SQL-standard queries to nonstandard ones.  We might have to extend the
parsetree representation so that we can tell whether positional notation
was used to begin with.

regards, tom lane



Re: Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Tom Lane
Chapman Flack  writes:
> Suppose I have a query that generates some XML content, and I want
> to do this on a periodic schedule and check the resulting XML into
> a version control system.
> ...
> But then, there are the attributes of elements. Order of attributes
> is not significant in XML, and is not required (by the "XML Infoset"
> standard) to be preserved. Nevertheless, it would be a useful
> property (for a purpose like I've described) if XMLSERIALIZE were
> known to at least produce the attributes in some consistent order
> across evaluations of the same query.

> Is that true of the implementation in PostgreSQL? I might find out
> with a quick test, but it seemed worth explicitly asking.

AFAICS, XMLSERIALIZE in our current implementation boils down to
being a binary-compatible coercion from XML (which is stored as
a string) to text.  So the interesting question here is where are
you getting the XML values from?  The stability of the results is
going to be whatever the next level down does.

regards, tom lane



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Andres Freund
Hi,

On 2017-11-21 10:16:58 -0500, Tom Lane wrote:
> I'm also wondering about folding CaseTestExpr and CoerceToDomainValue
> into the same mechanism.  It's not very hard to see those cases as
> being the same as a function-based lambda.

Yea, that'd be good. The current mechanisms is uh, historically grown.

- Andres



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Andres Freund
On 2017-11-21 09:59:00 -0500, Robert Haas wrote:
> On Thu, Nov 16, 2017 at 2:51 PM, Andres Freund  wrote:
> > Right, but it doesn't sound that hard to introduce. Basically there'd need 
> > to be a WithParamValue node,  that first evaluates parameters and then 
> > executes the child expression. I'm thinking of doing this hierarchically so 
> > there's less issues with the setting of the param value being moved away 
> > from the child expression using it.
> 
> I don't quite follow the need for this.  I mean, if we just stick a
> Param reference in there and create a corresponding InitPlan, the
> Param will be evaluated on demand, right?  Is the point of the new
> node to make sure that the Param gets re-evaluated when needed?

It'll work in some of those cases. But you e.g. can't use an InitPlan to
reference a table's columns without some major contortions, no?
Inlining stuff like
  SELECT * FROM foo WHERE sql_func(foo.blarg, anotherfunc());
is pretty important for some use cases. The re-evaluation concern is also
there; this should also work with e.g. volatile parameters like
nextval(), without establishing a second mechanism to deal with them.

Greetings,

Andres Freund



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Merlin Moncure
On Tue, Nov 21, 2017 at 7:59 AM, Robert Haas  wrote:
> On Fri, Nov 17, 2017 at 9:49 AM, Merlin Moncure  wrote:
>> Currently the only way that I know of to consume async notifications
>> via SQL (as opposed to a client application) is via dblink_get_notify.
>> This method isn't very good; it requires some extra support coding,
>> eats a connection and a backend, and doesn't have any timeout
>> facilities.  The lack of a good facility to do this will become more
>> troublesome if/when Peter's recent fantastic work to implement stored
>> procedures in the database gets accepted; asynchronous notifications
>> could be a more efficient mechanism for backend processes to signal
>> each other than the current method of signalling via fields in a
>> table.
>>
>> A good interface might look something like:
>> pg_get_notifications(
>>   TimeOut INT DEFAULT 0,
>>   notify_name  OUT TEXT,
>>   payload OUT TEXT,
>>   pid OUT INT) RETURNS SETOF RECORD AS...
>>
>> The function would return immediately by default, or until TimeOut
>> seconds transpired.  We'd still have to poll internally, so that
>> signals could be checked etc, but this would be a nice way to consume
>> notifications without any dependencies -- what do you think?
>
> I think that wouldn't work very well, because I think we must have a
> snapshot open in order to run pg_get_notifications(), and that means
> we're holding back the system-wide xmin.

I am very much looking at the new stored procedure functionality and
imagining a loop like this:

LOOP
  FOR r IN SELECT * FROM pg_get_notifications(30)

  LOOP
PERFORM do_stuff(r);
  END LOOP;

  COMMIT;  -- advance xmin etc
END LOOP;

...I'm obviously speculatively thinking ahead to Peter's stored
procedure work seeing the light of day (which, based on the utility vs
the simplicity of the patch and how it works in testing I'm very
optimistic about).  The above would provide real time response to
certain actions I do now with polling, typically in bash.  Without
stored procedures, I agree that this would be a foot gun.
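
As an aside, the client-application route mentioned above is basically the
classic libpq loop below (a minimal sketch; the connection string and channel
name are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder */
	PGresult   *res;
	PGnotify   *note;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexec(conn, "LISTEN my_channel");	/* placeholder channel */
	PQclear(res);

	for (;;)
	{
		int			sock = PQsocket(conn);
		fd_set		input_mask;

		FD_ZERO(&input_mask);
		FD_SET(sock, &input_mask);

		/* sleep until something arrives on the connection */
		if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
			break;

		PQconsumeInput(conn);
		while ((note = PQnotifies(conn)) != NULL)
		{
			printf("notify \"%s\" from pid %d, payload \"%s\"\n",
				   note->relname, note->be_pid, note->extra);
			PQfreemem(note);
		}
	}

	PQfinish(conn);
	return 0;
}

That works fine from an application; the point here is that there's no
equivalent callable from server-side code.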

merlin



Does XMLSERIALIZE output xmlattributes in a stable order?

2017-11-21 Thread Chapman Flack
Suppose I have a query that generates some XML content, and I want
to do this on a periodic schedule and check the resulting XML into
a version control system.

To avoid spurious diffs, I know I can control the order of child
elements generated by xmlagg by slipping an ORDER BY into the
aggregate expression.

But then, there are the attributes of elements. Order of attributes
is not significant in XML, and is not required (by the "XML Infoset"
standard) to be preserved. Nevertheless, it would be a useful
property (for a purpose like I've described) if XMLSERIALIZE were
known to at least produce the attributes in some consistent order
across evaluations of the same query.

Is that true of the implementation in PostgreSQL? I might find out
with a quick test, but it seemed worth explicitly asking.

This is subtle enough that, if it's true, it is probably worth
mentioning in the docs. (If it isn't true, it might even be worth
making it true, then mentioning it in the docs.) While [XML Infoset]
does say that an element's attributes are an "unordered set",
[XQuery and XPath Data Model (XDM)] says "the order of Attribute Nodes
is stable but implementation dependent", and it's the latter document
that's referenced by [XSLT and XQuery Serialization], which is the
standard upon which XMLSERIALIZE is defined in [SQL/XML]. (Phew!)

-Chap



Re: Treating work_mem as a shared resource (Was: Parallel Hash take II)

2017-11-21 Thread Robert Haas
On Fri, Nov 17, 2017 at 9:22 PM, Peter Geoghegan  wrote:
> I think that it's reasonable for us to make it a goal of the executor
> to have operations that have a smooth cost function, in order to
> manage the risk of misestimation well, and to make it a goal to have
> operations that are otherwise adaptive to misestimation.

Hash joins are a place where we could have a smoother cost function
than we do.  When we run out of memory, instead of switching from
(say) a single batch to two batches, switch to 64 batches, but
initially keep 63 of them in memory and only write the very last one
to disk.  Every time we again run out of memory, dump another batch to
disk.  If we end up dumping more than half or so of the batches to
disk, switch to an even larger number of batches to make it even more
fine-grained.  The current system is bad because you jump from
spooling NO tuples to a tuplestore to spooling HALF of the inner AND
outer tuples to a tuplestore.  If the hash table is just a little too
big to fit, we could write 1/64 or 2/64 or 3/64 of the inner and outer
tuples to a tuplestore instead of HALF of them, which would be a huge
win.
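
In toy form (not actual executor code; the batch representation and spill
routine below are stand-ins), the idea is roughly:

#include <stdbool.h>
#include <stddef.h>

#define NBATCH 64

typedef struct Batch
{
	size_t		bytes;			/* memory used by tuples held in this batch */
	bool		on_disk;		/* has this batch been spilled already? */
} Batch;

/* stand-in for writing a batch's tuples out to a temp file */
static void
spill_batch(Batch *b)
{
	b->on_disk = true;
	b->bytes = 0;
}

/*
 * Called whenever total memory use crosses the limit: evict resident
 * batches one at a time, highest batch number first, keeping batch 0
 * (the batch currently being probed) in memory.
 */
static void
shrink_to_fit(Batch batches[NBATCH], size_t *mem_used, size_t mem_limit)
{
	int			i;

	for (i = NBATCH - 1; i > 0 && *mem_used > mem_limit; i--)
	{
		if (!batches[i].on_disk)
		{
			*mem_used -= batches[i].bytes;
			spill_batch(&batches[i]);
		}
	}
}

Each eviction costs only that one batch's worth of writes, so the total cost
grows gradually with the size of the misestimate.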

That having been said, I think the place where our plans most commonly
go wrong is where we incorrectly estimate the number of tuples by
multiple orders of magnitude - 100x is common, 1000x is common, a
million x is not uncommon, even a billion x is not unheard-of.  And I
don't think there's any way to make a hash join happy if it thinks
it's going to need 1 batch and it ends up needing a million batches.
At that, even if the cost function is very smooth, you've moved so far
along the curve that you're probably not in a good place.  So, while I
think that smoothing out the cost functions is a good idea, I think we
also need to consider what more can be done to improve the estimates -
and especially to avoid estimates that are off by huge multiples.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: Re: User defined data types in Logical Replication

2017-11-21 Thread Huong Dangminh
Sorry for not replying sooner.

> > Attached draft patch fixed this issue, at least on my environment.
> 
> It works good for me.
> 
> > Please review it.
> 
> I will review it soon.

There is one more case where a user-defined data type is not supported in
Logical Replication: when the remote data type's name does not exist on the
subscriber.

In relation.c:logicalrep_typmap_gettypname we look up the OID in the syscache
by the remote data type's name and map it; if it does not exist in the
syscache, we are faced with the error below.

    if (!OidIsValid(entry->typoid))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("data type \"%s.%s\" required for logical replication does not exist",
                        entry->nspname, entry->typname)));

I think it is not necessary to check typoid here in order to avoid the above
case, is that right?
I attached a patch based on Sawada-san's patch with the messages modified a
bit and the above check removed.
Could somebody check it for me, or should I add it to the CF?


---
Thanks and best regards,
Dang Minh Huong
NEC Solution Innovators, Ltd.
http://www.nec-solutioninnovators.co.jp/en/



fix_slot_store_error_callback_V2.patch
Description: fix_slot_store_error_callback_V2.patch


Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Tom Lane
I wrote:
> Robert Haas  writes:
>> I don't quite follow the need for this.  I mean, if we just stick a
>> Param reference in there and create a corresponding InitPlan, the
>> Param will be evaluated on demand, right?  Is the point of the new
>> node to make sure that the Param gets re-evaluated when needed?

> Yeah.  By default, an initplan is only going to be evaluated once per
> query, not once per row.  It's possible that you could set up Param
> dependencies to force the correct semantics, but that would be more
> complicated, and probably less performant, than introducing a new
> expression node type.  Also, I'm dubious that it could be made to work
> at all for standalone expression evaluation; Param-driven re-evaluation
> is only considered when running a plan tree.

BTW, thinking about that a bit more, standalone expressions (eg CHECK
constraints) are possibly going to need some work anyway to make this go
through.  I do not think there's any mechanism in place at the moment
to set up a ParamListInfo array for such an expression.  While it's
conceivable that the expression eval machinery could be set up to not
need one, I'm not real sure, if we try to make these things be
PARAM_EXEC Params.

This suggests that maybe the Params used for LambdaExprs should be a
distinct kind of Param, not the same PARAM_EXEC type that's used for
Params supplied externally to eval of a particular expression.  That
would reflect the fact that expression eval handles them completely
internally.

I'm also wondering about folding CaseTestExpr and CoerceToDomainValue
into the same mechanism.  It's not very hard to see those cases as
being the same as a function-based lambda.

regards, tom lane



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Tom Lane
Robert Haas  writes:
> On Thu, Nov 16, 2017 at 2:51 PM, Andres Freund  wrote:
>> Right, but it doesn't sound that hard to introduce. Basically there'd
>> need to be a WithParamValue node,  that first evaluates parameters and
>> then executes the child expression.

> I don't quite follow the need for this.  I mean, if we just stick a
> Param reference in there and create a corresponding InitPlan, the
> Param will be evaluated on demand, right?  Is the point of the new
> node to make sure that the Param gets re-evaluated when needed?

Yeah.  By default, an initplan is only going to be evaluated once per
query, not once per row.  It's possible that you could set up Param
dependencies to force the correct semantics, but that would be more
complicated, and probably less performant, than introducing a new
expression node type.  Also, I'm dubious that it could be made to work
at all for standalone expression evaluation; Param-driven re-evaluation
is only considered when running a plan tree.

regards, tom lane



Re: Inlining functions with "expensive" parameters

2017-11-21 Thread Robert Haas
On Thu, Nov 16, 2017 at 2:51 PM, Andres Freund  wrote:
> Right, but it doesn't sound that hard to introduce. Basically there'd need to 
> be a WithParamValue node,  that first evaluates parameters and then executes 
> the child expression. I'm thinking of doing this hierarchically so there's 
> less issues with the setting of the param value being moved away from the 
> child expression using it.

I don't quite follow the need for this.  I mean, if we just stick a
Param reference in there and create a corresponding InitPlan, the
Param will be evaluated on demand, right?  Is the point of the new
node to make sure that the Param gets re-evaluated when needed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GnuTLS support

2017-11-21 Thread Andreas Karlsson

On 11/20/2017 02:56 AM, Michael Paquier wrote:

Sorry about that. Something more specific needs to happen here as well
for channel binding support with SCRAM. CheckSCRAMAuth() now assumes
that the channel binding mechanism SCRAM-SHA-256-PLUS can be published
to the client if SSL is used, because OpenSSL is the only
implementation available. Does gnutls include an API to allow an
application to fetch the bytes from the TLS finished message? I can
see some references by glancing at the tarball of gnutls 3.6.1, but
you would need something similar to OpenSSL's SSL_get_peer_finished().
If that cannot be achieved I think that it will be necessary to tweak
auth.c to not publish the -PLUS mechanism if for example the result of
be_tls_get_peer_finished() is NULL. No need for a new SSL-specific
API. At the end it would prove to be more portable to do so for all
the SSL implementations, MacOS stuff does not document an API to get
the TLS finish message bytes.


There is a function called gnutls_session_channel_binding(), which seems
to do something very similar to SSL_get*_finished() and which has been in
GnuTLS since 2.12.


https://www.gnutls.org/manual/gnutls.html#Channel-Bindings
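
A rough sketch of how that could be used on the backend side (the function
name and the simplified, Port-less signature are mine, and error reporting is
omitted):

#include <stdlib.h>
#include <string.h>
#include <gnutls/gnutls.h>

/*
 * Rough GnuTLS counterpart of the OpenSSL-based be_tls_get_peer_finished():
 * return a malloc'd copy of the tls-unique channel-binding data for the
 * session, or NULL (with *len set to 0) if it cannot be obtained.
 */
static char *
get_tls_unique(gnutls_session_t session, size_t *len)
{
	gnutls_datum_t cb;
	char	   *result = NULL;

	*len = 0;
	if (gnutls_session_channel_binding(session, GNUTLS_CB_TLS_UNIQUE, &cb) == 0)
	{
		result = malloc(cb.size);
		if (result != NULL)
		{
			memcpy(result, cb.data, cb.size);
			*len = cb.size;
		}
		gnutls_free(cb.data);
	}

	return result;
}

If the call fails (e.g. on an old GnuTLS), returning NULL here would line up
with the suggestion above of simply not advertising the -PLUS mechanism.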


  --with-ssl=(openssl|gnutls)


With potential patches coming to use macOS' SSL implementation or
Windows channel, there should really be only one implementation
available at compile time. That's simpler as a first step as well.
So +1 for the --with-ssl switch.


Agreed.

Andreas



Re: IndexTupleDSize macro seems redundant

2017-11-21 Thread Amit Kapila
On Mon, Nov 20, 2017 at 9:01 PM, Ildar Musin  wrote:
> Hi all,
>
> While I was looking through the indexes code I got confused by couple of
> macros - IndexTupleSize() and IndexTupleDSize() - which seem to do the same
> thing with only difference that the first one takes pointer as an argument
> while the second one takes struct. And in most cases IndexTupleDSize() is
> used with dereferencing of index tuple where IndexTupleSize() would suit
> perfectly. Is there a particular reason to have them both? I've made a patch
> that removes IndexTupleDSize macro. All the tests have passed.
>

+1.  I was also once confused with these macros.  I think this is a
good cleanup.  On a quick look, I don't see any problem with your
changes.
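
For readers following along, the two macros in src/include/access/itup.h look
roughly like this (shown approximately; the header is authoritative); the only
difference is pointer versus struct access:

#define IndexTupleSize(itup) \
	((Size) (((IndexTuple) (itup))->t_info & INDEX_SIZE_MASK))

#define IndexTupleDSize(itup) \
	((Size) ((itup).t_info & INDEX_SIZE_MASK))

Since most callers of IndexTupleDSize() were written as IndexTupleDSize(*itup)
anyway, keeping only the pointer-taking variant loses nothing.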

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: feature request: consume asynchronous notification via a function

2017-11-21 Thread Robert Haas
On Fri, Nov 17, 2017 at 9:49 AM, Merlin Moncure  wrote:
> Currently the only way that I know of to consume async notifications
> via SQL (as opposed to a client application) is via dblink_get_notify.
> This method isn't very good; it requires some extra support coding,
> eats a connection and a backend, and doesn't have any timeout
> facilities.  The lack of a good facility to do this will become more
> troublesome if/when Peter's recent fantastic work to implement stored
> procedures in the database gets accepted; asynchronous notifications
> could be a more efficient mechanism for backend processes to signal
> each other than the current method of signalling via fields in a
> table.
>
> A good interface might look something like:
> pg_get_notifications(
>   TimeOut INT DEFAULT 0,
>   notify_name  OUT TEXT,
>   payload OUT TEXT,
>   pid OUT INT) RETURNS SETOF RECORD AS...
>
> The function would return immediately by default, or until TimeOut
> seconds transpired.  We'd still have to poll internally, so that
> signals could be checked etc, but this would be a nice way to consume
> notifications without any dependencies -- what do you think?

I think that wouldn't work very well, because I think we must have a
snapshot open in order to run pg_get_notifications(), and that means
we're holding back the system-wide xmin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-21 Thread Robert Haas
On Fri, Nov 17, 2017 at 7:24 AM, Ashutosh Bapat
 wrote:
> + * this value should be multiplied with cpu_tuple_cost wherever applicable.
> + */
> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I am wondering whether we should just define
> #define APPEND_TUPLE_COST (cpu_tuple_cost * 0.5)
> and use this macro everywhere. What other use would DEFAULT_APPEND_COST_FACTOR
> have than being multiplied with cpu_tuple_cost?

-1.  If you wrap it in a macro like that, future readers of the code
will have to go look up what the macro does.  If you just multiply by
DEFAULT_APPEND_COST_FACTOR it will be clear that what's being used is
a multiple of cpu_tuple_cost.

> I am not sure whether we should be discussing why this technique performs
> better or when it performs better. We don't have a similar discussion for
> partition-wise join. That paragraph just describes the technique, and maybe
> we want to do the same here.

+1.

> + * might be optimal because of presence of suitable paths with pathkeys or
> + * because the hash tables for most of the partitions fit into the memory.
> + * However, if partition keys are not part of the group by clauses, then we
> + * still able to compute the partial aggregation for each partition and then
> + * finalize them over append. This can either win or lose. It may win if the
> + * PartialAggregate stage greatly reduces the number of groups and lose if we
> + * have lots of small groups.
>
> I have not seen the prologue of a function implementing a query optimization
> technique explain why that technique improves performance. So I am not sure
> whether the comment should include this explanation. One of the reasons is
> that the reasons why a technique works might change over time with the
> introduction of other techniques, thus obsoleting the comment.  But maybe
> it's good to have it here.

+1 for keeping it.

> +extra.inputRows = 0;/* Not needed at child paths creation */
>
> Why? Comment should be on its own line.

Comments on same line are fine if they are short enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[PATCH] using arc4random for strong randomness matters.

2017-11-21 Thread David CARLIER
Hello,

This is my first small personal contribution.

Motivation:
- Using a fail-safe, file-descriptor-free solution on *BSD and Darwin systems.
- For the moment, somewhat avoiding FreeBSD, as it still uses RC4 (seemingly
updated to ChaCha20 for FreeBSD 12.0 and eventually backported later on).
- For implementations based on ChaCha*, it is known to be fast enough for the
purpose.
- make check passes.

Hope it is good.

Thanks in advance.


0001-Using-arc4random-API-for-strong-randomness-if-availa.patch
Description: Binary data


Re: [HACKERS] Parallel Append implementation

2017-11-21 Thread amul sul
On Tue, Nov 21, 2017 at 2:22 PM, Amit Khandekar  wrote:
> On 21 November 2017 at 12:44, Rafia Sabih  
> wrote:
>> On Mon, Nov 13, 2017 at 12:54 PM, Amit Khandekar  
>> wrote:
>>> Thanks a lot Robert for the patch. I will have a look. Quickly tried
>>> to test some aggregate queries with a partitioned pgbench_accounts
>>> table, and it is crashing. Will get back with the fix, and any other
>>> review comments.
>>>
>>> Thanks
>>> -Amit Khandekar
>>
>> I was trying to get the performance of this patch at commit id -
>> 11e264517dff7a911d9e6494de86049cab42cde3 and TPC-H scale factor 20
>> with the following parameter settings,
>> work_mem = 1 GB
>> shared_buffers = 10GB
>> effective_cache_size = 10GB
>> max_parallel_workers_per_gather = 4
>> enable_partitionwise_join = on
>>
>> and the details of the partitioning scheme is as follows,
>> tables partitioned = lineitem on l_orderkey and orders on o_orderkey
>> number of partitions in each table = 10
>>
>> As per the explain outputs PA was used in following queries- 1, 3, 4,
>> 5, 6, 7, 8, 10, 12, 14, 15, 18, and 21.
>> Unfortunately, at the time of executing any of these query, it is
>> crashing with the following information in  core dump of each of the
>> workers,
>>
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x10600984 in pg_atomic_read_u32_impl (ptr=0x3ec29294)
>> at ../../../../src/include/port/atomics/generic.h:48
>> 48 return ptr->value;
>>
>> In case this is a different issue from the one you pointed out upthread, you
>> may want to have a look at this as well.
>> Please let me know if you need any more information in this regard.
>
> Right, for me the crash had occurred with a similar stack, although
> the real crash happened in one of the workers. Attached is the script
> file
> pgbench_partitioned.sql to create a schema with which I had reproduced
> the crash.
>
> The query that crashed :
> select sum(aid), avg(aid) from pgbench_accounts;
>
> Set max_parallel_workers_per_gather to at least 5.
>
> Also attached is v19 patch rebased.
>

I've spent a little time debugging this crash. The crash happens in ExecAppend()
because the subnode in the node->appendplans array is referenced using an
incorrect (out-of-bounds) array index in the following code:

/*
 * figure out which subplan we are currently processing
 */
subnode = node->appendplans[node->as_whichplan];

This incorrect value gets assigned to node->as_whichplan in
choose_next_subplan_for_worker().

The following change on top of the v19 patch fixes it for me:

--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -489,11 +489,9 @@ choose_next_subplan_for_worker(AppendState *node)
}

/* Pick the plan we found, and advance pa_next_plan one more time. */
-   node->as_whichplan = pstate->pa_next_plan;
+   node->as_whichplan = pstate->pa_next_plan++;
if (pstate->pa_next_plan == node->as_nplans)
pstate->pa_next_plan = append->first_partial_plan;
-   else
-   pstate->pa_next_plan++;

/* If non-partial, immediately mark as finished. */
if (node->as_whichplan < append->first_partial_plan)

The attached patch makes the same change on top of Amit's ParallelAppend_v19_rebased.patch.

Regards,
Amul


fix_crash.patch
Description: Binary data


Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Kyotaro HORIGUCHI
Hello,

At Tue, 21 Nov 2017 20:27:25 +0900, atorikoshi 
 wrote in 

> Thanks for reviewing!
> 
> 
> On 2017/11/21 18:12, Masahiko Sawada wrote:
> > On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada
> >  wrote:
> >> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
> >>  wrote:
> >>> Hi,
> >>>
> >>> I put many queries into one transaction and made ReorderBuffer spill
> >>> data to disk, and sent SIGKILL to postgres before the end of the
> >>> transaction.
> >>>
> >>> After starting up postgres again, I observed the files spilled to
> >>> data wasn't deleted.
> >>>
> >>> I think these files should be deleted because its transaction was no
> >>> more valid, so no one can use these files.
> >>>
> >>>
> >>> Below is a reproduction instructions.
> >>>
> >>> 
> >>> 1. Create table and publication at publiser.
> >>>
> >>>   @pub =# CREATE TABLE t1(
> >>>   id INT PRIMARY KEY,
> >>>   name TEXT);
> >>>
> >>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
> >>>
> >>> 2. Create table and subscription at subscriber.
> >>>
> >>>   @sub =# CREATE TABLE t1(
> >>>   id INT PRIMARY KEY,
> >>>   name TEXT
> >>>   );
> >>>
> >>>   @sub =# CREATE SUBSCRIPTION sub
> >>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
> >>>   PUBLICATION pub;
> >>>
> >>> 3. Put many queries into one transaction.
> >>>
> >>>   @pub =# BEGIN;
> >>> INSERT INTO t1
> >>> SELECT
> >>>   i,
> >>>   'aa'
> >>> FROM
> >>> generate_series(1, 100) as i;
> >>>
> >>> 4. Then we can see spilled files.
> >>>
> >>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
> >>> state
> >>> xid-561-lsn-0-100.snap
> >>> xid-561-lsn-0-200.snap
> >>> xid-561-lsn-0-300.snap
> >>> xid-561-lsn-0-400.snap
> >>> xid-561-lsn-0-500.snap
> >>> xid-561-lsn-0-600.snap
> >>> xid-561-lsn-0-700.snap
> >>> xid-561-lsn-0-800.snap
> >>> xid-561-lsn-0-900.snap
> >>>
> >>> 5. Kill publisher's postgres process before COMMIT.
> >>>
> >>>   @pub $ kill -s SIGKILL [pid of postgres]
> >>>
> >>> 6. Start publisher's postgres process.
> >>>
> >>>   @pub $ pg_ctl start -D ${PGDATA}
> >>>
> >>> 7. After a while, we can see the files remaining.
> >>>   (Immediately after starting publiser, we can not see these files.)
> >>>
> >>>   @pub $ pg_ctl start -D ${PGDATA}
> >>>
> >>>   When I configured with '--enable-cassert', below assertion error
> >>>   was appeared.
> >>>
> >>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File:
> >>> "reorderbuffer.c",
> >>> Line: 2576)
> >>> 
> >>>
> >>> Attached patch sets final_lsn to the last ReorderBufferChange if
> >>> final_lsn == 0.
> >>
> >> Thank you for the report. I could reproduce this issue with the above
> >> step. My analysis is that the cause of a serialized reorder buffer
> >> not being cleaned up is that an aborted transaction without an abort WAL
> >> record has no chance to set ReorderBufferTXN->final_lsn. So if there
> >> are such serialized transactions, ReorderBufferRestoreCleanup cleans up
> >> no files, which is the cause of the assertion failure (or a file being
> >> orphaned). What do you think?
> >>
> >> On the details of your patch, I'm not sure it's safe if we set an lsn
> >> other than that of a commit record or abort record as final_lsn. The
> >> comment in reorderbuffer.h says,
> >>
> >> typedef trcut ReorderBufferTXN
> >> {
> >> (snip)
> >>
> >> /* 
> >>  * LSN of the record that lead to this xact to be committed or
> >>  * aborted. This can be a
> >>  * * plain commit record
> >>  * * plain commit record, of a parent transaction
> >>  * * prepared transaction commit
> >>  * * plain abort record
> >>  * * prepared transaction abort
> >>  * * error during decoding
> >>  * 
> >>  */
> >> XLogRecPtr  final_lsn;
> >>
> >> But with your patch, we could set the lsn of a record other than those
> >> listed above as final_lsn. One way I came up with is to make
> 
> I added some comments on final_lsn.
> 
> >> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> >> regard it as an aborted transaction that doesn't have an abort WAL
> >> record. So we can clean up all serialized files if the final_lsn of a
> >> transaction is invalid.
> >
> > After more thought, since there is some code cosmetically setting
> > final_lsn when the fate of the transaction is determined, possibly we
> > should not accept an invalid value of final_lsn even in that case.
> >

It doesn't seem just cosmetic. The first/last_lsn are used to
determine the files to be deleted. On the other hand, the TXN
cannot have a last_lsn since it hasn't seen an abort/commit record.

> My new patch keeps setting final_lsn, but changed its location to the
> top of ReorderBufferCleanupTXN().
> I think it's a kind of preparation, so doing it at the top improves
> readability.
> 
> > Any

Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread atorikoshi

Thanks for reviewing!


On 2017/11/21 18:12, Masahiko Sawada wrote:

On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada  wrote:

On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
 wrote:

Hi,

I put many queries into one transaction and made ReorderBuffer spill
data to disk, and sent SIGKILL to postgres before the end of the
transaction.

After starting up postgres again, I observed the files spilled to
data wasn't deleted.

I think these files should be deleted because its transaction was no
more valid, so no one can use these files.


Below is a reproduction instructions.


1. Create table and publication at publiser.

  @pub =# CREATE TABLE t1(
  id INT PRIMARY KEY,
  name TEXT);

  @pub =# CREATE PUBLICATION pub FOR TABLE t1;

2. Create table and subscription at subscriber.

  @sub =# CREATE TABLE t1(
  id INT PRIMARY KEY,
  name TEXT
  );

  @sub =# CREATE SUBSCRIPTION sub
  CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
  PUBLICATION pub;

3. Put many queries into one transaction.

  @pub =# BEGIN;
INSERT INTO t1
SELECT
  i,
  'aa'
FROM
generate_series(1, 100) as i;

4. Then we can see spilled files.

  @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
state
xid-561-lsn-0-100.snap
xid-561-lsn-0-200.snap
xid-561-lsn-0-300.snap
xid-561-lsn-0-400.snap
xid-561-lsn-0-500.snap
xid-561-lsn-0-600.snap
xid-561-lsn-0-700.snap
xid-561-lsn-0-800.snap
xid-561-lsn-0-900.snap

5. Kill publisher's postgres process before COMMIT.

  @pub $ kill -s SIGKILL [pid of postgres]

6. Start publisher's postgres process.

  @pub $ pg_ctl start -D ${PGDATA}

7. After a while, we can see the files remaining.
  (Immediately after starting publiser, we can not see these files.)

  @pub $ pg_ctl start -D ${PGDATA}

  When I configured with '--enable-cassert', below assertion error
  was appeared.

TRAP: FailedAssertion("!(txn->final_lsn != 0)", File: "reorderbuffer.c",
Line: 2576)


Attached patch sets final_lsn to the last ReorderBufferChange if
final_lsn == 0.


Thank you for the report. I could reproduce this issue with the above
step. My analysis is that the cause of a serialized reorder buffer
not being cleaned up is that an aborted transaction without an abort WAL
record has no chance to set ReorderBufferTXN->final_lsn. So if there
are such serialized transactions, ReorderBufferRestoreCleanup cleans up
no files, which is the cause of the assertion failure (or a file being
orphaned). What do you think?

On the details of your patch, I'm not sure it's safe if we set an lsn
other than that of a commit record or abort record as final_lsn. The
comment in reorderbuffer.h says,

typedef trcut ReorderBufferTXN
{
(snip)

/* 
 * LSN of the record that lead to this xact to be committed or
 * aborted. This can be a
 * * plain commit record
 * * plain commit record, of a parent transaction
 * * prepared transaction commit
 * * plain abort record
 * * prepared transaction abort
 * * error during decoding
 * 
 */
XLogRecPtr  final_lsn;

But with your patch, we could set the lsn of a record other than those
listed above as final_lsn. One way I came up with is to make


I added some comments on final_lsn.


ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
regard it as an aborted transaction that doesn't have an abort WAL
record. So we can clean up all serialized files if the final_lsn of a
transaction is invalid.


After more thought, since there is some code cosmetically setting
final_lsn when the fate of the transaction is determined, possibly we
should not accept an invalid value of final_lsn even in that case.



My new patch keeps setting final_lsn, but changed its location to the
top of ReorderBufferCleanupTXN().
I think it's a kind of preparation, so doing it at the top improves
readability.


Anyway, I think you should register this patch for the next commit fest so as
not to forget it.


Thanks for your advice; I've registered this issue as a bug.

Regards,

--
Atsushi Torikoshi
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

diff --git a/src/backend/replication/logical/reorderbuffer.c 
b/src/backend/replication/logical/reorderbuffer.c
index 0f607ba..bbeb494 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -1075,6 +1075,21 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, 
ReorderBufferTXN *txn)
boolfound;
dlist_mutable_iter iter;
 
+   /*
+* Before cleaning up, do a preparation.
+* If this transaction encountered crash during transaction,
+* txn->final_lsn remains initial value.
+* To properly remove entries which were spilled to disk, we need valid
+* final_lsn.
+* So 

Re: [HACKERS] pg_basebackup --progress output for batch execution

2017-11-21 Thread Martín Marqués
Arthur,

Thank you very much for reviewing the patch and for your valuable
input (you made me read the Microsoft Visual C specs ;))

Regards,

2017-11-21 8:11 GMT-03:00 Arthur Zakirov :
> On Tue, Nov 21, 2017 at 07:16:46AM -0300, Martín Marqués wrote:
>>
>> Ups! Attached the corrected version.:)
>>
>
> Thank you for the new version.
>
> The patch applies via the 'patch' command without warnings and errors, and the
> tests passed. Marked the patch as "Ready for Committer".
>
> --
> Arthur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company



-- 
Martín Marqués            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] pg_basebackup --progress output for batch execution

2017-11-21 Thread Arthur Zakirov
On Tue, Nov 21, 2017 at 07:16:46AM -0300, Martín Marqués wrote:
> 
> Ups! Attached the corrected version.:)
> 

Thank you for the new version.

The patch applies via the 'patch' command without warnings and errors, and the
tests passed. Marked the patch as "Ready for Committer".

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] pg_basebackup --progress output for batch execution

2017-11-21 Thread Martín Marqués
On 21/11/17 at 04:56, Arthur Zakirov wrote:
> On Mon, Nov 20, 2017 at 04:45:48PM -0300, Martín Marqués wrote:
>> New version of patch, without the --batch-mode option and using isatty()
>>
> 
> Great!
> 
>> +fprintf(stderr, "waiting for checkpoint");
>> +if (isatty(fileno(stderr)))
>> +fprintf(stderr, "\n");
>> +else
>> +fprintf(stderr, "\r");
> 
> Here the condition should be inverted I think. The next condition should be 
> used:
> 
>> if (!isatty(fileno(stderr)))
>> ...
> 
> Otherwise pg_basebackup will insert '\n' in terminal instead '\r'.

Ups! Attached the corrected version.:)

Nice catch. I completely missed that. Thanks.

-- 
Martín Marqués            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
new file mode 100644
index a8715d9..0ac9bd5
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
*** progress_report(int tablespacenum, const
*** 811,817 
  totaldone_str, totalsize_str, percent,
  tablespacenum, tablespacecount);
  
! 	fprintf(stderr, "\r");
  }
  
  static int32
--- 811,820 
  totaldone_str, totalsize_str, percent,
  tablespacenum, tablespacecount);
  
! 	if (!isatty(fileno(stderr)))
! 		fprintf(stderr, "\n");
! 	else
! 		fprintf(stderr, "\r");
  }
  
  static int32
*** BaseBackup(void)
*** 1796,1802 
  progname);
  
  	if (showprogress && !verbose)
! 		fprintf(stderr, "waiting for checkpoint\r");
  
  	basebkp =
  		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
--- 1799,1811 
  progname);
  
  	if (showprogress && !verbose)
! 	{
! 		fprintf(stderr, "waiting for checkpoint");
! 		if (!isatty(fileno(stderr)))
! 			fprintf(stderr, "\n");
! 		else
! 			fprintf(stderr, "\r");
! 	}
  
  	basebkp =
  		psprintf("BASE_BACKUP LABEL '%s' %s %s %s %s %s %s",
*** BaseBackup(void)
*** 1929,1935 
  	if (showprogress)
  	{
  		progress_report(PQntuples(res), NULL, true);
! 		fprintf(stderr, "\n");	/* Need to move to next line */
  	}
  
  	PQclear(res);
--- 1938,1945 
  	if (showprogress)
  	{
  		progress_report(PQntuples(res), NULL, true);
! 		if (isatty(fileno(stderr)))
! 			fprintf(stderr, "\n");	/* Need to move to next line */
  	}
  
  	PQclear(res);


Re: [HACKERS] path toward faster partition pruning

2017-11-21 Thread Rajkumar Raghuwanshi
On Fri, Nov 17, 2017 at 3:31 PM, Amit Langote  wrote:

>
> Support for hash partitioning and tests for the same.  Also, since
> update/delete on partitioned tables still depend on constraint exclusion
> for pruning, fix things such that get_relation_constraints includes
> partition constraints in its result only for non-select queries (for
> selects we have the new pruning code).  Other bug fixes.
>
Hi Amit,

I have applied the attached patch set on commit
11e264517dff7a911d9e6494de86049cab42cde3 and tried to test hash
partitioning. I got a server crash with the test below.

CREATE TABLE hp_tbl (a int, b int, c int) PARTITION BY HASH (a);
CREATE TABLE hp_tbl_p1 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
remainder 0) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p1_p1 PARTITION OF hp_tbl_p1 FOR VALUES WITH (modulus
4, remainder 0) PARTITION BY HASH (c);
CREATE TABLE hp_tbl_p1_p1_p1 PARTITION OF hp_tbl_p1_p1 FOR VALUES WITH
(modulus 4, remainder 0);
CREATE TABLE hp_tbl_p2 PARTITION OF hp_tbl FOR VALUES WITH (modulus 4,
remainder 1) PARTITION BY HASH (b);
CREATE TABLE hp_tbl_p2_p1 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
4, remainder 1);
CREATE TABLE hp_tbl_p2_p2 PARTITION OF hp_tbl_p2 FOR VALUES WITH (modulus
4, remainder 2);
insert into  hp_tbl select i,i,i from generate_series(0,10)i where i not
in(2,4,6,7,10);

explain select * from hp_tbl where a = 2;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: [HACKERS] path toward faster partition pruning

2017-11-21 Thread Kyotaro HORIGUCHI
Thank you, and sorry for the confusing comments.

At Mon, 13 Nov 2017 18:46:28 +0900, Amit Langote 
 wrote in 
<8460a6c3-68c5-b78a-7d18-d253180f2...@lab.ntt.co.jp>
> Horiguchi-san,
> 
> Thanks for taking a look.  Replying to all your emails here.

> > In 0003, 
> > 
> > +match_clauses_to_partkey(RelOptInfo *rel,
> > ...
> > +  if (rinfo->pseudoconstant &&
> > +      (IsA(clause, Const) &&
> > +       ((((Const *) clause)->constisnull) ||
> > +        !DatumGetBool(((Const *) clause)->constvalue))))
> > +  {
> > +*constfalse = true;
> > +continue;
> > 
> > If we call this function in both conjunction and disjunction
> > context (the latter is only in recursive case). constfalse ==
> > true means no need of any clauses for the former case.
> > 
> > Since (I think) just a list of RestrictInfo is expected to be
> > treated as a conjunction (it's quite doubious, though..),
> 
> I think it makes sense to consider a list of RestrictInfo's, such as
> baserestrictinfo, that is passed as input to match_clauses_to_partkey(),
> to be mutually conjunctive for our purpose here.

You're right, and I know it. I'm OK with leaving it since I recalled
that clause_selectivity has similar code.

> On 2017/11/10 14:44, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:38:11 +0900, Kyotaro HORIGUCHI wrote:
> >
> > This is working fine. Sorry for the bogus comment.
> 
> I'd almost started looking around if something might be wrong after all. :)

Very sorry for the wrong comment:(

> On 2017/11/10 16:07, Kyotaro HORIGUCHI wrote:
> > At Fri, 10 Nov 2017 14:44:55 +0900, Kyotaro HORIGUCHI wrote:
> >
> >>> Those two conditions are not orthogonal. Maybe something like
> >>> following seems more understantable.
> >>>
>  if (!constfalse)
>  {
>    /* No constraints on the keys, so, return *all* partitions. */
>    if (nkeys == 0)
>  return bms_add_range(result, 0, partdesc->nparts - 1);
> 
>    result = get_partitions_for_keys(relation, &keys);
>  }
> >
> > So, the condition (!constfalse && nkeys == 0) cannot return
> > there. I'm badly confused by the variable name.
> 
> Do you mean by 'constfalse'?

Perhaps. The name "constfalse" suggests (to me) that the
clauses evaluate to false constantly. But actually it just means
the clauses not in the return result in false. Anyway, I'll
take a look at v12 and will comment at that time.

> > I couldn't find another reasonable structure using the current
> > classify_p_b_keys(), but could you add a comment like the
> > following as an example?
> 
> OK, will add comments explaining what's going on.
> 
> 
> Will post the updated patches after also taking care of David's and Amul's
> review comments upthread.
> 
> Thanks,
> Amit

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: default range partition and constraint exclusion

2017-11-21 Thread Amit Langote
On 2017/11/18 8:28, Robert Haas wrote:
> On Fri, Nov 17, 2017 at 12:57 AM, Amit Langote
>  wrote:
>> While working on the patch for partition pruning for declarative
>> partitioned tables, I noticed that default range partition will fail to be
>> included in a plan in certain cases due to pruning by constraint exclusion.
>> you'll notice that it doesn't explicitly say that the default partition
>> allows rows where a is null or b is null or both are null.  Given that,
>> constraint exclusion will end up concluding that the default partition's
>> constraint is refuted by a = 2.
>>
>> The attached will make the constraint to look like:
> 
> Uh, if the constraint exclusion logic we're using is drawing false
> conclusions, we need to fix it so it doesn't, not change the
> constraint so that the wrong logic gives the right answer.

I did actually consider it, but ended up concluding that constraint
exclusion is doing all it can using the provided list of partition constraint
clauses.

If all predicate_refuted_by() receives is the expression tree (AND/OR)
with individual nodes being strict clauses involving partition keys (and
nothing about the nullness of the keys), the downstream code is just
playing by the rules as explained in the header comment of
predicate_refuted_by_recurse() in concluding that query's restriction
clause a = 2 refutes it.

You may notice in the example I gave that all rows with non-null partition
keys have a non-default range partitions to choose from.  So, the only
rows that exist in the default partition are those that contain null in
either or both partition keys.

The constraint that's currently generated for the default partition only
describes rows that contain non-null partition keys.  The default
partition in the example contains no such rows, so constraint exclusion is
right in concluding that it is excluded by a clause like a = 2.
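
To make this concrete, here is a minimal sketch of the kind of setup I'm
describing (the names and the MINVALUE/MAXVALUE bound syntax are mine, for
illustration only; the bounds mirror the constraint shown below):

CREATE TABLE p (a int, b int) PARTITION BY RANGE (a, b);
CREATE TABLE p1 PARTITION OF p
  FOR VALUES FROM (MINVALUE, MINVALUE) TO (1, 1);
CREATE TABLE p2 PARTITION OF p
  FOR VALUES FROM (1, 1) TO (MAXVALUE, MAXVALUE);
CREATE TABLE p_def PARTITION OF p DEFAULT;

-- A row with a null partition key can only be routed to the default
-- partition...
INSERT INTO p VALUES (2, NULL);

-- ...but the current default-partition constraint says nothing about
-- null keys, so constraint exclusion may prune p_def here and miss
-- that row.
EXPLAIN SELECT * FROM p WHERE a = 2;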

By the way, I made a mistake when listing the constraint that the proposed
patch will produce; it's actually:

NOT (
  a IS NOT NULL
   AND
  b IS NOT NULL
   AND
  (
((a < 1) OR ((a = 1) AND (b < 1)))
 OR
((a > 1) OR ((a = 1) AND (b >= 1)))
  )
)

Am I missing something?

Thanks,
Amit




Re: [HACKERS] [PROPOSAL] Temporal query processing with range types

2017-11-21 Thread Peter Moser
2017-11-14 18:42 GMT+01:00 Tom Lane :
> You might consider putting the rewriting into, um, the rewriter.
> It could be a separate pass after view expansion, if direct integration
> with the existing behavior seems unduly spaghetti-ish.  Or do it in
> an early phase of planning as he suggested.  There's not really that
> much difference between the rewriter and the planner for this purpose.
> Although one way to draw the distinction is that the output of the
> rewriter is (currently) still fully expressible as plain SQL, whereas
> once the planner goes into action the intermediate states of the tree
> might not really be SQL anymore (eg, it might contain join types that
> don't correspond to any SQL syntax).  So depending on what your rewrite
> emits, there would be a weak preference for calling it part of the
> rewriter or planner respectively.

2017-11-16 16:42 GMT+01:00 Robert Haas :
> Another thing to think about is that even though the CURRENT
> implementation just rewrites the relevant constructs as SQL, in the
> future somebody might want to do something else.  I feel like it's not
> hard to imagine a purpose-built ALIGN or NORMALIZE join type being a
> lot faster than the version that's just done by rewriting the SQL.
> That would be more work, potentially, but it would be nice if the
> initial implementation lent itself to being extended that way in the
> future, which an all-rewriter implementation would not.  On the other
> hand, maybe an early-in-the-optimizer implementation wouldn't either,
> and maybe it's not worth worrying about it anyway.  But it would be
> cool if this worked out in a way that meant it could be further
> improved without having to change it completely.

Hi hackers,
we would like to rethink our approach...

For simplicity I'll drop ALIGN for the moment and focus solely on NORMALIZE:

SELECT * FROM (R NORMALIZE S ON R.x = S.y WITH (R.time, S.time)) c;

Our normalization executor node needs the following input (for now
expressed in plain SQL):

SELECT R.*, p1
FROM (SELECT *, row_number() OVER () rn FROM R) R
 LEFT OUTER JOIN (
SELECT y, LOWER(time) p1 FROM S
UNION
SELECT y, UPPER(time) p1 FROM S
 ) S
 ON R.x = S.y AND p1 <@ R.time
ORDER BY rn, p1;

In other words:
1) The left subquery adds a unique ID to each tuple (i.e., rn).
2) The right subquery creates two results for each input tuple: one for
   the upper and one for the lower bound of each input tuple's valid time
   column. The boundaries get put into a single (scalar) column, namely p1.
3) We join both subqueries if the normalization predicates hold (R.x = S.y)
   and p1 is inside the time of the current outer tuple.
4) Finally, we sort the result by the unique ID (rn) and p1, and return all
   columns of the outer relation together with rn and p1.
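
(For anyone who wants to run the query above directly: a minimal assumed
schema would be something like

CREATE TABLE r (x int, time int4range);
CREATE TABLE s (y int, time int4range);

INSERT INTO r VALUES (1, '[1,7)');
INSERT INTO s VALUES (1, '[3,9)');

where the column names match the query and the range type is only
illustrative. With this data the only bound of S that falls inside R's
interval is 3, so R's single tuple comes back once with p1 = 3.)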

Our first attempt to understand the new approach would be as follows: The
left base rel of the inner left-outer-join can be expressed as a WindowAgg
node. However, the right query of the join is much more difficult to build
(maybe through hash aggregates). Both queries could be put together with a
MergeJoin for instance. However, if we create the plan tree by hand and
choose algorithms for it manually, how is it possible to have it optimized
later? Or, if that is not possible, how do we choose the best algorithms
for it?

Best regards,
Anton, Johann, Michael, Peter



Re: [HACKERS] Assertion failure when the non-exclusive pg_stop_backup aborted.

2017-11-21 Thread Masahiko Sawada
On Tue, Nov 21, 2017 at 10:01 AM, Michael Paquier
 wrote:
> On Tue, Nov 21, 2017 at 9:37 AM, Masahiko Sawada  
> wrote:
>> On Tue, Nov 21, 2017 at 8:03 AM, Michael Paquier
>>  wrote:
>>> You could just add "as this allows to keep backup counters kept in
>>> shared memory consistent with the state of the session starting or
>>> stopping a backup.".
>>
>> Thank you for the suggestion, Michael-san. Attached updated patch.
>> Please review it.
>
> [nit]
> + * or stoppping a backup.
> s/stoppping/stopping/

Oops.

> Fujii-san, please note that the same concept does not apply to
> do_pg_start_backup().
>
>   * reason, *all* functionality between do_pg_start_backup() and
> - * do_pg_stop_backup() should be inside the error cleanup block!
> + * do_pg_stop_backup(), including do_pg_stop_backup() should be inside
> + * the error cleanup block!
>   */
> Weirdly worded here. "between do_pg_start_backup until
> do_pg_stop_backup is done" sounds better?

Agreed.
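
(For context, the sessions in question are non-exclusive backups driven
through the SQL-level functions, roughly:

SELECT pg_start_backup('label', false, false);  -- third argument: exclusive => false
-- copy the data directory with an external tool
SELECT * FROM pg_stop_backup(false);

If the session aborts anywhere between these two calls, do_pg_abort_backup()
is what has to bring the backup counters in shared memory back to a
consistent state, which is what the comment wording above is about.)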

Thank you for comments. Attached updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_do_pg_abort_backup_v9.patch
Description: Binary data


Re: Anybody care about having the verbose form of the tzdata files?

2017-11-21 Thread Daniel Gustafsson
> On 20 Nov 2017, at 21:38, Tom Lane  wrote:

> So I'm wondering if we should replace src/timezone/data/* with
> tzdata.zi, in the name of reducing the size of our tarballs.

+1

> Anybody here actually care about reading the zone data files?

I doubt there is anyone who cares about that who isn’t already consuming the
upstream data.

cheers ./daniel


Re: [HACKERS] CLUSTER command progress monitor

2017-11-21 Thread Antonin Houska
Tom Lane  wrote:

> Antonin Houska  writes:
> > Robert Haas  wrote:
> >> These two phases overlap, though. I believe progress reporting for
> >> sorts is really hard.
> 
> > Whatever complexity is hidden in the sort, cost_sort() should have taken it
> > into consideration when called via plan_cluster_use_sort(). Thus I think 
> > that
> > once we have both startup and total cost, the current progress of the sort
> > stage can be estimated from the current number of input and output
> > rows. Please remind me if my proposal appears to be too simplistic.
> 
> Well, even if you assume that the planner's cost model omits nothing
> (which I wouldn't bet on), its result is only going to be as good as the
> planner's estimate of the number of rows to be sorted.  And, in cases
> where people actually care about progress monitoring, it's likely that
> the planner got that wrong, maybe horribly so.  I think it's a bad idea
> for progress monitoring to depend on the planner's estimates in any way
> whatsoever.

The general idea was that some sort of prediction of the total cost is needed
anyway if we are to tell during execution what fraction of the work has
already been done, and also that the cost computation we perform during
execution shouldn't (ideally) differ from cost_sort(). So I thought it would
be easier to refine cost_sort() than to implement the same computation from
scratch elsewhere.

Besides that I see 2 circumstances that make the estimate of the number of
input tuples simpler in the CLUSTER case:

* There's only 1 input relation w/o any kind of clause.

* CLUSTER uses SnapshotAny, so pg_class.reltuples is closer to the actual
  number of input rows than it would be in the general case. (Of course,
  pg_class would only be useful for the initial estimate.)

Unlike the planner, the executor could recalculate the cost estimate at some
point(s) as it recognizes that the actual number of tuples per page differs
from the density initially derived from pg_class. Still wrong?
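
Just to illustrate the kind of initial estimate I have in mind (the table
name is a placeholder):

SELECT reltuples, relpages,
       reltuples / NULLIF(relpages, 0) AS tuples_per_page
FROM pg_class
WHERE oid = 'some_table'::regclass;

The executor would start from such a density and then adjust the estimate as
the actual tuples-per-page ratio observed during the scan diverges from it.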

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: Failed to delete old ReorderBuffer spilled files

2017-11-21 Thread Masahiko Sawada
On Tue, Nov 21, 2017 at 3:48 PM, Masahiko Sawada  wrote:
> On Mon, Nov 20, 2017 at 7:35 PM, atorikoshi
>  wrote:
>> Hi,
>>
>> I put many queries into one transaction and made ReorderBuffer spill
>> data to disk, and sent SIGKILL to postgres before the end of the
>> transaction.
>>
>> After starting up postgres again, I observed that the files spilled to
>> disk weren't deleted.
>>
>> I think these files should be deleted because their transaction is no
>> longer valid, so no one can use them.
>>
>>
>> Below are the reproduction instructions.
>>
>> 
>> 1. Create table and publication at publisher.
>>
>>   @pub =# CREATE TABLE t1(
>>   id INT PRIMARY KEY,
>>   name TEXT);
>>
>>   @pub =# CREATE PUBLICATION pub FOR TABLE t1;
>>
>> 2. Create table and subscription at subscriber.
>>
>>   @sub =# CREATE TABLE t1(
>>   id INT PRIMARY KEY,
>>   name TEXT
>>   );
>>
>>   @sub =# CREATE SUBSCRIPTION sub
>>   CONNECTION 'host=[hostname] port=[port] dbname=[dbname]'
>>   PUBLICATION pub;
>>
>> 3. Put many queries into one transaction.
>>
>>   @pub =# BEGIN;
>> INSERT INTO t1
>> SELECT
>>   i,
>>   'aa'
>> FROM
>> generate_series(1, 100) as i;
>>
>> 4. Then we can see spilled files.
>>
>>   @pub $ ls -1 ${PGDATA}/pg_replslot/sub/
>> state
>> xid-561-lsn-0-100.snap
>> xid-561-lsn-0-200.snap
>> xid-561-lsn-0-300.snap
>> xid-561-lsn-0-400.snap
>> xid-561-lsn-0-500.snap
>> xid-561-lsn-0-600.snap
>> xid-561-lsn-0-700.snap
>> xid-561-lsn-0-800.snap
>> xid-561-lsn-0-900.snap
>>
>> 5. Kill publisher's postgres process before COMMIT.
>>
>>   @pub $ kill -s SIGKILL [pid of postgres]
>>
>> 6. Start publisher's postgres process.
>>
>>   @pub $ pg_ctl start -D ${PGDATA}
>>
>> 7. After a while, we can see the files remaining.
>>   (Immediately after starting the publisher, we cannot see these files.)
>>
>>   @pub $ pg_ctl start -D ${PGDATA}
>>
>>   When I configured with '--enable-cassert', the assertion error
>>   below appeared.
>>
>> TRAP: FailedAssertion("!(txn->final_lsn != 0)", File: "reorderbuffer.c",
>> Line: 2576)
>> 
>>
>> Attached patch sets final_lsn to the last ReorderBufferChange if
>> final_lsn == 0.
>
> Thank you for the report. I could reproduce this issue with the above
> steps. My analysis is that the reason a serialized reorder buffer isn't
> cleaned up is that an aborted transaction without an abort WAL record
> never gets a chance to set ReorderBufferTXN->final_lsn. So if there are
> such serialized transactions, ReorderBufferRestoreCleanup cleans up no
> files, which is the cause of the assertion failure (or of a file being
> orphaned). What do you think?
>
> Regarding the details of your patch, I'm not sure it's safe to set
> final_lsn to the LSN of a record other than a commit or abort record.
> The comment in reorderbuffer.h says:
>
> typedef struct ReorderBufferTXN
> {
> (snip)
>
> /* 
>  * LSN of the record that lead to this xact to be committed or
>  * aborted. This can be a
>  * * plain commit record
>  * * plain commit record, of a parent transaction
>  * * prepared transaction commit
>  * * plain abort record
>  * * prepared transaction abort
>  * * error during decoding
>  * 
>  */
> XLogRecPtr  final_lsn;
>
> But with your patch, we could set final_lsn to the LSN of a record other
> than those listed above. One way I came up with is to make
> ReorderBufferRestoreCleanup accept an invalid value of final_lsn and
> regard it as an aborted transaction that doesn't have an abort WAL
> record. That way we can clean up all serialized files if a transaction's
> final_lsn is invalid.

After more thought, since there are already some places in the code that
set final_lsn once the fate of a transaction is determined, possibly we
should not accept an invalid value of final_lsn even in that case.

> Since I'm not very familiar with the snapshot building part, please
> check it.

Sorry, I meant the reorderbuffer module; the snapshot building part isn't
relevant.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


