date:20180509

Re: Needless additional partition check in INSERT?

2018-05-09 Thread Simon Riggs

On 10 May 2018 at 05:33, David Rowley  wrote:
> On 10 May 2018 at 16:13, Amit Langote  wrote:
>> The patch to ExecInsert looks good, but I think we also need to do the
>> same thing in CopyFrom.
>
> I think so too.
>
> Updated patch attached.

Patch is good.

The cause of this oversight is the lack of comments to explain the
original coding, so we need to correct that in this patch, please.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

PANIC during crash recovery of a recently promoted standby

2018-05-09 Thread Pavan Deolasee

Hello,

I recently investigated a problem where a standby is promoted to be the new
master. The promoted standby crashes shortly thereafter for whatever
reason. Upon running the crash recovery, the promoted standby (now master)
PANICs with message such as:

PANIC,XX000,"WAL contains references to invalid
pages""XLogCheckInvalidPages,
xlogutils.c:242",""

After investigation, I could recreate a reproduction scenario for this
problem. The attached TAP test (thanks Alvaro from converting my bash
script to a TAP test) demonstrates the problem. The test is probably
sensitive to timing, but it reproduces the problem consistently at least at
my end. While the original report was for 9.6, I can reproduce it on the
master and thus it probably affects all supported releases.

Investigations point to a possible bug where we fail to update the
minRecoveryPoint after completing the ongoing restart point upon promotion.
IMV after promotion the new master must always recover to the end of the
WAL to ensure that all changes are applied correctly. But what we've
instead is that minRecoveryPoint remains set to a prior location because of
this:

   /*
 * Update pg_control, using current time.  Check that it still shows
 * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
 * this is a quick hack to make sure nothing really bad happens if
somehow
 * we get here after the end-of-recovery checkpoint.
 */
   LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
{
ControlFile->checkPoint = lastCheckPointRecPtr;
ControlFile->checkPointCopy = lastCheckPoint;
ControlFile->time = (pg_time_t) time(NULL);

/*
 * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
 * this will have happened already while writing out dirty buffers,
 * but not necessarily - e.g. because no buffers were dirtied.  We
do
 * this because a non-exclusive base backup uses minRecoveryPoint to
 * determine which WAL files must be included in the backup, and the
 * file (or files) containing the checkpoint record must be
included,
 * at a minimum. Note that for an ordinary restart of recovery
there's
 * no value in having the minimum recovery point any earlier than
this
 * anyway, because redo will begin just after the checkpoint record.
 */
if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
{
ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
ControlFile->minRecoveryPointTLI =
lastCheckPoint.ThisTimeLineID;

/* update local copy */
minRecoveryPoint = ControlFile->minRecoveryPoint;
minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
}
if (flags & CHECKPOINT_IS_SHUTDOWN)
ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
UpdateControlFile();
}
LWLockRelease(ControlFileLock);



After promotion, the minRecoveryPoint is only updated (cleared) when the
first regular checkpoint completes. If a crash happens before that, we will
run the crash recovery with a stale minRecoveryPoint, which results into
the PANIC that we diagnosed. The test case was written to reproduce the
issue as reported to us. Thus the test case TRUNCATEs and extends the table
at hand after promotion. The crash shortly thereafter leaves the pages in
uninitialised state because the shared buffers are not yet flushed to the
disk.

During crash recovery, we see uninitialised pages for the WAL records
written before the promotion. These pages are remembered and we expect to
either see a DROP TABLE or TRUNCATE WAL record before the minRecoveryPoint
is reached. But since the minRecoveryPoint is still pointing to a WAL
location prior to the TRUNCATE operation, crash recovery hits the
minRecoveryPoint before seeing the TRUNCATE WAL record. That results in a
PANIC situation.

I propose that we should always clear the minRecoveryPoint after promotion
to ensure that crash recovery always run to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP patch
is attached.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


0002-Ensure-recovery-is-run-to-the-end-upon-promotion-of-.patch
Description: Binary data


0001-A-new-TAP-test-to-test-a-recovery-bug.patch
Description: Binary data

Re: Needless additional partition check in INSERT?

2018-05-09 Thread Amit Langote

On 2018/05/10 13:33, David Rowley wrote:
> On 10 May 2018 at 16:13, Amit Langote  wrote:
>> The patch to ExecInsert looks good, but I think we also need to do the
>> same thing in CopyFrom.
> 
> I think so too.
> 
> Updated patch attached.

Thanks, looks good.

Regards,
Amit

Re: Needless additional partition check in INSERT?

2018-05-09 Thread David Rowley

On 10 May 2018 at 16:13, Amit Langote  wrote:
> The patch to ExecInsert looks good, but I think we also need to do the
> same thing in CopyFrom.

I think so too.

Updated patch attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_needless_additional_partition_check_v2.patch
Description: Binary data

Re: Needless additional partition check in INSERT?

2018-05-09 Thread Amit Langote

On 2018/05/10 12:55, David Rowley wrote:
> Hi,
> 
> Scanning ExecInsert, it looks like there's a needless additional
> partition constraint check against the tuple. This should only be
> required if there's a before row INSERT trigger.  The code block up
> one from the additional check tries to disable the check, but it goes
> ahead regardless, providing there's some other constraint.
> 
> ExecFindPartition should have already located the correct partition
> and nothing should have changed in the absence of before row insert
> triggers, so it looks like we're fine to not bother re-checking.

I think you're right.  So if we call ExecConstraints only because there
are other tuple constraints (rd_att->constr != NULL), then currently we
pass true for whether to check the partition constraint, irrespective of
the value of check_partition_constr.  I guess 19c47e7c820 should have done
it the way your patch teaches it do to begin with.

The patch to ExecInsert looks good, but I think we also need to do the
same thing in CopyFrom.

Thanks,
Amit

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-05-09 Thread Ashutosh Bapat

On Wed, May 9, 2018 at 5:20 PM, Etsuro Fujita
 wrote:
>
> (2018/04/25 18:51), Ashutosh Bapat wrote:
>> Actually I noticed that ConvertRowtypeExpr are used to cast a child's
>> whole row reference expression (not just a Var node) into that of its
>> parent and not. For example a cast like NULL::child::parent produces a
>> ConvertRowtypeExpr encapsulating a NULL constant node and not a Var
>> node. We are interested only in ConvertRowtypeExprs embedding Var
>> nodes with Var::varattno = 0. I have changed this code to use function
>> is_converted_whole_row_reference() instead of the above code with
>> Assert. In order to use that function, I had to take it out of
>> setrefs.c and put it in nodeFuncs.c which seemed to be a better fit.
>
> This change seems a bit confusing to me because the flag bits
> "PVC_INCLUDE_CONVERTROWTYPES" and "PVC_RECURSE_CONVERTROWTYPES" passed to
> pull_var_clause look as if it handles any ConvertRowtypeExpr nodes from a
> given clause, but really, it only handles ConvertRowtypeExprs you mentioned
> above.  To make that function easy to understand and use, I think it'd be
> better to use the IsA(node, ConvertRowtypeExpr) test as in the first version
> of the patch, instead of is_converted_whole_row_reference, which would be
> more consistent with other cases such as PlaceHolderVar.

I agree that using is_converted_whole_row_reference() is not
consistent with the other nodes that are handled by pull_var_clause().
I also agree that PVC_*_CONVERTROWTYPES doesn't reflect exactly what's
being done with those options. But using
is_converted_whole_row_reference() is the correct thing to do since we
are interested only in the whole-row references embedded in
ConvertRowtypeExpr. There can be anything encapsulated in
ConvertRowtypeExpr(), a non-shippable function for example. We don't
want to try to push that down in postgres_fdw's case. Neither in other
cases. So, it boils down to changing names of PVC_*_CONVERTROWTYPES to
something more meaningful. How about PVC_*_CONVERTWHOLEROWS?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Needless additional partition check in INSERT?

2018-05-09 Thread David Rowley

Hi,

Scanning ExecInsert, it looks like there's a needless additional
partition constraint check against the tuple. This should only be
required if there's a before row INSERT trigger.  The code block up
one from the additional check tries to disable the check, but it goes
ahead regardless, providing there's some other constraint.

ExecFindPartition should have already located the correct partition
and nothing should have changed in the absence of before row insert
triggers, so it looks like we're fine to not bother re-checking.

Or did I misunderstand?



-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


remove_needless_additional_partition_check.patch
Description: Binary data

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Amit Langote

On 2018/05/10 10:02, Michael Paquier wrote:
> Something
> that I find confusing on HEAD though is that DefineIndex calls itself
> around line 1006 and cascades through each children but there is no
> context about the error.
> 
> For example if I have this partition layer:
> CREATE TABLE measurement (
>  logdate date not null,
>  peaktempint,
>  unitsales   int
> ) PARTITION BY RANGE (logdate);
> CREATE FOREIGN TABLE measurement_y2016m07
> PARTITION OF measurement FOR VALUES FROM ('2016-07-01') TO
> ('2016-08-01')
> SERVER postgres_server;
> 
> Then by creating an index on the parent I get that:
> =# create index measurement_idx on measurement (peaktemp);
> ERROR:  42809: cannot create index on foreign table "measurement_y2016m07"
> LOCATION:  DefineIndex, indexcmds.c:442
> 
> This may be confusing for the user as the DDL does not complain directly
> about the relation worked on.  Perhaps we could add an error context?
> We surely don't want multiple contexts either when working on long
> chains, but we could build a chained list of relation names in the error
> message.

How about we error out even *before* calling DefineIndex for the 1st time?
 I see that ProcessUtilitySlow() gets a list of all partitions when
locking them for index creation before calling DefineIndex.  Maybe, just
go through the list and error out if one of them is a partition that we
don't support creating an index on?

For example, with the attached patch doing the above, I get:

create table p (a int) partition by range (a);
create table p1 partition of p for values from (minvalue) to (1);
create foreign table p2 partition of p for values from (1) to (maxvalue)
server loopback options (table_name 'p2base');
create index on p (a);
ERROR:  cannot create index on partitioned table containing foreign partitions

Thanks,
Amit
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 287addf429..d49e32d83f 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -67,6 +67,7 @@
 #include "tcop/utility.h"
 #include "utils/acl.h"
 #include "utils/guc.h"
+#include "utils/lsyscache.h"
 #include "utils/syscache.h"
 #include "utils/rel.h"
 
@@ -1318,12 +1319,26 @@ ProcessUtilitySlow(ParseState *pstate,
if (stmt->relation->inh)
{
Relationrel;
+   ListCell   *lc;
 
/* already locked by 
RangeVarGetRelidExtended */
rel = heap_open(relid, NoLock);
if (rel->rd_rel->relkind == 
RELKIND_PARTITIONED_TABLE)
inheritors = 
find_all_inheritors(relid, lockmode,

 NULL);
+   foreach(lc, inheritors)
+   {
+   charrelkind = 
get_rel_relkind(lfirst_oid(lc));
+
+   if (relkind != 
RELKIND_RELATION &&
+   relkind != 
RELKIND_MATVIEW &&
+   relkind != 
RELKIND_PARTITIONED_TABLE)
+   ereport(ERROR,
+   
(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+
errmsg("cannot create index on partitioned table containing foreign 
partitions")));
+   }
+
+   list_free(inheritors);
heap_close(rel, NoLock);
}
 
@@ -1353,8 +1368,6 @@ ProcessUtilitySlow(ParseState *pstate,

 parsetree);
commandCollected = true;
EventTriggerAlterTableEnd();
-
-   list_free(inheritors);
}
break;

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Amit Langote

On 2018/05/10 10:37, Michael Paquier wrote:
> On Thu, May 10, 2018 at 10:15:05AM +0900, Amit Langote wrote:
>> While I agree with this, let me point out that we do allow inherited check
>> constraints on foreign tables that are not actually enforced locally.
>>
>> create table p (a int) partition by range (a);
>> create table p1 partition of p for values from (minvalue) to (1);
>> create table p2base (a int);
>> create foreign table p2 partition of p for values from (1) to (maxvalue)
>> server loopback options (table_name 'p2base');
>>
>> alter table p add check (a between -1000 and 1000);
>>
>> -- routed to foreign partition, which doesn't enforce check constraints
>> insert into p values (1001);
>> INSERT 0 1
> 
> That's not actually a surprise, right?  Since foreign tables can be part
> of inheritance trees in 9.5, CHECK constraints on foreign tables are not
> enforced locally, but used as planner hints to guess how a query would
> work remotely.  So getting partition children to work the same way is
> consistent.
> 
>> We have to do the following to prevent that.
>>
>> alter table p2base add check (a between -1000 and 1000);
>> insert into p values (1001);
>> ERROR:  new row for relation "p2base" violates check constraint
>> "p2base_a_check"
>> DETAIL:  Failing row contains (1001).
>> CONTEXT:  remote SQL command: INSERT INTO public.p2base(a) VALUES ($1)
> 
> This bit looks natural to me as well.

Yes, I know it is working as designed and documented.  I was just trying
to comment on this bit of Robert's email:

"...because a major point of such an index is to enforce a constraint; we
can't allege that we have such a constraint if foreign tables are just
silently skipped."

So if someday we go ahead and allow indexes to be created on partitioned
tables even if there are foreign partitions, how would we choose to deal
with a unique constraint?  Will it be same as CHECK constraints such that
we don't enforce it locally but only assume it to be enforced by the
remote data source using whatever method?

But it seems I've misinterpreted what he was saying.  He doesn't seem to
be saying anything about how or whether we enforce the unique constraint
on foreign tables.  Only that if someone creates a constraint index on the
partitioned table, all partitions *including* foreign partitions, must get
a copy.

So for now, we give users an error if they try to create an index on a
partitioned table with a mix of local and foreign partitions.  Once we
figure out how to allow creating indexes (constraint-enforcing or not) on
foreign tables, we can then think of relaxing that restriction.

Thanks,
Amit

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-05-09 Thread Alvaro Herrera

David Rowley wrote:
> On 10 May 2018 at 14:01, Alvaro Herrera  wrote:
> > I'm thinking something a bit more radical.  First, since partition
> > pruning is the future and constraint exclusion is soon to be a thing of
> > the past, we should describe pruning first, and then describe exclusion
> > in terms of pruning.
> 
> But... that's not true.  The chapter describes inheritance partitioned
> tables too, and we're not getting rid of constraint exclusion because
> it's needed for those.

Oh, I'm sure it is, but nobody is going to set up new inheritance
partitioned tables anymore, except people who pg_upgrade from older
releases.  (And while I haven't tried, I'm sure it's possible to migrate
from old-style to new-style partitioned tables without incurring full
table rewrites, with little downside and lots to gain.)

Now, maybe you argue that we could have a structure like this instead:

5.10.1. Overview
5.10.2. Declarative Partitioning
5.10.3. Partition Pruning
5.10.4. Implementation Using Inheritance
5.10.5. Constraint Exclusion

I wouldn't oppose that.

> However, that might not mean your patch has to be changed. I'd better
> have a look...

Thanks :-)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-05-09 Thread David Rowley

On 10 May 2018 at 14:01, Alvaro Herrera  wrote:
> I'm thinking something a bit more radical.  First, since partition
> pruning is the future and constraint exclusion is soon to be a thing of
> the past, we should describe pruning first, and then describe exclusion
> in terms of pruning.

But... that's not true.  The chapter describes inheritance partitioned
tables too, and we're not getting rid of constraint exclusion because
it's needed for those. However, that might not mean your patch has to
be changed. I'd better have a look...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-05-09 Thread Alvaro Herrera

David Rowley wrote:
> Thanks for reviewing again.

Hi,

I'm thinking something a bit more radical.  First, since partition
pruning is the future and constraint exclusion is soon to be a thing of
the past, we should describe pruning first, and then describe exclusion
in terms of pruning.  Second, I'd put constraint exclusion as a 
inside the  that describes pruning (but keep the XML "id" the
same, so that old links continue to work.)

I took a stab at this, but ran out of time before trimming the text for
constraint exclusion.  What do you think of this rough sketch?  I'm
thinking 5.10.4 is close to its final form (wording suggestions of
course welcome), but 5.10.4.1 still needs to be trimmed heavily, to
avoid repeating what was already explained in 5.10.4 (we need only
explain how exclusion differs from pruning.)

I'm a bit undecided on where to leave the .

(Note:
   make -C doc/src/sgml html XSLTPROCFLAGS='--stringparam rootid ddl'
builds only the 'ddl' chapter, which is nice when proofreading.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 3f3f567222..2152b4d16d 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3759,7 +3759,151 @@ ANALYZE measurement;

   
 
-  
+  
+   Partition Pruning
+
+   
+partition pruning
+   
+
+   
+Partition pruning is a query optimization technique
+that improves performance for partitioned tables.  As an example:
+
+
+SET enable_partition_pruning = on;-- the default
+SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
+
+
+Without partition pruning, the above query would scan each of the
+the partitions of the measurement table. With
+partition pruning enabled, the planner will examine the definition of each
+partition and prove that the partition need not
+be scanned because it could not contain any rows meeting the query's
+WHERE clause.  When the planner can prove this, it
+excludes the partition from the query plan.
+   
+
+   
+You can use the EXPLAIN command to show the difference
+between a plan with enable_partition_pruning on and a 
plan
+with it off.  A typical unoptimized plan for this type of table setup is:
+
+
+SET enable_partition_pruning = off;
+EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
+QUERY PLAN 

+───
+ Aggregate  (cost=188.76..188.77 rows=1 width=8)
+   -  Append  (cost=0.00..181.05 rows=3085 width=0)
+ -  Seq Scan on measurement_y2006m02  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+ -  Seq Scan on measurement_y2006m03  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+ -  Seq Scan on measurement_y2007m11  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+ -  Seq Scan on measurement_y2007m12  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+ -  Seq Scan on measurement_y2008m01  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+
+
+Some or all of the partitions might use index scans instead of
+full-table sequential scans, but the point here is that there
+is no need to scan the older partitions at all to answer this query.
+When we enable partition pruning, we get a significantly
+cheaper plan that will deliver the same answer:
+
+
+SET enable_partition_pruning = on;
+EXPLAIN SELECT count(*) FROM measurement WHERE logdate = DATE '2008-01-01';
+QUERY PLAN 

+───
+ Aggregate  (cost=37.75..37.76 rows=1 width=8)
+   -  Append  (cost=0.00..36.21 rows=617 width=0)
+ -  Seq Scan on measurement_y2008m01  (cost=0.00..33.12 rows=617 
width=0)
+   Filter: (logdate = '2008-01-01'::date)
+
+   
+
+   
+Note that partition pruning is driven only by the constraints defined by
+the partition keys, not by the presence of indexes.  Therefore it isn't
+necessary to define indexes on the key columns.  Whether an index
+needs to be created for a given partition depends on whether you
+expect that queries that scan the partition will generally scan
+a large part of the partition or just a small part.  An index will
+be helpful in the latter case but not the former.
+   
+
+   
+Partition pruning
+can be performed not only during the planning of a given query, but also
+during its execution.  This is useful as it can allow more partitions to
+be pruned

Re: Postgres, fsync, and OSs (specifically linux)

2018-05-09 Thread Craig Ringer

On 10 May 2018 at 06:55, Andres Freund  wrote:

> Do you have a patchset including that?  I didn't find anything after a
> quick search...

There was an earlier rev on the other thread but without msync checks.

I've added panic for msync in the attached, and tidied the comments a bit.

I didn't add comments on why we panic to each individual pg_fsync or
FileSync caller that panics; instead pg_fsync points to
pg_fsync_no_writethrough which explains it briefly and has links.

I looked at callers of pg_fsync, pg_fsync_no_writethrough,
pg_fsync_writethrough, mdsync, and FileSync when writing this.

WAL writing already PANIC'd on fsync failure, which helps, though we
now know that's not sufficient for complete safety.

Patch on top of v11 HEAD @ ddc1f32ee507

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From a6cade9e1de68962d95374127841b0af8eb4cfe0 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 10 Apr 2018 14:08:32 +0800
Subject: [PATCH v2] PANIC when we detect a possible fsync I/O error instead of
 retrying fsync

Panic the server on fsync failure in places where we can't simply repeat the
whole operation on retry. Most imporantly, panic when fsync fails during a
checkpoint.

This will result in log messages like:

PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
LOG:  0: checkpointer process (PID 10799) was terminated by signal 6: Aborted

and, if the condition persists during redo:

LOG:  0: checkpoint starting: end-of-recovery immediate
PANIC:  58030: could not fsync file "base/12367/16386": Input/output error
LOG:  0: startup process (PID 10808) was terminated by signal 6: Aborted

Why?

In a number of places PostgreSQL we responded to fsync() errors by retrying the
fsync(), expecting that this would force the operating system to repeat any
write attempts. The code assumed that fsync() would return an error on all
subsequent calls until any I/O error was resolved.

This is not what actually happens on some platforms, including Linux. The
operating system may give up and drop dirty buffers for async writes on the
floor and mark the page mapping as bad. The first fsync() clears any error flag
from the page entry and/or our file descriptor. So a subsequent fsync() returns
success, even though the data PostgreSQL wrote was really discarded.

We have no way to find out which writes failed, and no way to ask the kernel to
retry indefinitely, so all we can do is PANIC. Redo will attempt the write
again, and if it fails again, it will also PANIC.

This doesn't completely prevent fsync reliability issues, because it only
handles cases where the kernel actually reports the error to us. It's entirely
possible for a buffered write to be lost without causing fsync to report an
error at all (see discussion below). Work on addressing those issues and
documenting them is ongoing and will be committed separately.

See:

* https://www.postgresql.org/message-id/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com
* https://www.postgresql.org/message-id/20180427222842.in2e4mibx45zd...@alap3.anarazel.de
* https://lwn.net/Articles/752063/
* https://lwn.net/Articles/753650/
* https://lwn.net/Articles/752952/
* https://lwn.net/Articles/752613/
---
 src/backend/access/heap/rewriteheap.c   |  6 +++---
 src/backend/access/transam/timeline.c   |  4 ++--
 src/backend/access/transam/twophase.c   |  2 +-
 src/backend/access/transam/xlog.c   |  4 ++--
 src/backend/replication/logical/snapbuild.c |  3 +++
 src/backend/storage/file/fd.c   | 29 +++--
 src/backend/storage/smgr/md.c   | 22 --
 src/backend/utils/cache/relmapper.c |  2 +-
 8 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 8d3c861a33..0320baffec 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -965,7 +965,7 @@ logical_end_heap_rewrite(RewriteState state)
 	while ((src = (RewriteMappingFile *) hash_seq_search(_status)) != NULL)
 	{
 		if (FileSync(src->vfd, WAIT_EVENT_LOGICAL_REWRITE_SYNC) != 0)
-			ereport(ERROR,
+			ereport(PANIC,
 	(errcode_for_file_access(),
 	 errmsg("could not fsync file \"%s\": %m", src->path)));
 		FileClose(src->vfd);
@@ -1180,7 +1180,7 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
 	 */
 	pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_MAPPING_SYNC);
 	if (pg_fsync(fd) != 0)
-		ereport(ERROR,
+		ereport(PANIC,
 (errcode_for_file_access(),
  errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
@@ -1279,7 +1279,7 @@ CheckPointLogicalRewriteHeap(void)
 			 */
 			pgstat_report_wait_start(WAIT_EVENT_LOGICAL_REWRITE_CHECKPOINT_SYNC);
 			if (pg_fsync(fd) != 0)

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Michael Paquier

On Thu, May 10, 2018 at 10:15:05AM +0900, Amit Langote wrote:
> While I agree with this, let me point out that we do allow inherited check
> constraints on foreign tables that are not actually enforced locally.
> 
> create table p (a int) partition by range (a);
> create table p1 partition of p for values from (minvalue) to (1);
> create table p2base (a int);
> create foreign table p2 partition of p for values from (1) to (maxvalue)
> server loopback options (table_name 'p2base');
> 
> alter table p add check (a between -1000 and 1000);
> 
> -- routed to foreign partition, which doesn't enforce check constraints
> insert into p values (1001);
> INSERT 0 1

That's not actually a surprise, right?  Since foreign tables can be part
of inheritance trees in 9.5, CHECK constraints on foreign tables are not
enforced locally, but used as planner hints to guess how a query would
work remotely.  So getting partition children to work the same way is
consistent.

> We have to do the following to prevent that.
> 
> alter table p2base add check (a between -1000 and 1000);
> insert into p values (1001);
> ERROR:  new row for relation "p2base" violates check constraint
> "p2base_a_check"
> DETAIL:  Failing row contains (1001).
> CONTEXT:  remote SQL command: INSERT INTO public.p2base(a) VALUES ($1)

This bit looks natural to me as well.
--
Michael


signature.asc
Description: PGP signature

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Amit Langote

On 2018/05/09 23:57, Robert Haas wrote:
> For right now, I think the options are (1) throw an ERROR if we
> encounter a foreign table or (2) silently skip the foreign table.  I
> think (2) is defensible for non-UNIQUE indexes, because the index is
> just a performance optimization.

Along with others, my +1 to option 1.

> However, for UNIQUE indexes, at
> least, it seems like we'd better do (1), because a major point of such
> an index is to enforce a constraint; we can't allege that we have such
> a constraint if foreign tables are just silently skipped.

While I agree with this, let me point out that we do allow inherited check
constraints on foreign tables that are not actually enforced locally.

create table p (a int) partition by range (a);
create table p1 partition of p for values from (minvalue) to (1);
create table p2base (a int);
create foreign table p2 partition of p for values from (1) to (maxvalue)
server loopback options (table_name 'p2base');

alter table p add check (a between -1000 and 1000);

-- routed to foreign partition, which doesn't enforce check constraints
insert into p values (1001);
INSERT 0 1

-- whereas, local partition does
insert into p values (-1001);
ERROR:  new row for relation "p1" violates check constraint "p_a_check"
DETAIL:  Failing row contains (-1001).

We have to do the following to prevent that.

alter table p2base add check (a between -1000 and 1000);
insert into p values (1001);
ERROR:  new row for relation "p2base" violates check constraint
"p2base_a_check"
DETAIL:  Failing row contains (1001).
CONTEXT:  remote SQL command: INSERT INTO public.p2base(a) VALUES ($1)

Thanks,
Amit

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Michael Paquier

On Wed, May 09, 2018 at 11:10:43AM -0400, Tom Lane wrote:
> Agreed about unique indexes.  I suggest that we throw an error for both
> cases, because (1) having unique and non-unique indexes behave differently
> for this purpose seems pretty weird; (2) throwing an error now preserves
> our options to do something different later.  Given where we are in the
> release cycle, we should be taking the most conservative path we can.

Heartfully agreed to throw an error for all cases for now.  Something
that I find confusing on HEAD though is that DefineIndex calls itself
around line 1006 and cascades through each children but there is no
context about the error.

For example if I have this partition layer:
CREATE TABLE measurement (
 logdate date not null,
 peaktempint,
 unitsales   int
) PARTITION BY RANGE (logdate);
CREATE FOREIGN TABLE measurement_y2016m07
PARTITION OF measurement FOR VALUES FROM ('2016-07-01') TO
('2016-08-01')
SERVER postgres_server;

Then by creating an index on the parent I get that:
=# create index measurement_idx on measurement (peaktemp);
ERROR:  42809: cannot create index on foreign table "measurement_y2016m07"
LOCATION:  DefineIndex, indexcmds.c:442

This may be confusing for the user as the DDL does not complain directly
about the relation worked on.  Perhaps we could add an error context?
We surely don't want multiple contexts either when working on long
chains, but we could build a chained list of relation names in the error
message.
--
Michael


signature.asc
Description: PGP signature

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Amit Langote

On 2018/05/09 22:43, Alvaro Herrera wrote:
> Amit Langote wrote:
>> On 2018/05/09 11:31, David Rowley wrote:
>>> On 9 May 2018 at 14:29, Amit Langote  wrote:
 On 2018/05/09 11:20, Michael Paquier wrote:
> While looking at this code, is there any reason to not make
> gen_partprune_steps static?  This is only used in partprune.c for now,
> so the intention is to make it available for future patches?
> 
>> Here is a patch that does that.
> 
> Pushed, thanks.

Thank you.

Regards,
Amit

Re: Postgres, fsync, and OSs (specifically linux)

2018-05-09 Thread Andres Freund

On 2018-05-01 09:38:03 +0800, Craig Ringer wrote:
> On 1 May 2018 at 00:09, Andres Freund  wrote:
> 
> > It's not. Only SYNC_FILE_RANGE_WAIT_{BEFORE,AFTER} eat errors. Which
> > seems sensible, because they could be considered data integrity
> > operations.
> 
> Ah, I misread that. Thankyou.
> 
> >> I'm very suspicious about the safety of the msync() path too.
> >
> > That seems justified however:
> 
> I'll add EIO tests there.

Do you have a patchset including that?  I didn't find anything after a
quick search...

Greetings,

Andres Freund

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Michael Paquier

On Wed, May 09, 2018 at 10:39:07AM -0300, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> This removes a useless default clause in partprune.c and it got
>> forgotten in the crowd.  Just attaching it again here, and it can just
>> be applied on top of the rest.
> 
> Done, thanks for insisting.

Thanks!
--
Michael


signature.asc
Description: PGP signature

Re: [SPAM] Re: Local partitioned indexes and pageinspect

2018-05-09 Thread Peter Geoghegan

On Wed, May 9, 2018 at 2:04 PM, Michael Paquier  wrote:
> On Wed, May 09, 2018 at 02:28:50PM -0300, Alvaro Herrera wrote:
>> I pushed some fixes produced here.  Attached is the remainder of the
>> patch you submitted.  I notice now that we haven't actually fixed
>> Peter's source of complaint, though.  AFAICS your patch just adds test
>> cases, and upthread discussion apparently converges on not doing
>> anything about the code.  I'm not yet sure what to think of that ...
>
> Thanks Álvaro.  I tend to think that there is little point in changing
> the error handling, still it would be good to get test coverage.  That's
> not really a bug though as in all those cases we don't get errors like
> "could not open file" or such.  So we could also let things as they are
> now.

Now that the relkind issue is documented, I wouldn't mind just leaving it as-is.

-- 
Peter Geoghegan

Re: [SPAM] Re: Local partitioned indexes and pageinspect

2018-05-09 Thread Michael Paquier

On Wed, May 09, 2018 at 02:28:50PM -0300, Alvaro Herrera wrote:
> I pushed some fixes produced here.  Attached is the remainder of the
> patch you submitted.  I notice now that we haven't actually fixed
> Peter's source of complaint, though.  AFAICS your patch just adds test
> cases, and upthread discussion apparently converges on not doing
> anything about the code.  I'm not yet sure what to think of that ...

Thanks Álvaro.  I tend to think that there is little point in changing
the error handling, still it would be good to get test coverage.  That's
not really a bug though as in all those cases we don't get errors like
"could not open file" or such.  So we could also let things as they are
now.
--
Michael


signature.asc
Description: PGP signature

Re: parallel.sgml for Gather with InitPlans

2018-05-09 Thread Robert Haas

On Tue, May 8, 2018 at 11:21 PM, Amit Kapila  wrote:
> On Tue, May 8, 2018 at 5:27 PM, Robert Haas  wrote:
>> On Mon, May 7, 2018 at 11:34 PM, Amit Kapila  wrote:
>>> I think we can cover InitPlan and Subplans that can be parallelized in
>>> a separate section "Parallel Subplans" or some other heading.  I think
>>> as of now we have enabled parallel subplans and initplans in a
>>> limited, but useful cases (as per TPC-H benchmark) and it might be
>>> good to cover them in a separate section.  I can come up with an
>>> initial patch (or I can review it if you write the patch) if you and
>>> or others think that makes sense.
>>
>> We could go that way, but what I wrote is short and -- I think -- accurate.
>>
>
> Okay, again thinking about it after your explanation, it appears
> correct to me, but it was not apparent on the first read.   I think
> other alternatives could be (a) Evaluation of initplan OR (b)
> Execution of initplan.  I think it makes sense to add what you have
> written or one of the alternatives suggested by me as you deem most
> appropriate.  I think one can always write a detailed explanation as a
> separate patch.

I've committed what I suggested before.  If you want to propose
another patch, feel free, but I'm not sure how much energy this is
worth.  The more detailed we make the documentation the more we have
to update the next time something changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-09 Thread Peter Geoghegan

On Thu, May 3, 2018 at 5:16 PM, Peter Geoghegan  wrote:
> On Thu, May 3, 2018 at 12:37 PM, Robert Haas  wrote:
>> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
>>  wrote:
>>> Sample output can be seen here:
>>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>>
>> Neat.  Not sure what generated this trace, but note this part:

> I don't have time to check this out just now, but it seems like an
> excellent idea. It would be nice if it could be enhanced further, so
> you get some idea of what the blocks are without having to decode them
> yourself using tools like pageinspect.

Although I'm a Linux user, I added --enable-dtrace to my standard
release build configure alias several months back. I wanted to be able
to use the user space probes it adds with BCC [1], which is a really
nice tool (the dtrace/system tap probes seem to be totally generic).
If you have a fairly recent kernel, many really nice tricks are
possible with only modest effort. I've been meaning to polish up a
selection of one-liners I came up with several months back, to make
them available for others.

Here is a one-liner that traces buffer reads within the backend with PID 6617:

pg@bat:~/notes/setup$ sudo /usr/share/bcc/tools/trace -p 6617 --time
'u:/code/postgresql/root/install/bin/postgres:postgresql:buffer__read__start
"fork: %d, blockNum: %u, spcNode: %u, dbNode: %u, relNode: %u,
backend: %d, isExtend: %d" arg1, arg2, arg3, arg4, arg5, arg6, arg7'

If I run the query "select * from pgbench_accounts where aid = 555"
within that backend, I can see the following:

TIME PIDTIDCOMM FUNC -
12:17:28 6617   6617   postgres buffer__read__start fork: 0,
blockNum: 290, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres buffer__read__start fork: 0,
blockNum: 3, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres buffer__read__start fork: 0,
blockNum: 2, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres buffer__read__start fork: 0,
blockNum: 9, spcNode: 1663, dbNode: 13100, relNode: 16396, backend:
-1, isExtend: 0

This is pretty easy to interpret: 290 is the root page, 3 in another
internal page, and 2 in the leaf page. Finally, block 9 is from the
heap relation.

I haven't actually instrumented the amount of time each read takes
here, so it's not quite as good as your dtrace script. I'm sure that
that would be trivial, since BCC really does seem to make
customization very easy. I just don't want to spend more than 10
minutes on this today.

[1] https://github.com/iovisor/bcc
-- 
Peter Geoghegan

Re: [HACKERS] Parallel Append implementation

2018-05-09 Thread Robert Haas

On Tue, May 8, 2018 at 5:05 PM, Thomas Munro
 wrote:
> +scanning them more than once would preduce duplicate results.  Plans that
>
> s/preduce/produce/

Fixed, thanks.

> +Append or MergeAppend plan node.
> vs.
> +Append of regular Index Scan plans; each
>
> I think we should standardise on Foo Bar,
> FooBar or foo bar when
> discussing executor nodes on this page.

Well, EXPLAIN prints MergeAppend but Index Scan, and I think we should
follow that precedent here.

As for  vs. , I think the reason I ended up using
 in the section on scans was because I thought that
Parallel Seq Scan might be confusing (what's a
"seq"?), so I tried to fudge my way around that by referring to it as
an abstract idea rather than the exact EXPLAIN output.  You then
copied that style in the join section, and, well, like you say, now we
have a sort of hodgepodge of styles.  Maybe that's a problem for
another patch, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

parallel-append-doc-v2.patch
Description: Binary data

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 1:37 PM, Merlin Moncure  wrote:
> On Wed, May 9, 2018 at 11:00 AM Robert Haas  wrote:
>> Independently of that, it would be probably also useful to avoid
>> bumping the reference count multiple times when a buffer is accessed
>> by the same backend several times in quick succession.  Perhaps this
>> could even be as simple as maintaining a list of the last two buffer
>> IDs for which we bumped the usage count; if we see one of them again,
>> don't bump the usage count again.
>
> Hm.  Is the objective here to optimize access patterns or to reduce atomic
> operations (or both)?   All else being equal, an algorithm that delivers
> the similar eviction results with less cache synchronization ought to be
> preferred...are the various algorithms scored in that way?

The CAR paper explains steps they've taken to minimize atomic
operations, so it is something that people do think about.

The main reason for wanting to avoid extra usage count bumps is to
avoid distorting the usage counts.  Any savings we get from reducing
atomic ops is a bonus.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-09 Thread Merlin Moncure

On Wed, May 9, 2018 at 11:00 AM Robert Haas  wrote:

> Independently of that, it would be probably also useful to avoid
> bumping the reference count multiple times when a buffer is accessed
> by the same backend several times in quick succession.  Perhaps this
> could even be as simple as maintaining a list of the last two buffer
> IDs for which we bumped the usage count; if we see one of them again,
> don't bump the usage count again.

Hm.  Is the objective here to optimize access patterns or to reduce atomic
operations (or both)?   All else being equal, an algorithm that delivers
the similar eviction results with less cache synchronization ought to be
preferred...are the various algorithms scored in that way?

merlin

Re: [HACKERS] Clock with Adaptive Replacement

2018-05-09 Thread Robert Haas

On Tue, May 8, 2018 at 10:22 AM, Юрий Соколов  wrote:
> "just prior to the point" means we have to have machinery for information
> expiration without eviction. For example, same "clock hand" should walk
> over all buffers continiously, and decrement "usage count", but without
> eviction performed. Right?

Right.

> As alternative, some approximation of Working Set algorithm could be used:
> - on every access "access time" should be written to item,
> - items accessed before some "expiration interval" are considered for
> expiration.
> This way, there is also fresh information about usage even without any
> expiration performed yet.

That's basically the same kind of thing.  Ideally we want to have one
mechanism that operates all the time, rather than one that works when
no eviction is occurring and a completely separate one that operates
when eviction is occurring.

Anyway, I think we really ought to investigate CAR.  CAR could be made
to piggyback on our existing GLOCK scheme.  See section III.B of the
the CAR paper, page 4, especially the second paragraph in the second
column.  What would change is that instead of sweeping through all
buffers, there would be two sets of buffers T1 and T2, each with a
separate clock hand.  Also, we'd need to store the history lists B1
and B2 someplace.

Independently of that, it would be probably also useful to avoid
bumping the reference count multiple times when a buffer is accessed
by the same backend several times in quick succession.  Perhaps this
could even be as simple as maintaining a list of the last two buffer
IDs for which we bumped the usage count; if we see one of them again,
don't bump the usage count again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)

2018-05-09 Thread Tom Lane

Alvaro Herrera  writes:
> Tom Lane wrote:
>> Robert Haas  writes:
>>> Who says we need a portable way?  If we had something that worked on
>>> Linux and macOS, it would cover most developer environments.  I wonder
>>> if readlink("/etc/localtime", buf, sz) might be a viable approach.

>> I wondered about that, but I'm afraid it's often a hardlink not a
>> symlink.  Still, we could try it.

> In Debian systems, it's a symlink.  Apparently in RHEL6 and older it's a
> copy or hardlink, and the file /etc/sysconfig/clock contains a ZONE
> variable that points to the right zone.

Yeah, on my RHEL6 box,

$ ls -l /etc/localtime 
-rw-r--r--. 1 root root 3519 May  4  2010 /etc/localtime

The mod date's a little astonishing, considering this system was built
in 2013.  It is *not* a hardlink, or even an exact match, to anything
under /usr/share/zoneinfo, though perhaps it was originally.  Also:

$ cat /etc/sysconfig/clock
# The time zone of the system is defined by the contents of /etc/localtime.
# This file is only for evaluation by system-config-date, do not rely on its
# contents elsewhere.
ZONE="America/New York"

I'm inclined to respect the comment, especially since I see they are not
spelling the zone name canonically anyway (note space not underscore);
so looking at this wouldn't work without some ill-defined heuristics.

However, more modern Red Hat systems seem to have /etc/localtime as
a symlink, so checking it would work there, and also macOS seems to
do it that way for as far back as I can check.

> This comment is insightful:
> https://github.com/HowardHinnant/date/issues/269#issuecomment-353792132
> It's talking about this code:
> https://github.com/HowardHinnant/date/blob/master/src/tz.cpp#L3652

Interesting.  Not sure if we want to try all those files.  But I'll
take a look at whipping up something that checks /etc/localtime.

regards, tom lane

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher

Simon Riggs  writes:

> Indexes on foreign tables cause an ERROR, so yes, we already just
> don't create them.
>
> You're suggesting silently skipping the ERROR. I can't see a reason for that.

Truly, I was inaccurate. I mean that index propagation is a nice feature,
and making it available for mix of foreign and local partitions skipping
the foreign ones is useful thing for users who understand the
consequences (of course there should be a warning in the docs -- my
patch was just an illustration of the idea). As I said, FDW-based
sharding often involves mix of foreign and local partitions, even in
simple manual forms. Imagine that you have a table which is too big to
fit into one server, so you partition it and move some partiitons to
another machine; you still would like to access the whole table to avoid
manual tuple routing. Or, you partition table by timestamps and keep
recent data on fast primary server, gradually moving old partitions to
another box. Again, it would be nice to access table as a whole to
e.g. calculate some analytics.

Anyway, given other opinions, I assume that the community has chosen
more conservative approach. Indeed, we can't guarantee uniqueness this
way; that's fully users responsiblity.

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 11:39 AM, Alvaro Herrera
 wrote:
> In Debian systems, it's a symlink.  Apparently in RHEL6 and older it's a
> copy or hardlink, and the file /etc/sysconfig/clock contains a ZONE
> variable that points to the right zone.  Maybe if we add enough
> platform-dependent hacks, we would use the slow fallback only for rare
> cases.  (Maybe have initdb emit a warning when the fallback is used, so
> that we know what else to look for.)

I just checked a couple of RHEL7 systems and it seems to be a symlink
there.  It's also a symlink on my laptop (macOS 10.13.3).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 11:12 AM, Simon Riggs  wrote:
> If we can assume an index exists on a foreign table, why can we not
> just assume a unique index exists?? Why the difference?

We can't assume either of those things, and I didn't say that we should.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 11:20 AM, Simon Riggs  wrote:
> On 9 May 2018 at 16:15, Robert Haas  wrote:
>> On Wed, May 9, 2018 at 11:14 AM, Simon Riggs  wrote:
>>> On 9 May 2018 at 16:10, Tom Lane  wrote:
 Robert Haas  writes:
> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
>> Shouldn't the fix be to allow creation of indexes on foreign tables?
>> (Maybe they would be virtual or foreign indexes??)

> It might be useful to invent the concept of a foreign index, but not
> for v11 a month after feature freeze.

 Yeah.  That's a can of worms we can *not* open at this stage.
>>>
>>> Lucky nobody suggested that then, eh? Robert's just making a joke.
>>
>> Someone did suggest that.  It was you.
>
> Oh, you weren't joking. I think we're having serious problems with
> people putting words in my mouth again then.
>
> Please show me where I suggested doing anything for v11?

Come on, Simon.  It's in the quoted text.  I realize you didn't say
v11 specifically, but this is the context of a patch that is proposed
a bug-fix for v11.  If you meant that we should apply the patch as
proposed now, or some other one, and do the other thing later, you
could have said so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Simon Riggs

On 9 May 2018 at 16:15, Robert Haas  wrote:
> On Wed, May 9, 2018 at 11:14 AM, Simon Riggs  wrote:
>> On 9 May 2018 at 16:10, Tom Lane  wrote:
>>> Robert Haas  writes:
 On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
> Shouldn't the fix be to allow creation of indexes on foreign tables?
> (Maybe they would be virtual or foreign indexes??)
>>>
 It might be useful to invent the concept of a foreign index, but not
 for v11 a month after feature freeze.
>>>
>>> Yeah.  That's a can of worms we can *not* open at this stage.
>>
>> Lucky nobody suggested that then, eh? Robert's just making a joke.
>
> Someone did suggest that.  It was you.

Oh, you weren't joking. I think we're having serious problems with
people putting words in my mouth again then.

Please show me where I suggested doing anything for v11?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)

2018-05-09 Thread Tom Lane

Robert Haas  writes:
> On Tue, May 8, 2018 at 4:48 PM, Tom Lane  wrote:
>> Really the only thing here that jumps out as being unduly expensive for
>> what it's doing is select_default_timezone.  That is, and always has been,
>> a brute-force algorithm; I wonder if there's a way to do better?

> Who says we need a portable way?  If we had something that worked on
> Linux and macOS, it would cover most developer environments.  I wonder
> if readlink("/etc/localtime", buf, sz) might be a viable approach.

I wondered about that, but I'm afraid it's often a hardlink not a
symlink.  Still, we could try it.

> Also, how about having a --timezone option for initdb?

We already have that, it's called the TZ environment variable.

There was an effort awhile back to try to speed up the buildfarm
by having that get set automatically, but it failed for reasons
I don't recall ATM.

regards, tom lane

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 11:14 AM, Simon Riggs  wrote:
> On 9 May 2018 at 16:10, Tom Lane  wrote:
>> Robert Haas  writes:
>>> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
 Shouldn't the fix be to allow creation of indexes on foreign tables?
 (Maybe they would be virtual or foreign indexes??)
>>
>>> It might be useful to invent the concept of a foreign index, but not
>>> for v11 a month after feature freeze.
>>
>> Yeah.  That's a can of worms we can *not* open at this stage.
>
> Lucky nobody suggested that then, eh? Robert's just making a joke.

Someone did suggest that.  It was you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Simon Riggs

On 9 May 2018 at 16:10, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
>>> Shouldn't the fix be to allow creation of indexes on foreign tables?
>>> (Maybe they would be virtual or foreign indexes??)
>
>> It might be useful to invent the concept of a foreign index, but not
>> for v11 a month after feature freeze.
>
> Yeah.  That's a can of worms we can *not* open at this stage.

Lucky nobody suggested that then, eh? Robert's just making a joke.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Simon Riggs

On 9 May 2018 at 15:57, Robert Haas  wrote:

> For right now, I think the options are (1) throw an ERROR if we
> encounter a foreign table or (2) silently skip the foreign table.  I
> think (2) is defensible for non-UNIQUE indexes, because the index is
> just a performance optimization. However, for UNIQUE indexes, at
> least, it seems like we'd better do (1), because a major point of such
> an index is to enforce a constraint; we can't allege that we have such
> a constraint if foreign tables are just silently skipped.

If we can assume an index exists on a foreign table, why can we not
just assume a unique index exists?? Why the difference?

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Tom Lane

Robert Haas  writes:
> On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
>> Shouldn't the fix be to allow creation of indexes on foreign tables?
>> (Maybe they would be virtual or foreign indexes??)

> It might be useful to invent the concept of a foreign index, but not
> for v11 a month after feature freeze.

Yeah.  That's a can of worms we can *not* open at this stage.

> For right now, I think the options are (1) throw an ERROR if we
> encounter a foreign table or (2) silently skip the foreign table.  I
> think (2) is defensible for non-UNIQUE indexes, because the index is
> just a performance optimization.  However, for UNIQUE indexes, at
> least, it seems like we'd better do (1), because a major point of such
> an index is to enforce a constraint; we can't allege that we have such
> a constraint if foreign tables are just silently skipped.

Agreed about unique indexes.  I suggest that we throw an error for both
cases, because (1) having unique and non-unique indexes behave differently
for this purpose seems pretty weird; (2) throwing an error now preserves
our options to do something different later.  Given where we are in the
release cycle, we should be taking the most conservative path we can.

regards, tom lane

Re: [HACKERS] Cutting initdb's runtime (Perl question embedded)

2018-05-09 Thread Robert Haas

On Tue, May 8, 2018 at 4:48 PM, Tom Lane  wrote:
> Really the only thing here that jumps out as being unduly expensive for
> what it's doing is select_default_timezone.  That is, and always has been,
> a brute-force algorithm; I wonder if there's a way to do better?  We can
> probably guess that every non-Windows platform is using the IANA timezone
> data these days.  If there were some way to extract the name of the active
> timezone setting directly, we wouldn't have to try to reverse-engineer it.
> But I don't know of any portable way :-(

Who says we need a portable way?  If we had something that worked on
Linux and macOS, it would cover most developer environments.  I wonder
if readlink("/etc/localtime", buf, sz) might be a viable approach.
You could, at least, try any time zone you can derive that way first,
before you try any others.

Also, how about having a --timezone option for initdb?  Then you could
determine the time zone just once and pass it down to subsequent
initdb invocations.  Or just hardcode the regression tests to use some
fixed time zone, like Antarctica/Troll.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Robert Haas

On Wed, May 9, 2018 at 9:08 AM, Simon Riggs  wrote:
> How much sense is it to have a partitioned table with a mix of local
> and foreign tables?

Fair question, but we put some effort into making it work, so I think
it should keep working.

> Shouldn't the fix be to allow creation of indexes on foreign tables?
> (Maybe they would be virtual or foreign indexes??)

It might be useful to invent the concept of a foreign index, but not
for v11 a month after feature freeze.

For right now, I think the options are (1) throw an ERROR if we
encounter a foreign table or (2) silently skip the foreign table.  I
think (2) is defensible for non-UNIQUE indexes, because the index is
just a performance optimization.  However, for UNIQUE indexes, at
least, it seems like we'd better do (1), because a major point of such
an index is to enforce a constraint; we can't allege that we have such
a constraint if foreign tables are just silently skipped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Global snapshots

2018-05-09 Thread Robert Haas

On Tue, May 8, 2018 at 4:51 PM, Stas Kelvich  wrote:
>> On 7 May 2018, at 20:04, Robert Haas  wrote:
>> But what happens if a transaction starts on node A at time T0 but
>> first touches node B at a much later time T1, such that T1 - T0 >
>> global_snapshot_defer_time?
>
> Such transaction will get "global snapshot too old" error.

Ouch.  That's not so bad at READ COMMITTED, but at higher isolation
levels failure becomes extremely likely.  Any multi-statement
transaction that lasts longer than global_snapshot_defer_time is
pretty much doomed.

> In principle such behaviour can be avoided by calculating oldest
> global csn among all cluster nodes and oldest xmin on particular
> node will be held only when there is some open old transaction on
> other node. It's easy to do from global snapshot point of view,
> but it's not obvious how to integrate that into postgres_fdw. Probably
> that will require bi-derectional connection between postgres_fdw nodes
> (also distributed deadlock detection will be easy with such connection).

I don't think holding back xmin is a very good strategy.  Maybe it
won't be so bad if and when we get zheap, since only the undo log will
bloat rather than the table.  But as it stands, holding back xmin
means everything bloats and you have to CLUSTER or VACUUM FULL the
table in order to fix it.

If the behavior were really analogous to our existing "snapshot too
old" feature, it wouldn't be so bad.  Old snapshots continue to work
without error so long as they only read unmodified data, and only
error out if they hit modified pages.  SERIALIZABLE works according to
a similar principle: it worries about data that is written by one
transaction and read by another, but if there's a portion of the data
that is only read and not written, or at least not written by any
transactions that were active around the same time, then it's fine.
While the details aren't really clear to me, I'm inclined to think
that any solution we adopt for global snapshots ought to leverage this
same principle in some way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Simon Riggs

On 9 May 2018 at 15:26, Arseny Sher  wrote:
>
> Simon Riggs  writes:
>
>> How much sense is it to have a partitioned table with a mix of local
>> and foreign tables?
>
> Well, as much sense as fdw-based sharding has, for instance. It is
> arguable, but it exists.
>
>> Shouldn't the fix be to allow creation of indexes on foreign tables?
>> (Maybe they would be virtual or foreign indexes??)
>
> Similar ideas were discussed at [1]. There was no wide consensus of even
> what problems such feature would solve. Since currently indexes on
> foreign tables are just forbidden, it seems to me that the best what
> partitioning code can do today is just not creating them.
>
> [1] 
> https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp

Indexes on foreign tables cause an ERROR, so yes, we already just
don't create them.

You're suggesting silently skipping the ERROR. I can't see a reason for that.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Alvaro Herrera

Marina Polyakova wrote:
> Hello everyone in this thread!

> I got a similar server crash as in [1] on the master branch since the commit
> 9fdb675fc5d2de825414e05939727de8b120ae81 when the assertion fails because
> the second argument ScalarArrayOpExpr is not a Const or an ArrayExpr, but is
> an ArrayCoerceExpr (see [2]):

Hello Marina, thanks for reporting this.  I have pushed all fixes
derived from this report -- thanks to Amit and Michaël for those.
I verified your test case no longer crashes.  If you have more elaborate
test cases, please do try these too.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher

Simon Riggs  writes:

> How much sense is it to have a partitioned table with a mix of local
> and foreign tables?

Well, as much sense as fdw-based sharding has, for instance. It is
arguable, but it exists.

> Shouldn't the fix be to allow creation of indexes on foreign tables?
> (Maybe they would be virtual or foreign indexes??)

Similar ideas were discussed at [1]. There was no wide consensus of even
what problems such feature would solve. Since currently indexes on
foreign tables are just forbidden, it seems to me that the best what
partitioning code can do today is just not creating them.

[1] 
https://www.postgresql.org/message-id/flat/4F62FD69.2060007%40lab.ntt.co.jp#4f62fd69.2060...@lab.ntt.co.jp

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Alvaro Herrera

Amit Langote wrote:
> On 2018/05/09 11:31, David Rowley wrote:
> > On 9 May 2018 at 14:29, Amit Langote  wrote:
> >> On 2018/05/09 11:20, Michael Paquier wrote:
> >>> While looking at this code, is there any reason to not make
> >>> gen_partprune_steps static?  This is only used in partprune.c for now,
> >>> so the intention is to make it available for future patches?

> Here is a patch that does that.

Pushed, thanks.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Alvaro Herrera

Michael Paquier wrote:

> Alvaro, could it be possible to consider as well the patch I posted
> here?
> https://www.postgresql.org/message-id/20180424012042.gd1...@paquier.xyz
> 
> This removes a useless default clause in partprune.c and it got
> forgotten in the crowd.  Just attaching it again here, and it can just
> be applied on top of the rest.

Done, thanks for insisting.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Simon Riggs

On 9 May 2018 at 12:50, Arseny Sher  wrote:
> Hi,
>
> 8b08f7d4 added propagation of indexes on partitioned tables to
> partitions, which is very cool. However, index creation also recurses
> down to foreign tables. I doubt this is intentional, as such indexes are
> forbidden as not making much sense; attempt to create index on
> partitioned table with foreign partition leads to an error
> now. Attached lines fix this.

"Fix"?

How much sense is it to have a partitioned table with a mix of local
and foreign tables?

Shouldn't the fix be to allow creation of indexes on foreign tables?
(Maybe they would be virtual or foreign indexes??)

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Ashutosh Bapat

On Wed, May 9, 2018 at 5:20 PM, Arseny Sher  wrote:
> Hi,
>
> 8b08f7d4 added propagation of indexes on partitioned tables to
> partitions, which is very cool. However, index creation also recurses
> down to foreign tables. I doubt this is intentional, as such indexes are
> forbidden as not making much sense; attempt to create index on
> partitioned table with foreign partition leads to an error
> now. Attached lines fix this.
>

The patch looks good, but I guess that we have to do some tricks with
the validity of index on the partitioned table since not all
partitions have a given index when one of those is a foreign
partition.
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-05-09 Thread Etsuro Fujita

(2018/05/08 16:20), Ashutosh Bapat wrote:

On Tue, May 1, 2018 at 5:00 PM, Etsuro Fujita
  wrote:

Here are my review comments on patch
0003-postgres_fdw-child-join-with-ConvertRowtypeExprs-cau.patch:

* This assertion in deparseConvertRowtypeExpr wouldn't always be true
because of cases like ALTER FOREIGN TABLE DROP COLUMN plus ALTER FOREIGN
TABLE ADD COLUMN:

+   Assert(parent_desc->natts == child_desc->natts);

I think we should remove that assertion.

Removed.

* But I don't think that change is enough.  Here is another oddity with that
assertion removed.

To fix this, I think we should skip the deparseColumnRef processing for
dropped columns in the parent table.

Done.

Thanks for those changes!

I have changed the test file a bit to test these scenarios.

+1

* I think it would be better to do EXPLAIN with the VERBOSE option to the
queries this patch added to the regression tests, to see if
ConvertRowtypeExprs are correctly deparsed in the remote queries.

The reason VERBOSE option was omitted for partition-wise join was to
avoid repeated and excessive output for every child-join. I think that
still applies.

I'd like to leave that for the committer's judge.

PFA patch-set with the fixes.

Thanks for updating the patch!

I also noticed that the new function deparseConvertRowtypeExpr is not
quite following the naming convention of the other deparseFoo
functions. Foo is usually the type of the node the parser would
produced when the SQL string produced by that function is parsed. That
doesn't hold for the SQL string produced by ConvertRowtypeExpr but
then naming it as generic deparseRowExpr() wouldn't be correct either.
And then there are functions like deparseColumnRef which may produce a
variety of SQL strings which get parsed into different node types e.g.
a whole-row reference would produce ROW(...) which gets parsed as a
RowExpr. Please let me know if you have any suggestions for the name.

To be honest, I don't have any strong opinion about that.  But I like 
"deparseConvertRowtypeExpr" because that name seems to well represent 
the functionality, so I'd vote for that.

BTW, one thing I'm a bit concerned about is this:

(2018/04/25 18:51), Ashutosh Bapat wrote:
> Actually I noticed that ConvertRowtypeExpr are used to cast a child's
> whole row reference expression (not just a Var node) into that of its
> parent and not. For example a cast like NULL::child::parent produces a
> ConvertRowtypeExpr encapsulating a NULL constant node and not a Var
> node. We are interested only in ConvertRowtypeExprs embedding Var
> nodes with Var::varattno = 0. I have changed this code to use function
> is_converted_whole_row_reference() instead of the above code with
> Assert. In order to use that function, I had to take it out of
> setrefs.c and put it in nodeFuncs.c which seemed to be a better fit.

This change seems a bit confusing to me because the flag bits 
"PVC_INCLUDE_CONVERTROWTYPES" and "PVC_RECURSE_CONVERTROWTYPES" passed 
to pull_var_clause look as if it handles any ConvertRowtypeExpr nodes 
from a given clause, but really, it only handles ConvertRowtypeExprs you 
mentioned above.  To make that function easy to understand and use, I 
think it'd be better to use the IsA(node, ConvertRowtypeExpr) test as in 
the first version of the patch, instead of 
is_converted_whole_row_reference, which would be more consistent with 
other cases such as PlaceHolderVar.

Best regards,
Etsuro Fujita

Indexes on partitioned tables and foreign partitions

2018-05-09 Thread Arseny Sher

Hi,

8b08f7d4 added propagation of indexes on partitioned tables to
partitions, which is very cool. However, index creation also recurses
down to foreign tables. I doubt this is intentional, as such indexes are
forbidden as not making much sense; attempt to create index on
partitioned table with foreign partition leads to an error
now. Attached lines fix this.
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** DefineIndex(Oid relationId,
*** 915,920 
--- 915,926 
  int			maplen;
  
  childrel = heap_open(childRelid, lockmode);
+ /* Foreign table doesn't need indexes */
+ if (childrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ {
+ 	heap_close(childrel, NoLock);
+ 	continue;
+ }
  childidxs = RelationGetIndexList(childrel);
  attmap =
  	convert_tuples_by_name_map(RelationGetDescr(childrel),
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** AttachPartitionEnsureIndexes(Relation re
*** 14352,14357 
--- 14352,14361 
  	MemoryContext cxt;
  	MemoryContext oldcxt;
  
+ 	/* Foreign table doesn't need indexes */
+ 	if (attachrel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
+ 		return;
+ 
  	cxt = AllocSetContextCreate(CurrentMemoryContext,
  "AttachPartitionEnsureIndexes",
  ALLOCSET_DEFAULT_SIZES);

--
Arseny Sher
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Setting libpq TCP keepalive parameters from environment

2018-05-09 Thread Oleksandr Shulgin

On Wed, May 9, 2018 at 1:58 AM, Craig Ringer  wrote:

> >
> > It would be much more convenient to just set the environment variable
> when
> > running the script and let it affect the whole process and its children.
> >
> > Would a patch be welcome?
>
> I can't really comment on that part, but:
>
> PGOPTIONS="-c tcp_keepalives_count=5 -c tcp_keepalives_interval=60"
> psql 'host=localhost'
>
> should enable server-side keepalives. Unsure how much that helps you
> if you need client-side keepalives too.
>

Good point, in our specific case it appears to work as well if it's the
server who sends the keepalives.

Thank you,
--
Alex

Re: Should we add GUCs to allow partition pruning to be disabled?

2018-05-09 Thread Amit Langote

On 2018/05/09 13:14, Amit Langote wrote:
> Hi David.
> 
> Thanks for addressing my comments.
> 
> On 2018/05/07 15:00, David Rowley wrote:
>> v2 patch is attached.
> 
> Looks good to me.

Sorry, I should've seen noticed v3 before sending my email.

v3 looks good too, but when going through it, I noticed one bit in 5.10.4.
Partitioning and Constraint Exclusion:

 A good rule of thumb is that partitioning constraints should
  contain only comparisons of the partitioning column(s) to constants
  using B-tree-indexable operators, which applies even to partitioned
  tables, because only B-tree-indexable column(s) are allowed in the
  partition key.

I think the part after ", which applies even to partitioned tables,.."
should be removed.

Attached find the updated patch.

Thanks,
Amit
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ffea744cb8..76606a8535 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3841,6 +3841,11 @@ ANY num_sync (  for more information
+on partition pruning and partitioning.
+   
   
  
 
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 004ecacbbf..d02edd771f 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3760,7 +3760,7 @@ ANALYZE measurement;
   
 
   
-   Partitioning and Constraint Exclusion
+   Inheritance Partitioning and Constraint Exclusion
 

 constraint exclusion
@@ -3768,9 +3768,8 @@ ANALYZE measurement;
 

 Constraint exclusion is a query optimization 
technique
-that improves performance for partitioned tables defined in the
-fashion described above (both declaratively partitioned tables and those
-implemented using inheritance).  As an example:
+that improves performance for inheritance partitioned tables defined in the
+fashion described above.  As an example:
 
 
 SET constraint_exclusion = on;
@@ -3847,15 +3846,14 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate 
= DATE '2008-01-01';
  is actually neither
 on nor off, but an intermediate 
setting
 called partition, which causes the technique to be
-applied only to queries that are likely to be working on partitioned
+applied only to queries that are likely to be working on inheritance 
partitioned
 tables.  The on setting causes the planner to examine
 CHECK constraints in all queries, even simple ones that
 are unlikely to benefit.

 

-The following caveats apply to constraint exclusion, which is used by
-both inheritance and partitioned tables:
+The following caveats apply to constraint exclusion:
 

 
@@ -3877,11 +3875,7 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate 
= DATE '2008-01-01';
   range tests for range partitioning, as illustrated in the preceding
   examples.  A good rule of thumb is that partitioning constraints should
   contain only comparisons of the partitioning column(s) to constants
-  using B-tree-indexable operators, which applies even to partitioned
-  tables, because only B-tree-indexable column(s) are allowed in the
-  partition key.  (This is not a problem when using declarative
-  partitioning, since the automatically generated constraints are simple
-  enough to be understood by the planner.)
+  using B-tree-indexable operators.
  
 
 
@@ -3898,6 +3892,94 @@ EXPLAIN SELECT count(*) FROM measurement WHERE logdate 
= DATE '2008-01-01';


   
+
+  
+   Declarative Partitioning and Partition Pruning
+
+   
+partition pruning
+   
+
+   
+Partition pruning is a query optimization technique
+similar to constraint exclusion, but applies only to declaratively
+partitioned tables.  Like constraint exclusion, this uses (but is not
+limited to using) the query's WHERE clause to exclude
+partitions which cannot possibly contain any matching records.
+   
+
+   
+Partition pruning is much more efficient than constraint exclusion, since
+it avoids scanning each partition's metadata to determine if the partition
+is required for a particular query.
+   
+
+   
+Partition pruning is also more powerful than constraint exclusion as it
+can be performed not only during the planning of a given query, but also
+during its execution.  This is useful as it can allow more partitions to
+be pruned when clauses contain expressions whose values are unknown to the
+query planner.  For example, parameters defined in a
+PREPARE statement, using a value obtained from a
+subquery or using a parameterized value on the inner side of a nested loop
+join.
+   
+
+   
+Partition pruning during execution can be performed at any of the
+following times:
+
+
+ 
+  
+   During initialization of the query plan.  Partition pruning can be
+   performed here for parameter values which are known during the
+   initialization phase of execution.

Re: [HACKERS] path toward faster partition pruning

2018-05-09 Thread Michael Paquier

On Wed, May 09, 2018 at 02:01:26PM +0900, Amit Langote wrote:
> On 2018/05/09 11:31, David Rowley wrote:
>> On 9 May 2018 at 14:29, Amit Langote  wrote:
>>> On 2018/05/09 11:20, Michael Paquier wrote:
 While looking at this code, is there any reason to not make
 gen_partprune_steps static?  This is only used in partprune.c for now,
 so the intention is to make it available for future patches?
>>>
>>> Yeah, making it static might be a good idea.  I had made it externally
>>> visible, because I was under the impression that the runtime pruning
>>> related code would want to call it from elsewhere within the planner.
>>> But, instead it introduced a make_partition_pruneinfo() which in turn
>>> calls get_partprune_steps.
>> 
>> Yeah. Likely left over from when run-time pruning was generating the
>> steps during execution rather than during planning.
> 
> Here is a patch that does that.

Thanks, Amit.

Alvaro, could it be possible to consider as well the patch I posted
here?
https://www.postgresql.org/message-id/20180424012042.gd1...@paquier.xyz

This removes a useless default clause in partprune.c and it got
forgotten in the crowd.  Just attaching it again here, and it can just
be applied on top of the rest.
--
Michael
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index f8844ef2eb..cbbb4c1827 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -2950,10 +2950,6 @@ perform_pruning_combine_step(PartitionPruneContext *context,
 }
 			}
 			break;
-
-		default:
-			elog(ERROR, "invalid pruning combine op: %d",
- (int) cstep->combineOp);
 	}
 
 	return result;


signature.asc
Description: PGP signature

53 matches

Mail list logo