Re: why partition pruning doesn't work?

2018-07-18 Thread Amit Langote
On 2018/07/16 2:02, Tom Lane wrote:
> Amit Langote  writes:
>> On 2018/06/19 2:05, Tom Lane wrote:
>>> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
>>> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.
> 
>> Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
>> have to have all the partitioned tables (contained in partitioned_rels
>> fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
>> in a global list like rootResultRelations and nonleafResultRelations of
>> PlannedStmt.
> 
>> Attached updated patch.
> 
> I've been looking at this patch, and while it's not unreasonable on its
> own terms, I'm growing increasingly distressed at the ad-hoc and rather
> duplicative nature of the data structures that have gotten stuck into
> plan trees thanks to partitioning (the rootResultRelations and
> nonleafResultRelations lists being prime examples).
> 
> It struck me this morning that a whole lot of the complication here is
> solely due to needing to identify the right type of relation lock to take
> during executor startup, and that THAT WORK IS TOTALLY USELESS.

I agree.  IIUC, that's also the reason why PlannedStmt had to grow a
resultRelations field in the first place.

> In every
> case, we must already hold a suitable lock before we ever get to the
> executor; either one acquired during the parse/plan pipeline, or one
> re-acquired by AcquireExecutorLocks in the case of a cached plan.
> Otherwise it's entirely possible that the plan has been invalidated by
> concurrent DDL --- and it's not the executor's job to detect that and
> re-plan; that *must* have been done upstream.
> 
> Moreover, it's important from a deadlock-avoidance standpoint that these
> locks get acquired in the same order as they were acquired during the
> initial parse/plan pipeline.  I think it's reasonable to assume they were
> acquired in RTE order in that stage, so AcquireExecutorLocks is OK.  But,
> if the current logic in the executor gets them in that order, it's both
> non-obvious that it does so and horribly fragile if it does, seeing that
> the responsibility for this is split across InitPlan,
> ExecOpenScanRelation, and ExecLockNonLeafAppendTables.
> 
> So I'm thinking that what we really ought to do here is simplify executor
> startup to just open all rels with NoLock, and get rid of any supporting
> data structures that turn out to have no other use.  (David Rowley's
> nearby patch to create a properly hierarchical executor data structure for
> partitioning info is likely to tie into this too, by removing some other
> vestigial uses of those lists.)

I agree it would be nice to be able to get rid of those lists.

> I think that this idea has been discussed in the past, and we felt at
> the time that having the executor take its own locks was a good safety
> measure, and a relatively cheap one since the lock manager is pretty
> good at short-circuiting duplicative lock requests.  But those are
> certainly not free.  Moreover, I'm not sure that this is really a
> safety measure at all: if the executor were really taking any lock
> not already held, it'd be masking a DDL hazard.
> 
> To investigate this further, I made the attached not-meant-for-commit
> hack to verify whether InitPlan and related executor startup functions
> were actually taking any not-previously-held locks.  I could only find
> one such case: the parser always opens tables selected FOR UPDATE with
> RowShareLock, but if we end up implementing the resulting row mark
> with ROW_MARK_COPY, the executor is acquiring just AccessShareLock
> (because ExecOpenScanRelation thinks it needs to acquire some lock).
> The patch as presented hacks ExecOpenScanRelation to avoid that, and
> it passes check-world.
> 
> What we'd be better off doing, if we go this route, is to install an
> assertion-build-only test that verifies during relation_open(NoLock)
> that some kind of lock is already held on the rel.  That would protect
> not only the executor, but a boatload of existing places that open
> rels with NoLock on the currently-unverified assumption that a lock is
> already held.

+1

> I'm also rather strongly tempted to add a field to RangeTblEntry
> that records the appropriate lock strength to take, so that we don't
> have to rely on keeping AcquireExecutorLocks' logic to decide on the
> lock type in sync with whatever the parse/plan pipeline does.  (One
> could then imagine adding assertions in the executor that this field
> shows a lock strength of at least X, in place of actually opening
> the rel with X.)
> 
> BTW, there'd be a lot to be said for having InitPlan just open all
> the rels and build an array of Relation pointer

Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-18 Thread Amit Langote
Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:
> I've looked over the code and the ExecUseUpdateResultRelForRouting()
> function is broken.  Your while loop only skips partitions for the
> current partitioned table, it does not skip ModifyTable subnodes that
> belong to other partitioned tables.
> 
> You can use the following. The code does not find the t1_a2 subnode.
> 
> create table t1 (a int, b int) partition by list(a);
> create table t1_a1 partition of t1 for values in(1) partition by list(b);
> create table t1_a2 partition of t1 for values in(2);
> create table t1_a1_b1 partition of t1_a1 for values in(1);
> create table t1_a1_b2 partition of t1_a1 for values in(2);
> insert into t1 values(2,2);
> 
> update t1 set a = a;

Hmm, it indeed is broken.

> I think there might not be enough information to make this work
> correctly, as if you change the loop to skip subnodes, then it won't
> work in cases where the partition at index 0 was pruned.
> 
> I've another patch sitting here, partly done, that changes
> pg_class.relispartition into pg_class.relpartitionparent.  If we had
> that then we could code your loop to work correctly.
> 
> Alternatively, I guess we could just ignore the UPDATE's ResultRelInfos
> and just build new ones. Unsure if there's actually a reason we need to
> reuse the existing ones, is there?

We try to reuse the existing ones because, back when the patch was written
(not by me though), we thought that redoing for each partition all the
work that InitResultRelInfo does, when an existing ResultRelInfo could be
used instead, would cumulatively end up being more expensive than figuring
out which ones we could reuse by scanning the partition and result rel
arrays in parallel.  I don't remember seeing a benchmark to demonstrate
the benefit of doing this though.  Maybe it was posted, but I don't
remember having looked at it closely.

> I think you'd need to know the owning partition and skip subnodes that
> don't belong to pd->reldesc. Alternatively, a hashtable could be built
> with all the oids belonging to pd->reldesc, then we could loop over
> the update_rris finding subnodes that can be found in the hashtable.
> Likely this will be much slower than the sort of merge lookup that the
> previous code did.

I think one option is to simply give up on the idea of matching *all*
UPDATE result rels that belong to a given partitioned table (pd->reldesc)
in one call of ExecUseUpdateResultRelForRouting.  Instead, pass the index
of the partition (in pd->partdesc->oids) to find the ResultRelInfo for,
loop over all UPDATE result rels looking for one, and return immediately
on finding one, after having stored its pointer in proute->partitions.  In
the worst case, we'll end up scanning the UPDATE result rels array for
every partition that gets touched, but maybe such UPDATE queries are less
common, and even when one occurs, tuple routing might be the least of its
bottlenecks.

I have implemented that approach in the updated patch.

That means I also needed to change things so that
ExecUseUpdateResultRelsForRouting is now only called by ExecFindPartition,
because with the new arrangement, it's useless to call it from
ExecSetupPartitionTupleRouting.  Moreover, an UPDATE may not use tuple
routing at all, even if the fact that the partition key is being updated
results in ExecSetupPartitionTupleRouting being called.

> Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
> The code seems to assume that there can only be at the most 65536
> partitions, but I don't think there's any code which restricts us to
> that. There is code in the planner that will bork when trying to
> create a RangeTblEntry up that high, but as far as I know that won't
> be noticed on the INSERT path.  I don't think this code has any
> business knowing what the special varnos are set to either.  It would
> be better to just remove the limit and suffer the small wasted array
> space.  I understand you've probably coded it like this due to the
> similar code that was in my patch, but with mine I knew the total
> number of partitions. Your patch does not.

OK, I changed it to UINT_MAX.

> Other thoughts on the patch:
> 
> I wonder if it's worth having syscache keep a count on the number of
> sub-partitioned tables a partition has. If there are none in the root
> partition then the partition_dispatch_info can be initialized with
> just 1 element to store the root details. Although, maybe it's not
> worth it to reduce the array size by 7 elements.

Hmm yes.  Allocating space for 8 pointers when we really need 1 is not too
bad, if the alternative is to modify partcache.c.

> Also, I'm a bit confused why you change the comments in
> execPartition.h for PartitionTupleRouting to be inline again. I
> brought those out of line as I thought the complexity of the code
> warranted that. Your inlining them again goes against what all the
> other structs do in that file.

It was out-of-line to begin with but 

Re: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Amit Langote
On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho  wrote:
>>> I wondered if you compared to PG10 or to inheritance-partitioning (parent
>>> with relkind='r' and either trigger or rule or INSERT/UPDATE directly into
>>> child)?
>>
>> Thank you for your reply.
>>
>> I compared to PG11beta2 with a non-partitioned table.
>>
>> The non-partitioned table has 1100 records in one table.
>> The partitioned table has one record in each leaf partition.
>>
> 
> I don't think partitioning should be employed this way even for the
> sake of comparison. If 1100 tuples are inserted into a single table,
> they will probably occupy a few hundred pages, depending upon the size
> of each tuple. In a partitioned table with one tuple per partition,
> they will occupy 1100 pages at least. There are also other space and
> locking overheads to maintaining 1100 tables. I think the right way to
> compare is to have really large data that genuinely requires 1100
> partitions, and then compare performance by putting that data in 1100
> partitions and in an unpartitioned table. Even with that kind of data,
> you will see some difference in performance, but that won't be as
> dramatic as you report.
> 
> I might be missing something though.

Perhaps, Kato-san only intended to report that the time the planner spends
on a partitioned table with 1100 partitions is just too high compared to
the time it spends on a non-partitioned table.  It is, and it has been
clear that that's because the planning time explodes as the number of
partitions increases.

If there's lots of data in it, then the result will look completely
different as you say, because scanning a single partition (of the 1100
total) will take less time than scanning a non-partitioned table
containing 1100 partitions' worth of data.  But the planning time would
still be higher for the partitioned table, which seems to be the point of
this benchmark.
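
For anyone who wants to reproduce the effect locally, a minimal sketch
follows (table and column names are made up); EXPLAIN ANALYZE reports
"Planning Time" separately from "Execution Time", which is enough to see
the planner overhead in isolation:

create table t (partkey int, v text) partition by list (partkey);

-- create 1100 single-value list partitions
do $$
begin
    for i in 1..1100 loop
        execute format(
            'create table t_%s partition of t for values in (%s)', i, i);
    end loop;
end $$;

-- "Planning Time" grows with the number of partitions even though only
-- one partition ends up being scanned; on a non-partitioned table with
-- the same data it stays small.
explain analyze select * from t where partkey = 1;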

Thanks,
Amit




Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-16 Thread Amit Langote
On 2018/07/17 8:17, Alvaro Herrera wrote:
> On 2018-Jul-16, Ashutosh Bapat wrote:
> 
>>> Hmm, let me reword this comment completely.  How about the attached?
> 
>> That looks much better. However it took me a small while to understand
>> that (1), (2) and (3) correspond to strategies.
> 
> You're right.  Amended again and pushed.  I also marked the open item
> closed.
> 
> Thanks everyone

Thank you Ashutosh for further review, and Alvaro for improving it quite
a bit before committing.

Thanks,
Amit





Re: pgsql: Allow UNIQUE indexes on partitioned tables

2018-07-16 Thread Amit Langote
On 2018/07/17 13:57, Alvaro Herrera wrote:
> On 2018-Feb-19, David G. Johnston wrote:
> 
>> As an aside, adding a link to "Data Definition/Table Partitioning" from at
>> least CREATE TABLE ... PARTITION BY; and swapping "PARTITION BY" and
>> "PARTITION OF" in the Parameters section of that page - one must partition
>> by a table before one can partition it (and also the synopsis lists them in
>> the BY before OF order), would be helpful.
> 
> A little late, I have done these two changes.

A belated +1.

Thanks,
Amit




Re: cached plans and enable_partition_pruning

2018-07-23 Thread Amit Langote
On Mon, Jul 23, 2018 at 11:20 PM, Andres Freund  wrote:
> Hi,
>
> On 2018-07-23 18:31:43 +0900, Amit Langote wrote:
>> It seems that because enable_partition_pruning's value is only checked
>> during planning, turning it off *after* a plan is created and cached does
>> not work as expected.

[ ... ]

>> Should we check its value during execution too, as done in the attached?
>
> I think it's correct to check the plan time value, rather than the
> execution time value. Other enable_* GUCs also take effect there, and I
> don't see a problem with that?

Ah, so that may have been intentional.  Although, I wonder if
enable_partition_pruning could be made to work differently from the other
enable_* settings, because pruning is an optimization we *can* still
perform during execution, whereas in the other cases we cannot modify the
plan anymore?

Thanks,
Amit



Re: [report] memory leaks in COPY FROM on partitioned table

2018-07-24 Thread Amit Langote
Hi Kaigai-san,

Thanks for the report and the patch.

On 2018/07/24 11:43, Kohei KaiGai wrote:
> Further investigation I did:
> 
> CopyFrom() calls ExecFindPartition() to identify the destination child
> table of partitioned table.
> Then, it internally calls get_partition_for_tuple() to get partition
> index according to the key value.
> This invocation is not under the per-tuple context.
> 
> In case of hash-partitioning, get_partition_for_tuple() calls the
> hash function of the key data type, which is hash_numeric in my case.
> hash_numeric fetches the key value using PG_GETARG_NUMERIC(0), which
> internally calls pg_detoast_datum(); that may allocate new memory if
> the varlena datum is not in uncompressed long (32-bit) format.
> 
> With this patch applied, the PostgreSQL backend process has been running
> with about 130MB memory consumption for 20min so far (not finished
> yet...)
> Before the patch was applied, its memory consumption grew by about
> 10MB/sec, and the process terminated a few hours later.
> 
> P.S.
> I think do_convert_tuple() in ExecFindPartition() and
> ConvertPartitionTupleSlot() may also allocate memory outside the
> per-tuple context; however, I could not confirm that yet, because my
> test case has TupleConversionMap == NULL.

Your patch takes care of allocation happening inside
get_partition_for_tuple, but as you mention there might be others in its
caller ExecFindPartition.  So, I think we should switch to the per-tuple
context in ExecFindPartition.

When I tried to do that, I discovered that we have to be careful about
releasing some of the memory that's allocated in ExecFindPartition
ourselves, instead of relying on the reset of the per-tuple context to
take care of it.  That's because some of the structures that
ExecFindPartition assigns the allocated memory to are cleaned up at the
end of the query, by which time it's too late to try to release per-tuple
memory.  So, the patch I ended up with is slightly bigger than simply
adding a MemoryContextSwitchTo() call at the beginning of
ExecFindPartition.  Please find it attached.
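
For reference, a setup of the kind the report describes might look like
the sketch below (table, column, and file names are made up); the key
ingredients are a hash-partitioned table and a varlena partition key type
such as numeric, whose hash function may detoast its input:

create table m (k numeric, v text) partition by hash (k);
create table m0 partition of m for values with (modulus 4, remainder 0);
create table m1 partition of m for values with (modulus 4, remainder 1);
create table m2 partition of m for values with (modulus 4, remainder 2);
create table m3 partition of m for values with (modulus 4, remainder 3);

-- Each routed row calls hash_numeric on k; without the fix, any memory
-- that call allocates is not in per-tuple context and accumulates for
-- the duration of the COPY.
copy m from '/path/to/big_file.csv' with (format csv);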

Thanks,
Amit
From 8c42a8dcab7e2f835eff5c001d9be2829301429d Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 24 Jul 2018 15:11:25 +0900
Subject: [PATCH v1] Use per-tuple memory in ExecFindPartition

---
 src/backend/executor/execPartition.c | 38 
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7a4665cc4e..f5d9b1755a 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -196,6 +196,15 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	PartitionDispatch parent;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
+	TupleTableSlot *myslot = NULL;
+	MemoryContext	oldcontext;
+	HeapTuple	tuple = ExecFetchSlotTuple(slot);
+
+	/*
+	 * Anything below that needs to allocate memory should use per-tuple
+	 * memory.
+	 */
+	oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
 
 	/*
 	 * First check the root table's partition constraint, if any.  No point in
@@ -209,7 +218,6 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		PartitionDesc partdesc;
-		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
 		int			cur_index = -1;
 
@@ -220,11 +228,9 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		 * Convert the tuple to this parent's layout so that we can do certain
 		 * things we do below.
 		 */
+		myslot = parent->tupslot;
 		if (myslot != NULL && map != NULL)
 		{
-			HeapTuple	tuple = ExecFetchSlotTuple(slot);
-
-			ExecClearTuple(myslot);
 			tuple = do_convert_tuple(tuple, map);
 			ExecStoreTuple(tuple, myslot, InvalidBuffer, true);
 			slot = myslot;
@@ -269,9 +275,32 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			break;
 		}
 		else
+		{
 			parent = pd[-parent->indexes[cur_index]];
+
+			/*
+			 * If we used the dedicated slot, must call ExecClearTuple now
+			 * to release the tuple contained in it and set its
+			 * tts_shouldFree to false so that nobody attempts to release it
+			 * later.  We have to do that because the tuple uses per-tuple
+			 * memory.
+			 */
+			if (slot == 

Re: code of partition split

2018-07-22 Thread Amit Langote
Hi.

On 2018/07/23 12:08, Chenxi Li wrote:
> Hi hackers,
> 
> I'm interested in partition split and want to figure out how it works.
> Could someone tell me where the code is located? Thanks very much.

Can you tell us what you mean by "partition split"?  I don't know of an
existing feature / command that can be used to split a partition.

Thanks,
Amit




cached plans and enable_partition_pruning

2018-07-23 Thread Amit Langote
It seems that because enable_partition_pruning's value is only checked
during planning, turning it off *after* a plan is created and cached does
not work as expected.

create table p (a int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);

-- force a generic plan so that run-time pruning is used in the plan
reset enable_partition_pruning;
set plan_cache_mode to force_generic_plan;
prepare p as select * from p where a = $1;

explain (costs off, analyze) execute p (1);
                           QUERY PLAN
-----------------------------------------------------------------
 Append (actual time=0.079..0.106 rows=1 loops=1)
   Subplans Removed: 1
   ->  Seq Scan on p2 (actual time=0.058..0.068 rows=1 loops=1)
 Filter: (a = $1)
 Planning Time: 17.573 ms
 Execution Time: 0.396 ms
(6 rows)

set enable_partition_pruning to off;

explain (costs off, analyze) execute p (1);
                           QUERY PLAN
-----------------------------------------------------------------
 Append (actual time=0.108..0.135 rows=1 loops=1)
   Subplans Removed: 1
   ->  Seq Scan on p2 (actual time=0.017..0.028 rows=1 loops=1)
 Filter: (a = $1)
 Planning Time: 0.042 ms
 Execution Time: 0.399 ms
(6 rows)

Pruning still occurs, whereas one would expect it not to, because the plan
(the Append node) contains run-time pruning information, which was
initialized because enable_partition_pruning was turned on when the plan
was created.

Should we check its value during execution too, as done in the attached?

Thanks,
Amit
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 1fc55e90d7..d67abed330 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -61,6 +61,7 @@
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 
 /* Shared state for parallel-aware Append. */
 struct ParallelAppendState
@@ -128,8 +129,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
	/* Let choose_next_subplan_* function handle setting the first subplan */
appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
 
-   /* If run-time partition pruning is enabled, then set that up now */
-   if (node->part_prune_info != NULL)
+   /*
+* If run-time partition pruning is enabled, then set that up now.
+* However, enable_partition_pruning may have been turned off since
+* the plan was created, so recheck its value.
+*/
+   if (node->part_prune_info != NULL && enable_partition_pruning)
{
PartitionPruneState *prunestate;
 
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index c7eb18e546..5252930b17 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -43,6 +43,7 @@
 #include "executor/nodeMergeAppend.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 
 /*
  * We have one slot for each item in the heap array.  We use SlotNumber
@@ -89,8 +90,12 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
mergestate->ps.ExecProcNode = ExecMergeAppend;
mergestate->ms_noopscan = false;
 
-   /* If run-time partition pruning is enabled, then set that up now */
-   if (node->part_prune_info != NULL)
+   /*
+* If run-time partition pruning is enabled, then set that up now.
+* However, enable_partition_pruning may have been turned off since
+* the plan was created, so recheck its value.
+*/
+   if (node->part_prune_info != NULL && enable_partition_pruning)
{
PartitionPruneState *prunestate;
 


Re: partition tree inspection functions

2018-07-19 Thread Amit Langote
Thanks for the review, Jesper.

On 2018/07/18 23:35, Jesper Pedersen wrote:
> On 06/28/2018 01:49 AM, Amit Langote wrote:
>> OK, I've added an example below the table of functions added by the patch.
>>
>> Attached updated patch.
>>
> 
> You forgot to remove the test output in create_table.out, so check-world
> is failing.

Oops, I'd noticed that but forgotten to post an updated patch.

Fixed.

> In pg_partition_parent
> 
> +    else
> +    /* Not a partition, return NULL. */
> +    PG_RETURN_NULL();
> 
> I would just remove the "else" such that PG_RETURN_NULL() is fall-through.

OK, done.

> I think pg_partition_tree_tables should have an option to exclude the
> table that is being queried from the result (bool include_self).

Doesn't sound too bad, so added include_self.

> Maybe a function like pg_partition_number_of_partitions() could be of
> benefit to count the number of actual partitions in a tree. Especially
> useful in complex scenarios,
> 
>  select pg_partition_number_of_partitions('p') as number;
> 
>   number
>  --------
>        4
>  (1 row)

Okay, adding one more function at this point may not be asking for too
much.  Although select count(*) from pg_partition_tree_tables('p') would
give you the count, a special function seems nice.

> New status: WfA

Attached updated patch.

Thanks,
Amit
From 0e24829ac0bc261c79124d8fcd594f9d86655b71 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 16 Jan 2018 19:02:13 +0900
Subject: [PATCH v4] Add assorted partition reporting functions

---
 doc/src/sgml/func.sgml   |  56 ++
 src/backend/catalog/partition.c  | 155 ++-
 src/backend/utils/cache/lsyscache.c  |  22 
 src/include/catalog/partition.h  |   1 +
 src/include/catalog/pg_proc.dat  |  24 +
 src/include/utils/lsyscache.h|   1 +
 src/test/regress/expected/partition_info.out |  88 +++
 src/test/regress/parallel_schedule   |   2 +-
 src/test/regress/serial_schedule |   1 +
 src/test/regress/sql/partition_info.sql  |  40 +++
 10 files changed, 385 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/partition_info.out
 create mode 100644 src/test/regress/sql/partition_info.sql

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index edc9be92a6..33119f2265 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19995,6 +19995,62 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 The function returns the number of new collation objects it created.
 

+   
+Partitioning Information Functions
+
+ 
+  Name Return Type Description
+ 
+
+  
+  
+   pg_partition_parent(regclass)
+   regclass
+   get parent if table is a partition, NULL otherwise
+  
+  
+   pg_partition_root_parent(regclass)
+   regclass
+   get top-most parent of a partition within partition tree
+  
+  
+   pg_partition_tree_tables(regclass, bool)
+   setof regclass
+   get all tables in partition tree with given table as top-most parent
+  
+  
+   pg_partition_tree_number_of_tables(regclass)
+   integer
+   get number of tables in partition tree with given table as top-most parent
+  
+ 
+
+   
+
+   
+If the table passed to pg_partition_root_parent is not
+a partition, the same table is returned as the result.  Result of
+pg_partition_tree_tables also contains the table
+that's passed to it as the first row, unless explicitly requested not
+to be included by passing false for the second
+argument. 
+   
+
+   
+For example, to check the total size of the data contained in
+measurement table described in
+, use the following
+query:
+   
+
+
+select pg_size_pretty(sum(pg_relation_size(p))) as total_size from pg_partition_tree_tables('measurement', true) p;
+ total_size 
+------------
+ 24 kB
+(1 row)
+
+
   
 
   
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 558022647c..20635ce513 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -23,13 +23,16 @@
 #include "catalog/partition.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_partitioned_table.h"
+#include "funcapi.h"
 #include "nodes/makefuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/prep.h"
 #include "optimizer/var.h"
 #include "partitioning/partbounds.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/lsyscache.h"
 #include "utils/partcache.h"
 #include "utils/rel.h"
 #include &

Re: partition tree inspection functions

2018-07-19 Thread Amit Langote
Hi Dilip,

Sorry it took me a while to reply.

On 2018/06/29 14:30, Dilip Kumar wrote:
> On Tue, Jun 26, 2018 at 10:38 AM, Amit Langote wrote:
>> As discussed a little while back [1] and also recently mentioned [2], here
>> is a patch that adds a set of functions to inspect the details of a
>> partition tree.  There are three functions:
>>
>> pg_partition_parent(regclass) returns regclass
>> pg_partition_root_parent(regclass) returns regclass
>> pg_partition_tree_tables(regclass) returns setof regclass
>>
>>
>> select p as relname,
>>        pg_partition_parent(p) as parent,
>>        pg_partition_root_parent(p) as root_parent
>> from   pg_partition_tree_tables('p') p;
>>  relname | parent | root_parent
>> ---------+--------+-------------
>>  p       |        | p
>>  p0      | p      | p
>>  p1      | p      | p
>>  p00     | p0     | p
>>  p01     | p0     | p
>>  p10     | p1     | p
>>  p11     | p1     | p
>> (7 rows)
>>
> 
> Is it a good idea to provide a function or an option which can provide
> partition details in hierarchical order?
> 
> i.e.
> relname  level
> p        0
> p0       1
> p00      2
> p01      2
> p1       1

Yeah, might be a good idea.  We could have a function
pg_partition_tree_level(OID) which returns the level of the table that's
passed to it, as you wrote above: 0 for the root parent, 1 for the root's
immediate partitions, 2 for their partitions, and so on.
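
Until such a function exists, the level can be computed with a recursive
query over pg_inherits; a sketch, assuming a partition tree rooted at a
table named p:

with recursive tree (relid, level) as (
    select 'p'::regclass, 0
    union all
    select i.inhrelid, t.level + 1
    from pg_inherits i
    join tree t on i.inhparent = t.relid
)
select relid as relname, level from tree order by level;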

Thanks,
Amit




Re: documentation about explicit locking

2018-07-18 Thread Amit Langote
On 2018/07/18 18:30, Peter Eisentraut wrote:
> On 06.07.18 04:00, Amit Langote wrote:
>> On 2018/07/05 23:02, Robert Haas wrote:
>>> On Wed, Jul 4, 2018 at 3:09 AM, Amit Langote
>>>  wrote:
>>>> I wonder why we mention on the following page that CREATE COLLATION
>>>> requires SHARE ROW EXCLUSIVE lock
>>>>
>>>> https://www.postgresql.org/docs/devel/static/explicit-locking.html
>>>>
>>>> I know that's the lock taken on the pg_collation catalog, but do we need
>>>> to mention locks taken by a DDL command on the catalogs it affects?  All
>>>> other commands mentioned on the page require specifying the table name
>>>> that the lock will be taken on.
>>>
>>> Yes, that looks odd.
>>
>> OK, here is a patch.
>>
>> I see that it was one of Peter E's commits that added that, so cc'd him.
> 
> The reason this is mentioned is that CREATE COLLATION takes a SHARE ROW
> EXCLUSIVE lock on pg_collation whereas similar CREATE commands only take
> a ROW EXCLUSIVE lock on their catalogs.  (So you can only have one
> CREATE COLLATION running at a time.  The reasons for this are explained
> in pg_collation.c.)  I think mentioning this was requested during patch
> review.

I see.  Although, which lock we take on the system catalog to implement a
particular command seems to be an internal detail.  What's clearly
user-visible in this case is that the CREATE COLLATION command cannot be
used simultaneously by concurrent sessions, so it should be pointed out in
the CREATE COLLATION command's documentation.  On a quick check, it
doesn't seem to be.  So, I have updated my patch to also add a line about
that on the CREATE COLLATION page.  What do you think?

When playing with this, I observed that a less user-friendly error message
is emitted if multiple sessions race to create the same collation.

Session 1:
begin;
create collation collname (...);

Session 2:
create collation collname (...);


Session 1:
commit;

Session 2:
ERROR:  duplicate key value violates unique constraint
"pg_collation_name_enc_nsp_index"
DETAIL:  Key (collname, collencoding, collnamespace)=(collname, 6, 2200)
already exists.

I figured that's because of the order, in CollationCreate, of locking the
catalog and checking the syscache for an existing duplicate.  I think we
should check the syscache for a duplicate *after* we have locked the
catalog, as done in the other attached patch.

Thanks,
Amit
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 24613e3c75..4ec853b77a 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -970,8 +970,7 @@ ERROR:  could not serialize access due to read/write 
dependencies among transact
 
 
 
- Acquired by CREATE COLLATION,
- CREATE TRIGGER, and many forms of
+ Acquired by CREATE TRIGGER, and many forms of
  ALTER TABLE (see ).
 

diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 5bc9af5499..84ff418e6d 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -163,6 +163,11 @@ CREATE COLLATION [ IF NOT EXISTS ] 
name FROM Notes
 
   
+   Note that only one of the concurrent sessions can run
+   CREATE COLLATION at a time.
+  
+
+  
Use DROP COLLATION to remove user-defined collations.
   
 
diff --git a/src/backend/catalog/pg_collation.c b/src/backend/catalog/pg_collation.c
index ce7e5fb5cc..33a5510bf9 100644
--- a/src/backend/catalog/pg_collation.c
+++ b/src/backend/catalog/pg_collation.c
@@ -70,6 +70,9 @@ CollationCreate(const char *collname, Oid collnamespace,
AssertArg(collcollate);
AssertArg(collctype);
 
+   /* open pg_collation; see below about the lock level */
+   rel = heap_open(CollationRelationId, ShareRowExclusiveLock);
+
/*
 * Make sure there is no existing collation of same name & encoding.
 *
@@ -83,9 +86,13 @@ CollationCreate(const char *collname, Oid collnamespace,
  
ObjectIdGetDatum(collnamespace)))
{
if (quiet)
+   {
+   heap_close(rel, NoLock);
return InvalidOid;
+   }
else if (if_not_exists)
{
+   heap_close(rel, NoLock);
ereport(NOTICE,
(errcode(ERRCODE_DUPLICATE_OBJECT),
 collencoding == -1
@@ -105,9 +112,6 @@ CollationCreate(const char *collname, Oid collnamespace,
  collname, 
pg_encoding_to_char(collencoding;
}
 
-   /* open pg_collation; see below about the lock level */
-   rel = heap_open(CollationRelationId, ShareRowExclusiveLock);
-
/*
 * Also forbid a specific-encoding collation shadowing an any-encoding
 * collation, or an any-encoding collation being shadowed (see


Re: Internal error XX000 with enable_partition_pruning=on, pg 11 beta1 on Debian

2018-07-25 Thread Amit Langote
On 2018/07/21 0:17, David Rowley wrote:
> On 20 July 2018 at 21:44, Amit Langote  wrote:
>> But I don't think the result of make_partition_pruneinfo itself has to be
>> List of PartitionedRelPruneInfo nested under PartitionPruneInfo.  I gather
>> that each PartitionPruneInfo corresponds to each root partitioned table
>> and a PartitionedRelPruneInfo contains the actual pruning information,
>> which is created for every partitioned table (non-leaf tables), including
>> the root tables.  I don't think such nesting is necessary.  I think that
>> just like flattened partitioned_rels list, we should put flattened list of
>> PartitionedRelPruneInfo into the Append or MergeAppend plan.  No need for
>> nesting PartitionedRelPruneInfo under PartitionPruneInfo.
> 
> To do that properly you'd need to mark the target partitioned table of
> each hierarchy. Your test of pg_class.relispartition is not going to
> work as you're assuming the query is always going to the root. The
> user might query some intermediate partitioned table (which will have
> relispartition = true). Your patch will fall flat in that case.

Yeah, I forgot to consider that.

> You could work around that by having some array that points to the
> target partitioned table of each hierarchy, but I don't see why that's
> better than having the additional struct.

Or it could be a Bitmapset called root_indexes that stores the offset of
the first Index value in every partitioned_rels list contained in turn in
the list that's passed to make_partition_pruneinfo.

> There's also some code
> inside ExecFindInitialMatchingSubPlans() which does a backward scan
> over the partitions. This must process children before their parents.
> Unsure how well that's going to work if we start mixing the
> hierarchies. I'm sure it can be made to work providing each hierarchy
> is stored together consecutively in the array, but it just seems
> pretty fragile to me. That code is already pretty hard to follow.

I don't see how removing a nested loop changes things for the worse.
AIUI, the code replaces index values contained in the subplan_map arrays
of various PartitionedRelPruningData structs to account for any pruned
sub-plans.  Removing a nesting level because of having removed the nesting
struct doesn't seem to affect anything about that translation.  But your
point here seems to be about the relative ordering of
PartitionedRelPruningData structs among themselves being affected due to
their now being put into a flat array, although I don't see that as being
any more fragile.  We are already assuming a bunch about the relative
ordering of sub-plans and of PartitionedRelPruningData structs by storing
their indexes in subplan_map and subpart_map.  Also, it occurred to me
that the new subplan indexes that ExecFindInitialMatchingSubPlans computes
are based on where subplans are actually stored in the
AppendState.appendplans array, which, in turn, is based on the Bitmapset
of "valid subplans" that ExecFindInitialMatchingSubPlans passes back to
ExecInitAppend.

> What's the reason you don't like the additional level to represent
> multiple hierarchies?

I started thinking about flattening PartitionedRelPruneInfo after looking
at flatten_partitioned_rels() in your patch.  If we're flattening
partitioned_rels (that is, not having it as a List of Lists in the
finished plan), why not flatten the pruning info node too?  As I said
earlier, I get it that we need List of Lists within the planner to get
make_partition_pruneinfo to work correctly in these types of queries, but
once we have figured out the correct details to pass to executor to
perform run-time pruning, I don't see why we need to pass that info again
as a List of Lists.

I have attached v2 of the delta patch, which adds a root_indexes field to
PartitionPruneInfo to track the topmost parent in every partition
hierarchy whose pruning info is contained in the Append.

Also, I noticed a bug in how ExecFindInitialMatchingSubPlans handles
other_subplans.  While the indexes in the subplan_map arrays are updated
to contain revised values after pruning, those in the other_subplans
Bitmapset are not, leading to crashes or possibly wrong results.  For
example:

create table p (a int, b int, c int) partition by list (a);
create table p1 partition of p for values in (1);
create table p2 partition of p for values in (2);
create table q (a int, b int, c int) partition by list (a);
create table q1 partition of q for values in (1) partition by list (b);
create table q11 partition of q1 for values in (1) partition by list (c);
create table q111 partition of q11 for values in (1);
create table q2 partition of q for values in (2) partition by list (b);
create table q21 partition of q2 for values in (1);
create table q22 partition of q2 for values in (2);

prepare q (int, int) as
select *
from (

Re: ToDo: show size of partitioned table

2018-07-25 Thread Amit Langote
Hi Pavel.

On 2018/07/23 20:46, Pavel Stehule wrote:
> Hi
> 
> I am sending a prototype of the patch. For now, it calculates the size of
> partitioned tables with a recursive query. When a simpler method becomes
> possible, the size calculation will be changed.
> 
> postgres=# \dt+
>List of relations
> +++---+---+-+-+
> | Schema |Name| Type  | Owner |  Size   | Description |
> +++---+---+-+-+
> | public | data   | table | pavel | 0 bytes | |
> | public | data_2016  | table | pavel | 15 MB   | |
> | public | data_2017  | table | pavel | 15 MB   | |
> | public | data_other | table | pavel | 11 MB   | |
> +++---+---+-+-+
> (4 rows)
> 
> postgres=# \dP+
>   List of partitioned tables
> ++--+---+---+-+
> | Schema | Name | Owner | Size  | Description |
> ++--+---+---+-+
> | public | data | pavel | 42 MB | |
> ++--+---+---+-+
> (1 row)

This looks nice, although I haven't looked at the patch yet.  Also, as you
said, we could later replace the method of directly querying pg_inherits
by something else.
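
For reference, the recursive query in question boils down to something
like the following sketch, assuming a partitioned table named data:

with recursive tree (relid) as (
    select 'data'::regclass
    union all
    select i.inhrelid
    from pg_inherits i
    join tree t on i.inhparent = t.relid
)
select pg_size_pretty(sum(pg_relation_size(relid))) as size from tree;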

> P.S. Another patch could be to replace the relation type "table" with
> "partitioned table"
> 
> postgres=# \dt+
>  List of relations
> +++---+---+-+-+
> | Schema |Name|   Type| Owner |  Size   | Description |
> +++---+---+-+-+
> | public | data   | partitioned table | pavel | 0 bytes | |
> | public | data_2016  | table | pavel | 15 MB   | |
> | public | data_2017  | table | pavel | 15 MB   | |
> | public | data_other | table | pavel | 11 MB   | |
> +++---+---+-+-+
> (4 rows)
> 
> A patch is simple for this case
> 
> diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
> index c3bdf8555d..491e58eb29 100644
> --- a/src/bin/psql/describe.c
> +++ b/src/bin/psql/describe.c
> @@ -3490,8 +3490,8 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
>   gettext_noop("sequence"),
>   gettext_noop("special"),
>   gettext_noop("foreign table"),
> - gettext_noop("table"),/* partitioned table */
> - gettext_noop("index"),/* partitioned index */
> + gettext_noop("partitioned table"),/* partitioned table */
> + gettext_noop("partitioned index"),/* partitioned index */
>   gettext_noop("Type"),
>   gettext_noop("Owner"));

Inclined to +1 this, too.  Although, one might ask why partitioned tables
are called so only in psql's output, but not in the backend's error
messages, for example, as was discussed in the past.

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Amit Langote
On 2018/07/17 12:14, Ashutosh Bapat wrote:
> On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho  wrote:
>> On 2018/07/17 10:49, Amit Langote wrote:
>>> Perhaps, Kato-san only intended to report that the time the planner
>>> spends on a partitioned table with 1100 partitions is just too high
>>> compared to the time it spends on a non-partitioned table.
>>
>> Yes, it is included for the purposes of this comparison.
>>
>> The purpose of this comparison is to find where the partitioning
>> bottleneck is.  Using the bottleneck as a hint for improvement, I'd like
>> to bring the performance of a partitioned table as close to that of an
>> unpartitioned table as possible.
>>
> 
> That's a good thing, but may not turn out to be realistic. We should
> compare performance where partitioning matters and try to improve in
> those contexts. Else we might improve performance in scenarios which
> are never used.
> 
> In this case, even if we improve the planning time by 100%, it hardly
> matters, since planning time is negligible compared to the execution
> time on the huge data for which partitioning is useful.

While I agree that it's a good idea to tell users to use partitioning only
if the overhead of having the partitioning in the first place is bearable,
especially the planner overhead, this benchmark shows us that even for
what I assume are fairly commonly occurring queries (select from / update
/ delete from partitioned_table where partkey = ?), the planner spends way
too many redundant cycles.  Some amount of that overhead will always
remain, and planning with partitioning will always be a bit slower than
without it, but it's *too* slow right now for a non-trivial number of
partitions.
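
Concretely, the query shapes in question are of this kind (a sketch with
made-up table and column names), where planning alone degrades as the
partition count grows:

prepare s (int) as select * from partitioned_table where partkey = $1;
prepare u (int) as update partitioned_table set val = val where partkey = $1;
prepare d (int) as delete from partitioned_table where partkey = $1;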

Thanks,
Amit




Re: pointless check in RelationBuildPartitionDesc

2018-09-04 Thread Amit Langote
On 2018/09/05 1:50, Alvaro Herrera wrote:
> Proposed patch.  Checking isnull in a elog(ERROR) is important, because
> the column is not marked NOT NULL.  This is not true for other columns
> where we simply do Assert(!isnull).

Looks good.  Thanks for taking care of other sites as well.

@@ -14705,7 +14705,9 @@ ATExecDetachPartition(Relation rel, RangeVar *name)

(void) SysCacheGetAttr(RELOID, tuple, Anum_pg_class_relpartbound,
                       &isnull);
-   Assert(!isnull);
+   if (isnull)
+   elog(ERROR, "null relpartbound for relation %u",
+RelationGetRelid(partRel));

In retrospect, I'm not sure why this piece of code is here at all; maybe
just remove the SysCacheGetAttr and Assert?

Regards,
Amit




Re: pointless check in RelationBuildPartitionDesc

2018-09-04 Thread Amit Langote
On 2018/09/04 21:51, Alvaro Herrera wrote:
> On 2018-Sep-04, Amit Langote wrote:
> 
>> On 2018/09/04 13:08, Alvaro Herrera wrote:
> 
>>> I think it'd be pointless noise.  If we really want to protect against
>>> that, I think we should promote the Assert for relpartbound's isnull
>>> flag into an if test.
>>
>> So that we can elog(ERROR, ...) if isnull is true?  If so, I agree,
>> because that's what should've been done here anyway (for a value read from
>> the disk that is).
> 
> Yes.
> 
> I wonder now what's the value of using an Assert for this, BTW: if those
> are enabled, then we'll crash saying "this column is null and shouldn't!".
> If they are disabled, we'll instead crash (possibly in production)
> trying to decode the field.  What was achieved?

I can see how that the Assert was pointless all along.

> With an elog(ERROR) the reaction is the expected one: the current
> transaction is aborted, but the server continues to run.
> 
>> I think we should check relispartition then too.
> 
> Well, we don't check every single catalog flag in every possible code
> path just to ensure it's sane.  I don't see any reason to do differently
> here.  We rely heavily on the catalog contents to be correct, and
> install defenses against user-induced corruption that would cause us to
> crash; but going much further than that would be excessive IMO.

Okay, sure anyway.

Thanks,
Amit




Re: speeding up planning with partitions

2018-09-04 Thread Amit Langote
On 2018/09/04 22:24, David Rowley wrote:
> On 30 August 2018 at 00:06, Amit Langote  
> wrote:
>> With various overheads gone thanks to 0001 and 0002, locking of all
>> partitions via find_all_inheritors can be seen as the single largest
>> bottleneck, which 0003 tries to address.  I've kept it a separate patch,
>> because I'll need to think a bit more to say that it's actually safe to
>> defer locking to late planning, due mainly to the concern about the change
>> in the order of locking from the current method.  I'm attaching it here,
>> because I also want to show the performance improvement we can expect with
>> it.
> 
> For now, find_all_inheritors() locks the tables in ascending Oid
> order.  This makes sense with inheritance parent tables as it's much
> cheaper to sort on this rather than on something like the table's
> namespace and name.  I see no reason why what we sort on really
> matters, as long as we always sort on the same thing and the key we
> sort on is always unique so that the locking order is well defined.
> 
> For partitioned tables, there's really not much sense in sticking to
> the same lock by Oid order. The order of the PartitionDesc is already
> well defined so I don't see any reason why we can't just perform the
> locking in PartitionDesc order.

The reason we do the locking with find_all_inheritors for regular queries
(planner (expand_inherited_rtentry) does it for select/update/delete and
executor (ExecSetupPartitionTupleRouting) for insert) is that that's the
order used by various DDL commands when locking partitions, such as when
adding a column.  So, we'd have one piece of code locking partitions by
Oid order and others by canonical partition bound order or PartitionDesc
order.  I'm no longer sure if that's problematic though, about which more
below.

> This would mean you could perform the
> locking of the partitions once pruning is complete somewhere around
> add_rel_partitions_to_query(). Also, doing this would remove the need
> for scanning pg_inherits during find_all_inheritors() and would likely
> further speed up the planning of queries to partitioned tables with
> many partitions.

That's what happens with 0003; note the following hunk in it:

@@ -1555,14 +1555,15 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 lockmode = AccessShareLock;

 /* Scan for all members of inheritance set, acquire needed locks */
-inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);
+if (rte->relkind != RELKIND_PARTITIONED_TABLE)
+inhOIDs = find_all_inheritors(parentOID, lockmode, NULL);

> I wrote a function named
> get_partition_descendants_worker() to do this in patch 0002 in [1] (it
> may have been a bad choice to have made this a separate function
> rather than just part of find_all_inheritors() as it meant I had to
> change a bit too much code in tablecmds.c). There might be something
> you can salvage from that patch to help here.
> 
> [1] 
> https://www.postgresql.org/message-id/CAKJS1f9QjUwQrio20Pi%3DyCHmnouf4z3SfN8sqXaAcwREG6k0zQ%40mail.gmail.com

Actually, I had written a similar patch to replace the usages of
find_all_inheritors and find_inheritance_children with different
partitioning-specific functions which would collect the partition OIDs
from the already-open parent table's PartitionDesc, more or less like the
patch you mention does.  But I think we don't need any new function(s) to
do that; that is, we don't need to change all the sites that call
find_all_inheritors or find_inheritance_children in favor of new functions
that return partition OIDs in PartitionDesc order just *because* we want
to change the planner to lock partitions in PartitionDesc order.  I'm
failing to see why the difference in locking order matters.  I understood
the concern to be that locking partitions in a different order could lead
to a deadlock if concurrent backends request mutually conflicting locks,
but only one of the conflicting backends, the one that got the lock on the
parent, would be allowed to lock children.  Thoughts on that?

Thanks,
Amit




Re: speeding up planning with partitions

2018-09-10 Thread Amit Langote
On 2018/09/11 10:11, Peter Geoghegan wrote:
> On Wed, Aug 29, 2018 at 5:06 AM, Amit Langote
>  wrote:
>> It is more or less well known that the planner doesn't perform well with
>> more than a few hundred partitions even when only a handful of partitions
>> are ultimately included in the plan.  Situation has improved a bit in PG
>> 11 where we replaced the older method of pruning partitions one-by-one
>> using constraint exclusion with a much faster method that finds relevant
>> partitions by using partitioning metadata.  However, we could only use it
>> for SELECT queries, because UPDATE/DELETE are handled by a completely
>> different code path, whose structure doesn't allow it to call the new
>> pruning module's functionality.  Actually, not being able to use the new
>> pruning is not the only problem for UPDATE/DELETE, more on which further
>> below.
> 
> This was a big problem for the SQL MERGE patch. I hope that this
> problem can be fixed.

Sorry if I've missed some prior discussion about this, but could you
clarify which aspect of UPDATE/DELETE planning got in the way of the SQL
MERGE patch?  It'd be great if you could point to an email or portion of
the SQL MERGE discussion thread where this aspect was discussed.

In the updated patch that I'm still hacking on (also need to look at
David's comments earlier today before posting it), I have managed to
eliminate the separation of code paths handling SELECT vs UPDATE/DELETE on
inheritance tables, but I can't be sure if the new approach (also) solves
any problems that were faced during SQL MERGE development.

Thanks,
Amit




Re: speeding up planning with partitions

2018-08-29 Thread Amit Langote
On 2018/08/30 10:09, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> I measured the gain in performance due to each patch on a modest virtual
>> machine.  Details of the measurement and results follow.
> 
> Amazing!

Thanks.

> UPDATE
>> nparts  master  0001  0002  0003
>> ======  ======  ====  ====  ====
>> 0         2856  2893  2862  2816
>> 8          507  1115  1447  1872
> 
> SELECT
>> nparts  master  0001  0002  0003
>> ======  ======  ====  ====  ====
>> 0         2290  2329  2319  2268
>> 8         1058  1077  1414  1788
> 
> Even a small number of partitions still introduces a not-small overhead 
> (UPDATE:34%, SELECT:22%).

Yeah, that's true.

> Do you think this overhead can be reduced further?

We can definitely try, but I'm not immediately sure if the further
improvements will come from continuing to fix the planner.  Maybe, the
overhead of partitioning could be attributed to other parts of the system.

> What part do you guess would be relevant?  This one?
> 
>> that it is now performed after pruning. However, it doesn't do anything
>> about the fact that partitions are all still locked in the earlier phase.

Actually, I wrote that for patch 0002.  The next patch (0003) is meant to
fix that.  So, the overhead you're seeing is even after making sure that
only the selected partitions are locked.

Thanks,
Amit




Re: speeding up planning with partitions

2018-08-29 Thread Amit Langote
On 2018/08/30 7:27, David Rowley wrote:
> On 30 August 2018 at 00:06, Amit Langote  
> wrote:
>> nparts  master  0001  0002  0003
>> ======  ======  ====  ====  ====
>> 0         2856  2893  2862  2816
>> 8          507  1115  1447  1872
>> 16         260   765  1173  1892
>> 32         119   483   922  1884
>> 64          59   282   615  1881
>> 128         29   153   378  1835
>> 256         14    79   210  1803
>> 512          5    40   113  1728
>> 1024         2    17    57  1616
>> 2048         0*    9    30  1471
>> 4096         0+    4    15  1236
>> 8192         0=    2     7   975
> 
> Those look promising. Are you going to submit these patches to the
> September 'fest?

Thanks, I just did.

https://commitfest.postgresql.org/19/1778/

Regards,
Amit




Re: speeding up planning with partitions

2018-08-30 Thread Amit Langote
On 2018/08/29 21:06, Amit Langote wrote:
> I measured the gain in performance due to each patch on a modest virtual
> machine.  Details of the measurement and results follow.
>
> UPDATE:
>
> nparts  master  0001  0002  0003
> ======  ======  ====  ====  ====
> 0         2856  2893  2862  2816
> 8          507  1115  1447  1872
> 16         260   765  1173  1892
> 32         119   483   922  1884
> 64          59   282   615  1881
> 128         29   153   378  1835
> 256         14    79   210  1803
> 512          5    40   113  1728
> 1024         2    17    57  1616
> 2048         0*    9    30  1471
> 4096         0+    4    15  1236
> 8192         0=    2     7   975
>
> For SELECT:
>
> nparts  master  0001  0002  0003
> ======  ======  ====  ====  ====
> 0         2290  2329  2319  2268
> 8         1058  1077  1414  1788
> 16         711   729  1124  1789
> 32         450   475   879  1773
> 64         265   272   603  1765
> 128        146   149   371  1685
> 256         76    77   214  1678
> 512         39    39   112  1636
> 1024        16    17    59  1525
> 2048         8     9    29  1416
> 4096         4     4    15  1195
> 8192         2     2     7   932

Prompted by Tsunakawa-san's comment, I looked at the profiles when running
the benchmark with partitioning and noticed a few things that made clear
why, even with 0003 applied, tps numbers decreased as the number of
partitions increased.  Some functions that appeared high up in the
profiles were related to partitioning:

* set_relation_partition_info calling partition_bounds_copy(), which calls
  datumCopy() on N Datums, where N is the number of partitions.  The more
  partitions there are, the higher up it is in the profiles.  I suspect
  that this copying might be redundant; the planner can keep using the
  same pointer as the relcache

There are a few existing and newly introduced sites in the planner where
the code iterates over *all* partitions of a table where processing just
the partition selected for scanning would suffice.  I observed the
following functions in profiles:

* make_partitionedrel_pruneinfo, which goes over all partitions to
  generate subplan_map and subpart_map arrays to put into the
  PartitionedRelPruneInfo data structure that it's in the charge of
  generating

* apply_scanjoin_target_to_paths, which goes over all partitions to adjust
  their Paths for applying required scanjoin target, although most of
  those are dummy ones that won't need the adjustment

* For UPDATE, a couple of functions I introduced in patch 0001 were doing
  the same thing as apply_scanjoin_target_to_paths, which is unnecessary

To fix the above three instances of redundant processing, I added a
Bitmapset 'live_parts' to the RelOptInfo which stores the set of indexes
of only the unpruned partitions (into the RelOptInfo.part_rels array) and
replaced the for (i = 0; i < rel->nparts; i++) loops in those sites with
the loop that iterates over the members of 'live_parts'.

Results looked were promising indeed, especially after applying 0003 which
gets rid of locking all partitions.

UPDATE:

nparts  master  0001  0002  0003
======  ======  ====  ====  ====
0         2856  2893  2862  2816
8          507  1115  1466  1845
16         260   765  1161  1876
32         119   483   910  1862
64          59   282   609  1895
128         29   153   376  1884
256         14    79   212  1874
512          5    40   115  1859
1024         2    17    58  1847
2048         0     9    29  1883
4096         0     4    15  1867
8192         0     2     7  1826

SELECT:

nparts  master  0001  0002  0003
======  ======  ====  ====  ====
0         2290  2329  2319  2268
8         1058  1077  1431  1800
16         711   729  1158  1781
32         450   475   908  1777
64         265   272   612  1791
128        146   149   379  1777
256         76    77   213  1785
512         39    39   114  1776
1024        16    17    59  1756
2048         8     9    30  1746
4096         4     4    15  1722
8192         2     2     7  1706

Note that with 0003, tps doesn't degrade as the number of partitions increases.

Attached updated patches, with 0002 containing the changes mentioned above.

Thanks,
Amit
From 060bd2445ea9cba9adadd73505689d6f06583ee8 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 24 Aug 2018 12:39:36 +0900
Subject: [PATCH v2 1/3] Overhaul partitioned table update/delete planning

The current method, inheritance_planner, applies grouping_planner and
hence query_planner to the query repeatedly, with each leaf partition
replacing the root parent as the query's result relation. One big
drawback of this approach

Re: Server crashed with dense_rank on partition table.

2018-07-04 Thread Amit Langote
On 2018/07/05 9:40, Andres Freund wrote:
> On 2018-07-02 17:14:14 +0900, Amit Langote wrote:
>> I studied this a bit and found a bug that's causing the crash.
>>
>> The above mentioned commit has this hunk:
>>
>> @@ -1309,6 +1311,9 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
>>  		PG_RETURN_INT64(rank);
>>
>>  	osastate = (OSAPerGroupState *) PG_GETARG_POINTER(0);
>> +	econtext = osastate->qstate->econtext;
>> +	if (!econtext)
>> +		osastate->qstate->econtext = econtext =
>> +			CreateStandaloneExprContext();
>>
>> In CreateStandaloneExprContext(), we have this:
>>
>> 	econtext->ecxt_per_query_memory = CurrentMemoryContext;
>>
>> 	/*
>> 	 * Create working memory for expression evaluation in this context.
>> 	 */
>> 	econtext->ecxt_per_tuple_memory =
>> 		AllocSetContextCreate(CurrentMemoryContext,
>> 							  "ExprContext",
>> 							  ALLOCSET_DEFAULT_SIZES);
>>
>> I noticed when debugging the crashing query that CurrentMemoryContext is
>> actually the per-tuple memory context of some expression context of the
>> calling code, which would get reset before getting here again.  So, it's
>> wrong of hypothetical_dense_rank_final to call CreateStandaloneExprContext
>> without first switching to an actual per-query context.
>>
>> Attached patch seems to fix the crash.
> 
> Thanks, that looks correct. Pushed!

Thank you.

Regards,
Amit
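
For reference, a minimal sketch of the shape of that fix, assuming the
per-query memory context is reachable through a field here called
osastate->qstate->qcontext (the exact field name is an assumption):

	econtext = osastate->qstate->econtext;
	if (!econtext)
	{
		MemoryContext oldcontext;

		/*
		 * Create the ExprContext in per-query memory, not in the caller's
		 * per-tuple context, which will have been reset by the time we
		 * are called again.
		 */
		oldcontext = MemoryContextSwitchTo(osastate->qstate->qcontext);
		osastate->qstate->econtext = econtext = CreateStandaloneExprContext();
		MemoryContextSwitchTo(oldcontext);
	}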




documentation about explicit locking

2018-07-04 Thread Amit Langote
Hi.

I wonder why we mention on the following page that CREATE COLLATION
requires SHARE ROW EXCLUSIVE lock

https://www.postgresql.org/docs/devel/static/explicit-locking.html

I know that's the lock taken on the pg_collation catalog, but do we need
to mention locks taken by a DDL command on the catalogs it affects?  All
other commands mentioned on the page require specifying the table name
that the lock will be taken on.

Thanks,
Amit




Re: why partition pruning doesn't work?

2018-07-04 Thread Amit Langote
On 2018/06/19 2:05, Tom Lane wrote:
> Amit Langote  writes:
>> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ]
> 
> I took a look at this.  While I'm in agreement with the general idea of
> holding open the partitioned relations' relcache entries throughout the
> query, I do not like anything about this patch:

Thanks for taking a look at it and sorry about the delay in replying.

> * It seems to be outright broken for the case that any of the partitioned
> relations are listed in nonleafResultRelations.  If we're going to do it
> like this, we have to open every one of the partrels regardless of that.

Yes, that was indeed wrong.

> (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations
> altogether, in favor of merging the related code in InitPlan with this.

Hmm, PlannedStmt.nonleafResultRelations exists for the same reason that
PlannedStmt.resultRelations does, that is,

/*
 * initialize result relation stuff, and open/lock the result rels.
 *
 * We must do this before initializing the plan tree, else we might try to
 * do a lock upgrade if a result rel is also a source rel.
 */

nonleafResultRelations contains members of partitioned_rels lists of all
ModifyTable nodes contained in a plan.

> That existing code is already a mighty ugly wart, and this patch makes
> it worse by adding new, related warts elsewhere.)

I just realized that there is a thinko in the following piece of code in
ExecLockNonLeafAppendTables

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
if (rti == lfirst_int(l))
{
is_result_rel = true;
break;
}
}

It should actually be:

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
Index   nonleaf_rti = lfirst_int(l);
Oid nonleaf_relid = getrelid(nonleaf_rti,
 estate->es_range_table);

if (relid == nonleaf_relid)
{
is_result_rel = true;
break;
}
}

RT indexes in, say, Append.partitioned_rels, are distinct from those in
PlannedStmt.nonleafResultRelations, so the existing test never succeeds,
as also evident from the coverage report:

https://coverage.postgresql.org/src/backend/executor/execUtils.c.gcov.html#864


I'm wondering if we couldn't just get rid of this code.  If an input
partitioned table is indeed also a result relation, then we would've
locked it in InitPlan with RowExclusiveLock, and heap_open'ing it with a
weaker lock (RowShare/AccessShare) wouldn't hurt.

> * You've got *far* too much intimate knowledge of the possible callers
> in ExecOpenAppendPartitionedTables.
> 
> Personally, what I would have this function do is return a List of
> the opened Relation pointers, and add a matching function to run through
> such a List and close the entries again, and make the callers responsible
> for stashing the List pointer in an appropriate field in their planstate.

OK, I rewrote it to work that way.

> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
have to have all the partitioned tables (contained in partitioned_rels
fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
in a global list like rootResultRelations and nonleafResultRelations of
PlannedStmt.

Attached updated patch.

Thanks,
Amit
From 0fd93b0b2108d4d12f483b96aab2c25c120173cb Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 14 Jun 2018 14:50:22 +0900
Subject: [PATCH v2] Fix opening/closing of partitioned tables in Append plans

---
 src/backend/executor/execPartition.c   |  49 --
 src/backend/executor/execUtils.c   | 114 +
 src/backend/executor/nodeAppend.c  |  26 
 src/backend/executor/nodeMergeAppend.c |  11 ++--
 src/include/executor/execPartition.h   |   4 +-
 src/include/executor/executor.h|   4 +-
 src/include/nodes/execnodes.h  |   4 ++
 7 files changed, 100 insertions(+), 112 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7a4665cc4e..b9bc157bfa 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1399,13 +1399,19 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruningData for each item in that List.  This data can be re-used
  * each time we re-evaluate which partitions match the pruning steps provided
  * in each PartitionPruneInfo.
+ *
+ * 'partitioned_rels' is a List of same number of elements as there are in
+ * 'partitionprune

Re: documentation fixes for partition pruning, round three

2018-07-05 Thread Amit Langote
On 2018/07/06 6:55, David Rowley wrote:
> On 6 July 2018 at 09:41, Alvaro Herrera  wrote:
>> On 2018-Jul-05, Peter Eisentraut wrote:
>>> Committed.
>>
>> Thanks for handling this.
>>
>> Should we do this in REL_11_STABLE too?  I vote yes.
> 
> Sorry for not paying much attention to this, but I've read through
> what's been committed and I also think PG11 deserves this too.

+1

Thanks Justin and Peter.

Regards,
Amit




Re: Global shared meta cache

2018-07-05 Thread Amit Langote
On 2018/07/05 23:00, Robert Haas wrote:
> With respect to partitioning specifically, it seems like we might be
> able to come up with some way of planning that doesn't need a full
> relcache entry for every partition, particularly if there are no
> partition-local objects (indexes, triggers, etc.).

We won't know that there are no partition-local objects until we open them
though, right?  As you said, there might be a way to refactor things such
that just knowing that there are no partition-local objects becomes
cheaper than doing a full-fledged RelationBuildDesc.

Thanks,
Amit




Re: using expression syntax for partition bounds

2018-07-05 Thread Amit Langote
Horiguchi-san,

On 2018/07/06 14:26, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> cf-bot complained on this but I wondered why it got so many
> errors. At the first look I found a spare semicolon before a bare
> then clause :p
> 
>> -if (!IsA(value, Const));
>> +if (!IsA(value, Const))
>>  elog(ERROR, "could not evaluate partition bound expression");
> 
> The attached is the fixed and rebased version.

Thanks for taking care of that.

Regards,
Amit




Re: documentation about explicit locking

2018-07-05 Thread Amit Langote
On 2018/07/05 23:02, Robert Haas wrote:
> On Wed, Jul 4, 2018 at 3:09 AM, Amit Langote
>  wrote:
>> I wonder why we mention on the following page that CREATE COLLATION
>> requires SHARE ROW EXCLUSIVE lock
>>
>> https://www.postgresql.org/docs/devel/static/explicit-locking.html
>>
>> I know that's the lock taken on the pg_collation catalog, but do we need
>> to mention locks taken by a DDL command on the catalogs it affects?  All
>> other commands mentioned on the page require to specify the table name
>> that the lock will be taken on.
> 
> Yes, that looks odd.

OK, here is a patch.

I see that it was one of Peter E's commits that added that, so cc'd him.

Thanks,
Amit
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
index 24613e3c75..4ec853b77a 100644
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -970,8 +970,7 @@ ERROR:  could not serialize access due to read/write dependencies among transactions
 
 
 
- Acquired by CREATE COLLATION,
- CREATE TRIGGER, and many forms of
+ Acquired by CREATE TRIGGER, and many forms of
  ALTER TABLE (see ).
 



Re: no partition pruning when partitioning using array type

2018-07-08 Thread Amit Langote
On 2018/07/07 0:13, Andrew Dunstan wrote:
> 
> 
> On 05/08/2018 02:18 AM, Amit Langote wrote:
>> On 2018/03/01 17:16, Amit Langote wrote:
>>> Added this to CF (actually moved to the September one after first having
>>> added it to the CF that is just getting started).
>>>
>>> It seems to me that we don't need to go with my originally proposed
>>> approach to teach predtest.c to strip RelabelType nodes.  Apparently, it's
>>> only partition.c that's adding the RelabelType node around partition key
>>> nodes in such cases.  That is, in the case of having an array, record,
>>> enum, and range type as the partition key.  No other part of the system
>>> ends up adding one as far as I can see.  Parser's make_op(), for example,
>>> that is used to generate an OpExpr for a qual involving the partition key,
>>> won't put RelabelType around the partition key node if the operator in
>>> question has "pseudo"-types listed as declared types of its left and right
>>> arguments.
>>>
>>> So I revised the patch to drop all the predtest.c changes and instead
>>> modify get_partition_operator() to avoid generating RelabelType in such
>>> cases.  Please find it attached.
>
>> I would like to revisit this as a bug fix for get_partition_operator() to
>> be applied to both PG 10 and HEAD.  In the former case, it fixes the bug
>> that constraint exclusion code will fail to prune correctly when partition
>> key is of array, enum, range, or record type due to the structural
>> mismatch between the OpExpr that partitioning code generates and one that
>> the parser generates for WHERE clauses involving partition key columns.
>>
>> In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a
>> different piece of code anyway, the patch only serves to improve the
>> deparse output emitted by ruleutils.c for partition constraint expressions
>> where pseudo-type partition key is involved.  The change can be seen in
>> the updated test output for create_table test.
>>
>> Attached are patches for PG 10 and HEAD, which implement a slightly
>> different approach to fixing this.  Now, instead of passing the partition
>> key's type as lefttype and righttype to look up the operator, the operator
>> class declared type is passed.  Also, we now decide whether to create a
>> RelabelType node on top based on whether the partition key's type is
>> different from the selected operator's input type, with exception for
>> pseudo-type types.
>>
>> Thanks,
>> Amit
>>
>> [1]
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e5dcbb88a15d
> 
> Since this email we have branched off REL_11_STABLE. Do we want to
> consider this as a bug fix for Release 11? If so, should we add it to the
> open items list?

The code being fixed with the latest patch is not new in PG 11, it's
rather PG 10 code.  That code affects how pruning works in PG 10 (used to
affect PG 11 until we rewrote the partition pruning code).  I think it's a
good idea to apply this patch to all branches, as the code is not specific
to partition pruning and has its benefits even for HEAD and PG 11.

For PG 11 and HEAD, the benefit of this patch can be thought of as just
cosmetic, because it only affects the ruleutils.c's deparse output for
partition constraint expressions when showing it in the psql output for
example.

In PG 10, it directly affects the planner behavior whereby having a
RelabelType redundantly in a partition constraint expression limits the
planner's ability to do partition pruning with it.

Thanks,
Amit




Re: no partition pruning when partitioning using array type

2018-07-08 Thread Amit Langote
Thanks for taking a look.

On 2018/07/07 9:19, Alvaro Herrera wrote:
> On 2018-May-08, Amit Langote wrote:
> 
>> I would like to revisit this as a bug fix for get_partition_operator() to
>> be applied to both PG 10 and HEAD.  In the former case, it fixes the bug
>> that constraint exclusion code will fail to prune correctly when partition
>> key is of array, enum, range, or record type due to the structural
>> mismatch between the OpExpr that partitioning code generates and one that
>> the parser generates for WHERE clauses involving partition key columns.
> 
> Interesting patchset.  Didn't read your previous v2, v3 versions; I only
> checked your latest, v1 (???).

Sorry, I think I messed up version numbering there.

> I'm wondering about the choice of OIDs in the new test.  I wonder if
> it's possible to get ANYNONARRAY (or others) by way of having a
> polymorphic function that passes its polymorphic argument in a qual.  I
> suppose it won't do anything in v10, or will it?  Worth checking :-)
> Why not use IsPolymorphicType?

Hmm, so the IsPolymorphicType() test covers all of these pseudo-types except
RECORDOID.  I rewrote the patch to use IsPolymorphicType.

I'm not able to think of a case where the partition constraint expression
would have to contain ANYNONARRAY though.

> Also, I think it'd be good to have tests
> for all these cases (even in v10), just to make sure we don't break it
> going forward.

So, I had proposed this patch in last December, because partition pruning
using constraint exclusion was broken for these types and still is in PG
10.  I have added the tests back in the patch for PG 10 to test that
partition pruning (using constraint exclusion) works for these cases.  For
PG 11 and HEAD, we took care of that in e5dcbb88a15 (Rework code to
determine partition pruning procedure), so there does not appear to be any
need to add tests for pruning there.

> At least the array case is clearly broken today ...
> A test for the RECORDOID case would be particularly welcome, since it's
> somehow different from the other cases.  (I didn't understand why did
> you remove the test in the latest version.)

I have added the tests in the patch for PG 10.

I have marked both patches as v5.  One patch is for PG 10 and the other
applies unchanged to both HEAD and PG 11 branches.

Thanks,
Amit
From 1ec0c064adc34c12ae5615f0f2bca27a9c61c30f Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 8 May 2018 14:31:37 +0900
Subject: [PATCH v4] Fix how get_partition_operator looks up the operator

Instead of looking up using the underlying partition key's type as the
operator's lefttype and righttype, use the partition operator class
declared input type, which works reliably even in the cases where
pseudo-types are involved.  Also, make its decision whether a
RelabelType is needed depend on a different criterion than what's
currently there.  That is, only add a RelabelType if the partition
key's type is different from the selected operator's input types.
---
 src/backend/catalog/partition.c| 47 ++
 src/test/regress/expected/create_table.out |  2 +-
 src/test/regress/expected/inherit.out  | 98 ++
 src/test/regress/sql/inherit.sql   | 43 +
 4 files changed, 161 insertions(+), 29 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index 17704f36b9..1f50b29a3f 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1176,40 +1176,31 @@ get_partition_operator(PartitionKey key, int col, StrategyNumber strategy,
 	Oid			operoid;
 
 	/*
-	 * First check if there exists an operator of the given strategy, with
-	 * this column's type as both its lefttype and righttype, in the
-	 * partitioning operator family specified for the column.
+	 * Get the operator in the partitioning operator family using the operator
+	 * class declared input type as both its lefttype and righttype.
 	 */
 	operoid = get_opfamily_member(key->partopfamily[col],
-								  key->parttypid[col],
-								  key->parttypid[col],
+								  key->partopcintype[col],
+								  key->partopcintype[col],
 								  strategy);
+	if (!OidIsValid(operoid))
+		elog(ERROR, "missing operator %d(%u,%u) in partition opfamily %u",
+			 strategy, key->partopcintype[col], key->partopcintype[col],
+			 key->partopfamily[col]);
 
 	/*
-	 * If one doesn't exist, we must resort to using an ope

Re: no partition pruning when partitioning using array type

2018-07-10 Thread Amit Langote
On 2018/07/11 3:18, Alvaro Herrera wrote:
> On 2018-May-08, Amit Langote wrote:
> 
>> In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a
>> different piece of code anyway, the patch only serves to improve the
>> deparse output emitted by ruleutils.c for partition constraint expressions
>> where pseudo-type partition key is involved.  The change can be seen in
>> the updated test output for create_table test.
> 
> Actually, even in 11/master it also fixes this case:
> 
> alvherre=# explain update p set a = a || a where a = '{1}';
> QUERY PLAN
> ──
>  Update on p  (cost=0.00..54.03 rows=14 width=38)
>Update on p1
>Update on p2
>->  Seq Scan on p1  (cost=0.00..27.02 rows=7 width=38)
>  Filter: (a = '{1}'::integer[])
>->  Seq Scan on p2  (cost=0.00..27.02 rows=7 width=38)
>  Filter: (a = '{1}'::integer[])
> (7 filas)
> 
> Because UPDATE uses the predtest.c prune code, not partprune.  So it's
> not just some ruleutils beautification.

That's true.  Shame I totally missed that.

Thanks,
Amit




Re: no partition pruning when partitioning using array type

2018-07-10 Thread Amit Langote
On 2018/07/11 4:48, Alvaro Herrera wrote:
> On 2018-Jul-10, Alvaro Herrera wrote:
> 
>> alvherre=# explain update p set a = a || a where a = '{1}';
>> QUERY PLAN
>> ──
>>  Update on p  (cost=0.00..54.03 rows=14 width=38)
>>Update on p1
>>Update on p2
>>->  Seq Scan on p1  (cost=0.00..27.02 rows=7 width=38)
>>  Filter: (a = '{1}'::integer[])
>>->  Seq Scan on p2  (cost=0.00..27.02 rows=7 width=38)
>>  Filter: (a = '{1}'::integer[])
>> (7 filas)
>>
>> Because UPDATE uses the predtest.c prune code, not partprune.  So it's
>> not just some ruleutils beautification.
> 
> I added this test, modified some comments, and pushed.
> 
> Thanks for the patch.

Thank you for committing.

Regards,
Amit




Re: no partition pruning when partitioning using array type

2018-07-10 Thread Amit Langote
On 2018/07/11 4:50, Alvaro Herrera wrote:
> On 2018-Jul-10, Tom Lane wrote:
> 
>> And what about those partition bound values?  They are now illegal
>> for the domain, so I would expect a dump/reload to fail, regardless
>> of whether there are any values in the table.
> 
> Hmm, true.

There is a patch to overhaul how partition bound values are parsed:

https://commitfest.postgresql.org/18/1620/

With that patch, the error you reported upthread goes away (that is, one
can successfully create partitions), but the problem that Tom mentioned
above then appears.

What's the solution here then?  Prevent domains as partition key?

Thanks,
Amit




Re: no partition pruning when partitioning using array type

2018-07-10 Thread Amit Langote
On 2018/07/11 13:12, Alvaro Herrera wrote:
> On 2018-Jul-11, Amit Langote wrote:
> 
>> What's the solution here then?  Prevent domains as partition key?
> 
> Maybe if a domain is used in a partition key somewhere, prevent
> constraints from being added?

Maybe, but I guess you mean only prevent adding such a constraint
after-the-fact.

If a domain has a constraint before creating any partitions based on the
domain, then the partition creation command would check that the partition
bound values satisfy those constraints.

> Another thing worth considering: are you prevented from dropping a
> domain that's used in a partition key?  If not, you get an ugly message
> when you later try to drop the table.

Yeah, I was about to write a message about that.  I think we need to teach
RemoveAttributeById, which dependency.c calls when dropping the
referencing domain with cascade option, to abort if the attribute passed
to it belongs to the partition key of the input relation.

I tried that in the attached, but I'm not sure about the order of messages
that appear in the output of DROP DOMAIN .. CASCADE.  It contains a NOTICE
message followed by an ERROR message.

Thanks,
Amit
From aca92efe08a45c7645720bf7c47ee5ca221f0a80 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 11 Jul 2018 10:26:55 +0900
Subject: [PATCH v1] Prevent RemoveAttributeById from dropping partition key

dependency.c is currently able to drop whatever is referencing
the partition key column, because RemoveAttributeById lacks checks
to see if it's actually dropping a partition key.
---
 src/backend/catalog/heap.c | 29 +
 src/test/regress/expected/create_table.out | 10 ++
 src/test/regress/sql/create_table.sql  |  8 
 3 files changed, 47 insertions(+)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d223ba8537..b69b9d9436 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1585,6 +1585,35 @@ RemoveAttributeById(Oid relid, AttrNumber attnum)
 	else
 	{
 		/* Dropping user attributes is lots harder */
+		bool		is_expr;
+
+		/*
+		 * Prevent dropping partition key attribute, making whatever that's
+		 * trying to do this fail.
+		 */
+		if (has_partition_attrs(rel,
+								bms_make_singleton(attnum - FirstLowInvalidHeapAttributeNumber),
+								&is_expr))
+		{
+			if (!is_expr)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+						 errmsg("cannot drop column \"%s\" of table \"%s\"",
+								NameStr(attStruct->attname),
+								RelationGetRelationName(rel)),
+						 errdetail("Column \"%s\" is named in the partition key of \"%s\"",
+								   NameStr(attStruct->attname),
+								   RelationGetRelationName(rel))));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+						 errmsg("cannot drop column \"%s\" of table \"%s\"",
+								NameStr(attStruct->attname),
+								RelationGetRelationName(rel)),
+						 errdetail("Column \"%s\" is referenced in the partition key expression of \"%s\"",
+								   NameStr(attStruct->attname),
+								   RelationGetRelationName(rel))));
+		}
 
 		/* Mark the attribute as dropped */
 		attStruct->attisdropped = true;
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index 8fdbca1345..ba32d781d1 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -909,3 +909,13 @@ ERROR:  cannot create a temporary relation as partition of permanent relation "perm_parted"
 create temp table temp_part partition of temp_parted default; -- ok
 drop table perm_parted cascade;
 drop table temp_parted cascade;
+-- test domain partition key behavior
+create domain ct_overint as int;
+create table ct_domainpartkey (a ct_overint) partition by 

Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows.
>> I
>> had intended to fix that in PG 11, but we could only manage to get part
>> of
>> that work into PG 11, with significant help from David and others.  So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads.  In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
> 
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed of
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that, because constraint
exclusion has to look at each partition individually.

The UPDATE/DELETE planning for partitioning using huge amounts of memory
and CPU is a pretty big issue, and refactoring the planner to avoid that
may be the hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires locking and
>> opening *all* partitions.  We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
> 
> I remember the issue of opening and locking all partitions was discussed 
> before.

Quite a few times I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Amit Langote
Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
> 
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no 
> sub-partitioned tables.

Thanks for sharing the results.

> But, statement latencies on a partitioned table are much slower than on a
> non-partitioned table.
> 
> UPDATE latency is 210 times slower than a non-partitioned table.
> SELECT latency is 36 times slower than a non-partitioned table.
> Surprisingly INSERT latency is almost same.

Yes, INSERT comes out ahead because there is no overhead of partitioning
in the planning phase.  As David Rowley reported on the nearby thread
("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still
significant overhead during its execution, so we're still a fair bit
away from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows.  I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others.  So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads.  In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.  The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires locking and
opening *all* partitions.  We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.

Thanks,
Amit




Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-13 Thread Amit Langote
Thanks for the review.

On 2018/07/12 22:01, Ashutosh Bapat wrote:
> On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote
>  wrote:
>>>
>>> I think your fix is correct.  I slightly modified it along with updating
>>> nearby comments and added regression tests.
>>
>> I updated regression tests to reduce lines.  There is no point in
>> repeating tests like v2 patch did.
> 
> + *
> + * For hash partitioning however, it is possible to combine null and non-
> + * null keys in a pruning step, so do this only if *all* partition keys
> + * are involved in IS NULL clauses.
> 
> I don't think this is true. When equality conditions and IS NULL clauses cover
> all partition keys of a hash partitioned table and do not have contradictory
> clauses, we should be able to find the partition which will remain unpruned.

I was trying to say the same thing, but maybe the comment doesn't make
that clear.  How about this:

+ * For hash partitioning, if we found IS NULL clauses for some keys and
+ * OpExpr's for others, gen_prune_steps_from_opexps() will generate the
+ * necessary pruning steps if all partition keys are taken care of.
+ * If we found only IS NULL clauses, then we can generate the pruning
+ * step here but only if all partition keys are covered.

> I
> see that we already have this supported in get_matching_hash_bounds()
> /*
>  * For hash partitioning we can only perform pruning based on equality
>  * clauses to the partition key or IS NULL clauses.  We also can only
>  * prune if we got values for all keys.
>  */
> if (nvalues + bms_num_members(nullkeys) == partnatts)
> {
> 
>   */
> -if (!generate_opsteps)
> +if (!bms_is_empty(nullkeys) &&
> +(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
> + bms_num_members(nullkeys) == part_scheme->partnatts))
> 
> So, it looks like we don't need bms_num_members(nullkeys) ==
> part_scheme->partnatts there.

Yes, we can perform pruning in all three cases for hash partitioning:

* all keys are covered by OpExprs providing non-null keys

* some keys are covered by IS NULL clauses, others by OpExprs (all keys
  covered)

* all keys are covered by IS NULL clauses

... as long as we generate a pruning step at all.  The issue here was that
we skipped generating the pruning step due to a poorly coded condition in
gen_partprune_steps_internal in some cases.

> Also, I think, we don't know how some new partition strategy will treat NULL
> values so above condition looks wrong to me. Instead it should explicitly 
> check
> the strategies for which we know that the NULL values go to a single 
> partition.

How about if we explicitly spell out the strategies like this:

+if (!bms_is_empty(nullkeys) &&
+(part_scheme->strategy == PARTITION_STRATEGY_LIST ||
+ part_scheme->strategy == PARTITION_STRATEGY_RANGE ||
+ (part_scheme->strategy == PARTITION_STRATEGY_HASH &&
+  bms_num_members(nullkeys) == part_scheme->partnatts)))

The proposed comment also describes why the condition looks like that.

>  /*
> - * Note that for IS NOT NULL clauses, simply having step suffices;
> - * there is no need to propagate the exact details of which keys are
> - * required to be NOT NULL.  Hash partitioning expects to see actual
> - * values to perform any pruning.
> + * There are no OpExpr's, but there are IS NOT NULL clauses, which
> + * can be used to eliminate the null-partition-key-only partition.
> 
> I don't understand this. When there are IS NOT NULL clauses for all the
> partition keys, it's only then that we could eliminate the partition 
> containing
> NULL values, not otherwise.

Actually, there is only one case where the pruning step generated by that
block of code is useful -- to prune a list partition that accepts only
nulls.  List partitioning only allows one partition key, so this works,
but let's say only accidentally.  I modified the condition as follows:

+else if (!generate_opsteps && !bms_is_empty(notnullkeys) &&
+ bms_num_members(notnullkeys) == part_scheme->partnatts)

Attached updated patch.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..a89cec0759 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,47 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}
 
 	/*
-	 * If generate_opsteps is set to false it means no OpExprs were directly
-	 * present in the input list.
+	 * For list and range partitioning, null partition keys can only be found
+	 * in one designated par

Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 16:29, Kato, Sho wrote:
> I also benchmarked PG10.
> Actually, SELECT latency on PG11beta2 + patch1 is faster than PG10.
> 
> SELECT latency with 800 leaf partitions
> --
> PG10  5.62 ms
> PG11  3.869 ms
> 
> But, even on PG11, a SELECT statement takes 21.102 ms in a benchmark with
> 1600 leaf partitions.
> It takes a long time even though PG11's partition pruning algorithm is a
> binary search.

Yeah, the pruning algorithm change evidently removed only a small
percentage of the overhead.

>> The overheads I mention stem from the fact that for partitioning we still 
>> rely on the old planner code that's used to perform inheritance planning, 
>> which requires locking and opening *all* partitions.
> 
> I debugged UPDATE statement execution on a partitioned table.
> range_table_mutator seems to process all leaf partitions.

That's one of the symptoms of it, yes.

With the existing code for UPDATE/DELETE planning of partitioned tables,
the whole Query structure is modified to translate the parent's attribute
numbers to partition attribute numbers and planned freshly *for every
partition*.  range_table_mutator() is invoked sometime during that
translation process and itself loops over all partitions, so I'd think it
is susceptible to being prominently visible in perf profiles.

Thanks,
Amit




Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-13 Thread Amit Langote
Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
> 
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance.  Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
> 
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it.  The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order
consistent with other sites that may want to lock tables in the same
partition tree but with a possibly conflicting lock mode.  If we remove
the find_all_inheritors call in ExecSetupPartitionPruneState (like your
0002 does), we may end up locking partitions in arbitrary order in a given
transaction, because input tuples will be routed to various partitions in
an order that's not predetermined.

But, maybe it's not necessary to be that paranoid.  If we've locked the
parent, any concurrent lockers would have to wait for the lock on the
parent anyway, so it doesn't matter in which order tuple routing locks the
partitions.
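
A tiny sketch of what that less paranoid approach could look like at the
point where a tuple has been routed to a partition; 'partrelid' and the
lock mode here are illustrative assumptions:

	/*
	 * Lock only the partition this tuple routed to, relying on the
	 * already-held lock on the parent to serialize against conflicting
	 * DDL, then open it without acquiring any further lock.
	 */
	LockRelationOid(partrelid, RowExclusiveLock);
	partrel = heap_open(partrelid, NoLock);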

> The locking is not the only slow thing. I found the following to also be slow:
> 
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
> 
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
> 
> I've written fixes for items 1-6 above.
> 
> I did:
> 
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list all seem legitimate to me, as do your proposed fixes
for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains mapping from local partition indexes (0 to
partdesc->nparts - 1) to those that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.

The idea I had was inspired by looking at partitions_init stuff in your
patch.  We could allocate proute->partition_dispatch_info and
proute->partitions arrays to be of a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over all of the partition tree.  Instead we create just
one PartitionDispatch object of the root parent table, whose indexes array
is initialized with -1, meaning none of the partitions has been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch for it and 
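
To illustrate, a rough sketch of that lazy-initialization idea; all names
below (is_leaf, the indexes[] array holding -1 for not-yet-seen partitions,
ExecInitPartitionInfo, ExecInitPartitionDispatchInfo) are illustrative
assumptions for a design being proposed, not settled API:

	/* In ExecFindPartition, after routing picks a partition: */
	partidx = get_partition_for_tuple(dispatch, values, isnull);
	if (dispatch->indexes[partidx] < 0)
	{
		/* First tuple routed here; build this partition's state now. */
		if (dispatch->is_leaf[partidx])
			dispatch->indexes[partidx] =
				ExecInitPartitionInfo(mtstate, proute, estate, partidx);
		else
			dispatch->indexes[partidx] =
				ExecInitPartitionDispatchInfo(proute, estate, partidx);
	}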

Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Amit Langote
Thanks Ashutosh for reporting and Dilip for the analysis and the patch.

On 2018/07/11 21:39, Dilip Kumar wrote:
> On Wed, Jul 11, 2018 at 5:36 PM, amul sul  wrote:
>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:
> 
>>>
>> I am not sure that I have understood the following comments
>>  11 +* Generate one prune step for the information derived from IS NULL,
>>  12 +* if any.  To prune hash partitions, we must have found IS NULL
>>  13 +* clauses for all partition keys.
>>  14  */
>>
>> I am not sure that I have understood this -- no such restriction is
>> required to prune the hash partitions, if I am not missing anything.
> 
> Maybe it's not very clear but these are the original comments I have
> retained.  Just moved it out of the (!generate_opsteps) condition.
> 
> Just to explain this comment, consider the below example,
> 
> create table hp (a int, b text) partition by hash (a, b);
> create table hp0 partition of hp for values with (modulus 4, remainder 0);
> create table hp3 partition of hp for values with (modulus 4, remainder 3);
> create table hp1 partition of hp for values with (modulus 4, remainder 1);
> create table hp2 partition of hp for values with (modulus 4, remainder 2);
> 
> postgres=# insert into hp values (1, null);
> INSERT 0 1
> postgres=# insert into hp values (2, null);
> INSERT 0 1
> postgres=# select tableoid::regclass, * from hp;
>  tableoid | a | b
> --+---+---
>  hp1  | 1 |
>  hp2  | 2 |
> (2 rows)
> 
> Now, if we query based on "b is null" then we cannot decide which
> partition should be pruned, whereas in the case
> of other schemes, it will go to the default partition so we can prune all
> other partitions.

That's right.  By generating a pruning step with only nullkeys set, we are
effectively discarding OpExprs that may have been found for some partition
keys.  That's fine for list/range partitioning, because nulls can only be
found in a designated partition, so it's okay to prune all other
partitions and for that it's enough to generate the pruning step like
that.  For hash partitioning, nulls could be contained in any partition so
it's not okay to discard OpExpr's like that.  We can generate pruning
steps with combination of null and non-null keys in the hash partitioning
case if there are any OpExprs.

I think your fix is correct.  I slightly modified it along with updating
nearby comments and added regression tests.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}
 
 	/*
-	 * If generate_opsteps is set to false it means no OpExprs were directly
-	 * present in the input list.
+	 * For list and range partitioning, null partition keys can only be found
+	 * in one designated partition, so if there are IS NULL clauses containing
+	 * partition keys we should generate a pruning step that gets rid of all
+	 * partitions but the special null-accepting partitions.  For range
+	 * partitioning, that means we will end up disregarding OpExpr's that may
+	 * have been found for some other keys, but that's fine, because it is not
+	 * possible to prune range partitions with a combination of null and
+	 * non-null keys.
+	 *
+	 * For hash partitioning however, it is possible to combine null and non-
+	 * null keys in a pruning step, so do this only if *all* partition keys
+	 * are involved in IS NULL clauses.
 	 */
-	if (!generate_opsteps)
+	if (!bms_is_empty(nullkeys) &&
+		(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+		 bms_num_members(nullkeys) == part_scheme->partnatts))
+	{
+		PartitionPruneStep *step;
+
+		step = gen_prune_step_op(context, InvalidStrategy,
+								 false, NIL, NIL, nullkeys);
+		result = lappend(result, step);
+	}
+	else if (!generate_opsteps && !bms_is_empty(notnullkeys))
 	{
 		/*
-		 * Generate one prune step for the information derived from IS NULL,
-		 * if any.  To prune hash partitions, we must have found IS NULL
-		 * clauses for all partition keys.
+		 * There are no OpExpr's, but there are IS NOT NULL clauses, which
+		 * can be used to eliminate the null-partition-key-only partition.
 		 */
-		if (!bms_is_empty(nullkeys) &&
-			(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
-			 bms_num_members(nullkeys) == part_scheme->partnatts))
-		{
-			PartitionPruneStep *step;
+

Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-11 Thread Amit Langote
On 2018/07/12 14:32, Amit Langote wrote:
> Thanks Ashutosh for reporting and Dilip for the analysis and the patch.
> 
> On 2018/07/11 21:39, Dilip Kumar wrote:
>> On Wed, Jul 11, 2018 at 5:36 PM, amul sul  wrote:
>>> On Wed, Jul 11, 2018 at 5:10 PM Dilip Kumar  wrote:
>>
>>>>
>>> I am not sure that I have understood the following comments
>>>  11 +* Generate one prune step for the information derived from IS NULL,
>>>  12 +* if any.  To prune hash partitions, we must have found IS NULL
>>>  13 +* clauses for all partition keys.
>>>  14  */
>>>
>>> I am not sure that I have understood this -- no such restriction is
>>> required to prune the hash partitions, if I am not missing anything.
>>
>> Maybe it's not very clear but these are the original comments I have
>> retained.  Just moved it out of the (!generate_opsteps) condition.
>>
>> Just to explain this comment, consider the below example,
>>
>> create table hp (a int, b text) partition by hash (a, b);
>> create table hp0 partition of hp for values with (modulus 4, remainder 0);
>> create table hp3 partition of hp for values with (modulus 4, remainder 3);
>> create table hp1 partition of hp for values with (modulus 4, remainder 1);
>> create table hp2 partition of hp for values with (modulus 4, remainder 2);
>>
>> postgres=# insert into hp values (1, null);
>> INSERT 0 1
>> postgres=# insert into hp values (2, null);
>> INSERT 0 1
>> postgres=# select tableoid::regclass, * from hp;
>>  tableoid | a | b
>> --+---+---
>>  hp1  | 1 |
>>  hp2  | 2 |
>> (2 rows)
>>
>> Now, if we query based on "b is null" then we cannot decide which
>> partition should be pruned, whereas in the case
>> of other schemes, it will go to the default partition so we can prune all
>> other partitions.
> 
> That's right.  By generating a pruning step with only nullkeys set, we are
> effectively discarding OpExprs that may have been found for some partition
> keys.  That's fine for list/range partitioning, because nulls can only be
> found in a designated partition, so it's okay to prune all other
> partitions and for that it's enough to generate the pruning step like
> that.  For hash partitioning, nulls could be contained in any partition so
> it's not okay to discard OpExpr's like that.  We can generate pruning
> steps with combination of null and non-null keys in the hash partitioning
> case if there are any OpExprs.
> 
> I think your fix is correct.  I slightly modified it along with updating
> nearby comments and added regression tests.

I updated regression tests to reduce lines.  There is no point in
repeating tests like v2 patch did.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..d9612bf65b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,42 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
 	}
 
 	/*
-	 * If generate_opsteps is set to false it means no OpExprs were directly
-	 * present in the input list.
+	 * For list and range partitioning, null partition keys can only be found
+	 * in one designated partition, so if there are IS NULL clauses containing
+	 * partition keys we should generate a pruning step that gets rid of all
+	 * partitions but the special null-accepting partitions.  For range
+	 * partitioning, that means we will end up disregarding OpExpr's that may
+	 * have been found for some other keys, but that's fine, because it is not
+	 * possible to prune range partitions with a combination of null and
+	 * non-null keys.
+	 *
+	 * For hash partitioning however, it is possible to combine null and non-
+	 * null keys in a pruning step, so do this only if *all* partition keys
+	 * are involved in IS NULL clauses.
 	 */
-	if (!generate_opsteps)
+	if (!bms_is_empty(nullkeys) &&
+		(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
+		 bms_num_members(nullkeys) == part_scheme->partnatts))
+	{
+		PartitionPruneStep *step;
+
+		step = gen_prune_step_op(context, InvalidStrategy,
+								 false, NIL, NIL, nullkeys);
+		result = lappend(result, step);
+	}
+	else if (!generate_opsteps && !bms_is_empty(notnullkeys))
 	{
 		/*
-		 * Generate one prune step for the information derived from IS NULL,
-		 * if any.  To prune hash partitions, we must have

Re: Problem on pg_dump RANGE partition with expressions

2018-07-12 Thread Amit Langote
Nagata-san,

On 2018/07/12 16:59, Yugo Nagata wrote:
> Hi,
> 
> While looking into another thread[1], I found a problem with pg_dump of a range
> partitioned table using expressions. When we create a range partitioned table,
> we cannot use a column more than once in the partition key.
>  
>  postgres=# create table t (i int) partition by range(i,i);
>  ERROR:  column "i" appears more than once in partition key
> 
> On the other hand, the following query using expression is allowed.
> 
>  postgres=# create table test (i int) partition by range(i,(i));
>  CREATE TABLE
> 
> However, when we use pg_dump for this, we get
> 
>  CREATE TABLE public.test (
>  i integer
>  )
>  PARTITION BY RANGE (i, i);
> 
> , and we cannot restore this due to the error.

Oops.

> I can consider three approaches to resolve this.
> 
> 
> 1) Allow to appear more than once in range partition key
> 
> I don't understand why there is this restriction. If we have no clear reason, 
> can we rip out this restriction?

I can't recall exactly, but back when I wrote this code, I might have been
thinking that such a feature is useless and would actually just end up
being a foot gun for users.

Although, I'm open to ripping it out if it's seen as being overly restrictive.

> 2) Forbid this kind of partition key
> 
> If we can make more checks and forbid such partition keys as RANGE(i,(i)), 
> we can avoid the problem though it is a bit rigorous.

If there is a way for transformPartitionSpec to conclude that i and (i)
are in fact the same thing and error out, which is perhaps not that hard
to do, then maybe we could do that.  That is, if we don't want to rip the
error out as your proposed solution (1) above.

> 3) Treat expressions as it is
> 
> For some reasons, expressions like "(column)" or "(column COLLATE something)"
> is treated like simple attributes in the current implementation
> (in ComputePartitionAttr() ).
> 
> If we treat these expressions as it is, pg_dump result will be the same 
> expression as when the partition table is created, and we can restore this 
> successfully.

Hmm, I think it's better to simplify (column) into a simple column
reference instead of uselessly maintaining it as an expression.  There's
no other good reason to do this.

Thanks,
Amit




Re: no partition pruning when partitioning using array type

2018-07-12 Thread Amit Langote
On 2018/07/12 2:33, Alvaro Herrera wrote:
> On 2018-Jul-11, Amit Langote wrote:
> 
>> On 2018/07/11 13:12, Alvaro Herrera wrote:
>>> On 2018-Jul-11, Amit Langote wrote:
>>>
>>>> What's the solution here then?  Prevent domains as partition key?
>>>
>>> Maybe if a domain is used in a partition key somewhere, prevent
>>> constraints from being added?
>>
>> Maybe, but I guess you mean only prevent adding such a constraint
>> after-the-fact.
> 
> Yeah, any domain constraints added before won't be a problem.  Another
> angle on this problem is to verify partition bounds against the domain
> constraint being added; if they all pass, there's no reason to reject
> the constraint.

That's an idea.  It's not clear to me how easy it is to get hold of all
the partition bounds that contain domain values.  How do we get from the
domain on which a constraint is being added to the relpartbound which
would contain the partition bound value of the domain?

> But I worry that Tom was using domain constraints as just an example of
> a more general problem that we can get into.  
> 
> 
>>> Another thing worth considering: are you prevented from dropping a
>>> domain that's used in a partition key?  If not, you get an ugly message
>>> when you later try to drop the table.
>>
>> Yeah, I was about to write a message about that.  I think we need to teach
>> RemoveAttributeById, which dependency.c calls when dropping the
>> referencing domain with cascade option, to abort if the attribute passed
>> to it belongs to the partition key of the input relation.
> 
> Actually, here's another problem: why are we letting a column on a
> domain become partition key, if you'll never be able to create a
> partition?  It seems that for pg11 the right reaction is to check
> the partition key and reject this case.

Yeah, that seems much safer, and given how things stand today, no users
would complain if we do this.

> I'm not sure of this implementation -- I couldn't find any case where we
> reject the deletion on the function called from doDeletion (and we don't
> have a preliminary phase on which to check for this, either).  Maybe we
> need a pg_depend entry to prevent this, though it seems a little silly.
> Maybe we should add a preliminary verification phase in
> deleteObjectsInList(), where we invoke object-type-specific checks.

Doing a pre-check based fix had crossed my mind, but I didn't try hard to
pursue it.  If we decide to just prevent domain partition keys, we don't
need to try too hard now to come up with a nice implementation for this,
right?

Thanks,
Amit




Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-07-12 Thread Amit Langote
Thanks Ashutosh.

On 2018/07/10 22:50, Ashutosh Bapat wrote:
> I didn't see any hackers thread linked to this CF entry. Hence sending this 
> mail through CF app.

Hmm, yes.  I hadn't posted the patch to -hackers.

> The patch looks good to me. It applies cleanly, compiles cleanly and make 
> check passes.
> 
> I think you could avoid condition
> +   /* Do not override parent's NOT NULL constraint. */
> +   if (restdef->is_not_null)
> +   coldef->is_not_null = restdef->is_not_null;
> 
> by rewriting this line as
> coldef->is_not_null = coldef->is_not_null || restdef->is_not_null;

Agreed, done like that.

> The comment looks a bit odd either way since we are changing 
> coldef->is_not_null based on restdef->is_not_null. That's because of the 
> nature of boolean variables. Maybe a bit of explanation is needed.

I have modified the comments around this code in the updated patch.

> On a side note, I see
> coldef->constraints = restdef->constraints;
> Shouldn't we merge the constraints instead of just overwriting those?

Actually, I checked that coldef->constraints is always NIL in this case.
coldef (ColumnDef) is constructed from a member in the parent's TupleDesc
earlier in that function and any constraints that may be present in the
parent's definition of the column are saved in a separate variable that's
returned to the caller as one list containing "old"/inherited constraints.
So, no constraints are getting lost here.

Attached is an updated patch.  I've updated some nearby comments as the
code around here seems pretty confusing, which I thought after having
returned to it after a while.

I have attached both a patch for PG10 and PG11/HEAD branches, which are
actually not all that different from each other.

Thanks,
Amit
From 1158793b899bbc1f52e37b2255bd74121a293495 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 6 Jun 2018 16:46:46 +0900
Subject: [PATCH v2] Fix bug regarding partition column option inheritance

It appears that the coding pattern in MergeAttributes prevents the
NOT NULL constraint and default value of a column from being
properly inherited from the parent's definition.
---
 src/backend/commands/tablecmds.c   | 49 +++---
 src/test/regress/expected/create_table.out | 13 
 src/test/regress/sql/create_table.sql  |  7 +
 3 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 81c3845095..b7a6c1792e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2181,6 +2181,15 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 	 */
 	if (is_partition && list_length(saved_schema) > 0)
 	{
+		/*
+		 * Each member in 'saved_schema' contains a ColumnDef containing
+		 * partition-specific options for the column.  Below, we merge the
+		 * information from each into the ColumnDef of the same name found in
+		 * the original 'schema' list before deleting it from the list.  So,
+		 * once we've finished processing all the columns from the original
+		 * 'schema' list, there shouldn't be any ColumnDefs left that came
+		 * from the 'saved_schema' list.
+		 */
 		schema = list_concat(schema, saved_schema);
 
 		foreach(entry, schema)
@@ -2208,14 +2217,44 @@ MergeAttributes(List *schema, List *supers, char relpersistence,
 				if (strcmp(coldef->colname, restdef->colname) == 0)
 				{
 					/*
-					 * merge the column options into the column from the
-					 * parent
+					 * Merge the column options specified for the partition
+					 * into the column definition inherited from the parent.
 					 */
 					if (coldef->is_from_parent)
 					{
-						coldef->is_not_null = restdef->is_not_null;
-						coldef->raw_default = restdef->raw_default;
-						coldef->cooked_default = restdef->cooked_default;
+						/*
+						 * Combine using OR so that the NOT NULL constraint
+						 * in the parent's definition doesn't get lost.
+						 */
+						coldef->is_not_null = (coldef->is_not_null ||
+

Re: Incorrect comments in partition.c

2018-01-23 Thread Amit Langote
On 2018/01/23 20:43, Etsuro Fujita wrote:
> Here is a comment for get_qual_for_list in partition.c:
> 
>   * get_qual_for_list
>   *
>   * Returns an implicit-AND list of expressions to use as a list partition's
> - * constraint, given the partition key and bound structures.
> 
> I don't think the part "given the partition key and bound structures."
> is correct because we pass the *parent relation* and partition bound
> structure to that function.  So I think we should change that part as
> such.  get_qual_for_range has the same issue, so I think we need this
> change for that function as well.

Yeah.  It seems 6f6b99d1335 [1] changed their signatures to support
handling default partitions.

> Another one I noticed in comments in partition.c is:
> 
>  * get_qual_for_hash
>  *
>  * Given a list of partition columns, modulus and remainder
> corresponding to a
>  * partition, this function returns CHECK constraint expression Node for
> that
>  * partition.
> 
> I think the part "Given a list of partition columns, modulus and
> remainder corresponding to a partition" is wrong because we pass to that
> function the parent relation and partition bound structure the same way
> as for get_qual_for_list/get_qual_for_range.  So what about changing the
> above to something like this, similarly to
> get_qual_for_list/get_qual_for_range:
> 
>  Returns a CHECK constraint expression to use as a hash partition's
>  constraint, given the parent relation and partition bound structure.

Makes sense.

> Also:
> 
>  * The partition constraint for a hash partition is always a call to the
>  * built-in function satisfies_hash_partition().  The first two
> arguments are
>  * the modulus and remainder for the partition; the remaining arguments
> are the
>  * values to be hashed.
> 
> I also think the part "The first two arguments are the modulus and
> remainder for the partition;" is wrong (see satisfies_hash_partition).
> But I don't think we need to correct that here because we have described
> about the arguments in comments for that function.  So I'd like to
> propose removing the latter comment entirely from the above.

Here, too.  The comment seems to have been obsoleted by f3b0897a121 [2].
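For reference, if memory serves (please double-check against the current
code), the generated hash partition constraint actually looks like

  satisfies_hash_partition('parent'::regclass, modulus, remainder,
                           key-column-values...)

that is, the parent table's OID comes first, which is why the quoted
sentence about the first two arguments no longer holds.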

Thanks,
Amit

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6f6b99d1335

[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=f3b0897a121




Re: [HACKERS] path toward faster partition pruning

2018-01-18 Thread Amit Langote
Thank you Horiguchi-san!

On 2018/01/19 12:00, Kyotaro HORIGUCHI wrote:
> At Thu, 18 Jan 2018 11:41:00 -0800, Andres Freund  wrote:
>> Hi Amit,
>>
>> It seems your mail system continually adds "[Sender Address Forgery]"
>> prefixes to messages. E.g. this mail now has
>> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: [Sender 
>> Address Forgery]Re: [HACKERS] path toward faster partition pruning
>> as its subject, whereas the mail you're replying to only had
>> Subject: Re: [Sender Address Forgery]Re: [Sender Address Forgery]Re: 
>> [HACKERS] path toward faster partition pruning
>> two of them.
>>
>> I think the two previous occurrences of this also are from you.
>>
>> This is somewhat annoying, could you try to figure out a) what the
>> problem is b) how to prevent the subject being edited like that?

Sorry about that.  I had noticed it and had manually edited the subject
line once or twice before, but failed to do it every time.

> Our mail server is failing to fetch the SPF record for David's mails
> that are received directly (not via the -hackers ML), and the server adds
> the subject header.  It is failing to fetch the SPF record for
> 2ndquadrant.com.  The reason might be that the envelope-from of
> his mails is not consistent with his server's IP address.

I was able to notice that too.  It seems that the Received-SPF: PermError
and X-SPF-Status = fail/warn headers started appearing in the emails only
a few months ago, so it appears our mail server was changed to notice
these discrepancies only recently.  The first email in this thread that
got such a subject line was the following, which is my reply to David's
email:

https://www.postgresql.org/message-id/b8094e71-2c73-ed8e-d8c3-53f232c8c049%40lab.ntt.co.jp

There are emails from David before that one, and I couldn't see such
headers in any of them, so no alterations of the subject line there.

> Anyway, mails that come via the -hackers ML don't suffer from this, so
> what Amit (and I) can do on our side is one of the following.
> 
> - Being careful to reply to the mails coming via the ML.
> - Remove the added header by hand.

Yeah, will make sure to do that.

> And I'd like to ask David to check out his mail environment so
> that SPF record is available for his message.

That'd be nice too.

Thanks,
Amit




Re: [HACKERS] Adding column_constraint description in ALTER TABLE synopsis

2018-01-22 Thread Amit Langote
On 2018/01/23 8:57, Thomas Munro wrote:
> On Tue, Jan 23, 2018 at 12:41 PM, Thomas Munro
>  wrote:
>> On Mon, Jan 15, 2018 at 2:32 PM, Stephen Frost  wrote:
>>> If someone else would like to review it, that'd be great, otherwise I'll
>>> probably get it committed soon.
>>
>> FYI the v2 doesn't build:
>>
>> ref/alter_table.sgml:135: parser error : Opening and ending tag
>> mismatch: refentry line 6 and synopsis
>> 
> 
> Here's an update to move the new stuff to the correct side of the
> closing "</synopsis>".  Builds for me, and the typesetting looks OK.
> I'm not sure why the preexisting bogo-productions have inconsistent
> indentation levels (looking at table_constraint_using_index) but
> that's not this patch's fault.

Thanks for fixing that.

I noticed that partition_bound_spec in the patch is missing hash partition
bound syntax that was added after the original patch was written.  Fixed
in the attached.

Thanks,
Amit
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 286c7a8589..5cc0519c8c 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -85,6 +85,20 @@ ALTER TABLE [ IF EXISTS ] name
     OWNER TO { new_owner | CURRENT_USER | SESSION_USER }
     REPLICA IDENTITY { DEFAULT | USING INDEX index_name | FULL | NOTHING }
 
+and column_constraint is:
+
+[ CONSTRAINT constraint_name ]
+{ NOT NULL |
+  NULL |
+  CHECK ( expression ) [ NO INHERIT ] |
+  DEFAULT default_expr |
+  GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ] |
+  UNIQUE index_parameters |
+  PRIMARY KEY index_parameters |
+  REFERENCES reftable [ ( refcolumn ) ] [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ]
+    [ ON DELETE action ] [ ON UPDATE action ] }
+[ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+
 and table_constraint is:
 
 [ CONSTRAINT constraint_name ]
@@ -96,11 +110,27 @@ ALTER TABLE [ IF EXISTS ] name
 [ MATCH FULL | MATCH PARTIAL | MATCH SIMPLE ] [ ON DELETE action ] [ ON UPDATE action ] }
 [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
 
+index_parameters in UNIQUE, PRIMARY KEY, and EXCLUDE constraints are:
+
+[ WITH ( storage_parameter [= value] [, ... ] ) ]
+[ USING INDEX TABLESPACE tablespace_name ]
+
+exclude_element in an EXCLUDE constraint is:
+
+{ column_name | ( expression ) } [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ]
+
 and table_constraint_using_index is:
 
 [ CONSTRAINT constraint_name ]
 { UNIQUE | PRIMARY KEY } USING INDEX index_name
 [ DEFERRABLE | NOT DEFERRABLE ] [ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]
+
+and partition_bound_spec is:
+
+IN ( { numeric_literal | string_literal | NULL } [, ...] ) |
+FROM ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] )
+  TO ( { numeric_literal | string_literal | MINVALUE | MAXVALUE } [, ...] ) |
+WITH ( MODULUS numeric_literal, REMAINDER numeric_literal )
 
 


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-02-28 Thread Amit Langote
On 2018/03/01 1:03, Robert Haas wrote:
> On Tue, Feb 27, 2018 at 7:46 PM, Alvaro Herrera
>  wrote:
>> I updated Amit Langote's patch for INSERT ON CONFLICT DO UPDATE[1].
>> Following the lead of edd44738bc88 ("Be lazier about partition tuple
>> routing.") this incarnation only does the necessary push-ups for the
>> specific partition that needs it, at execution time.  As far as I can
>> tell, it works as intended.
>>
>> I chose to refuse the case where the DO UPDATE clause causes the tuple
>> to move to another partition (i.e. you're updating the partition key of
>> the tuple).  While it's probably possible to implement that, it doesn't
>> seem a very productive use of time.
> 
> I would have thought that to be the only case we could support with
> the current infrastructure.  Doesn't a correct implementation for any
> other case require a global index?

I think Alvaro is talking here about the DO UPDATE action part, not the
conflict checking part.  The latter will definitely require global
indexes if conflicts were to be checked on columns not containing the
partition key.

The case Alvaro mentions arises after checking the conflict, presumably
using an inherited unique index whose keys must include the partition
keys.  If the conflict action is DO UPDATE and its SET clause changes
partition key columns such that the row would have to change partitions,
then the current patch will result in an error.  I think that's because
making update row movement work in this case will require some adjustments
to what 2f178441044 (Allow UPDATE to move rows between partitions)
implemented.  We wouldn't have things set up in the ModifyTableState that
the current row-movement code depends on being set up; for example, there
wouldn't be per-subplan ResultRelInfos in the ON CONFLICT DO UPDATE case.
The following Assert in ExecUpdate() will fail, for instance:

map_index = resultRelInfo - mtstate->resultRelInfo;
Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
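
For the record, here is a sketch of the kind of statement that runs into
this (hypothetical table names):

create table p (a int primary key) partition by range (a);
create table p1 partition of p for values from (1) to (10);
create table p2 partition of p for values from (10) to (20);

insert into p values (5);
-- the DO UPDATE action changes the partition key such that the row
-- would have to move from p1 to p2
insert into p values (5) on conflict (a) do update set a = 15;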

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-02-28 Thread Amit Langote
On 2018/02/28 19:14, Ashutosh Bapat wrote:
> On Wed, Feb 28, 2018 at 6:42 AM, Amit Langote wrote:
>> BTW, should there be a relevant test in partition_join.sql?  If yes,
>> attached a patch (partitionwise-join-collation-test-1.patch) to add one.
> 
> A partition-wise join path will be created but discarded because of
> higher cost. This test won't see it in that case. So, please add some
> data like other tests and add command to analyze the partitioned
> tables. That kind of protects from something like that.

Thanks for the review.

Hmm, the added test is such that the partition collations won't match, so
partition-wise join won't be considered at all due to differing
PartitionSchemes, unless I'm missing something.

Thanks,
Amit




Re: no partition pruning when partitioning using array type

2018-03-01 Thread Amit Langote
On 2018/02/02 0:20, Robert Haas wrote:
> On Thu, Feb 1, 2018 at 4:42 AM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>> I hope someone else chimes in as well. :)
>>
>> Bug #15042 [1] seems to be caused by this same problem.  There, a
>> RelabelType node is being slapped (by the partitioning code) on a Var node
>> of a partition key on enum.
>>
>> Attached updated patch.
> 
> Can I get anyone else to weigh in on whether this is likely to be
> safe?  Paging people who understand constraint exclusion...

Added this to CF (actually moved to the September one after first having
added it to the CF that is just getting started).

It seems to me that we don't need to go with my originally proposed
approach of teaching predtest.c to strip RelabelType nodes.  Apparently,
it's only partition.c that adds the RelabelType node around partition key
nodes in such cases, that is, when the partition key is of an array,
record, enum, or range type.  No other part of the system ends up adding
one as far as I can see.  The parser's make_op(), for example, which is
used to generate an OpExpr for a qual involving the partition key, won't
put a RelabelType around the partition key node if the operator in
question has "pseudo"-types listed as the declared types of its left and
right arguments.

So I revised the patch to drop all the predtest.c changes and instead
modify get_partition_operator() to avoid generating RelabelType in such
cases.  Please find it attached.
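
To illustrate the fix (a sketch along the lines of the added regression
test; the second partition is hypothetical):

create table arrlp (a int[]) partition by list (a);
create table arrlp12 partition of arrlp for values in ('{1}', '{2}');
create table arrlp3 partition of arrlp for values in ('{3}');

explain (costs off) select * from arrlp where a = '{1}'::int[];

Without the fix, the internally generated partition constraint wraps the
partition key in a RelabelType, so constraint exclusion cannot match it
against the query's qual and arrlp3 is not excluded; with the fix, only
arrlp12 is scanned.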

Thanks,
Amit
From b0ebef9526a2abb162e877bc3b73d8194ffb1d2f Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Dec 2017 19:09:31 +0900
Subject: [PATCH v3] In get_partition_operator(), avoid RelabelType in some
 cases

We never have pg_amop catalog entries with "real" array, enum, record,
or range types as leftop and rightop types.  Instead, operators
manipulating such types have entries with "pseudo-types" listed as leftop
and rightop types.  For example, for enums, all operator entries are
marked with anyenum as their leftop and rightop types.  So, if we pass
"real" type OIDs to get_opfamily_member(), we get back an InvalidOid for
the aforementioned type categories.  We'd normally ask the caller of
get_partition_operator to add a RelabelType in such cases.  But we don't
need to in these cases, as the rest of the system doesn't.

This fixes the issue that the constraint exclusion code would fail to
match partition key quals against query restrictions due to these
extraneous RelabelType nodes surrounding partition key expressions
generated by partition.c.

Also, ruleutils.c can now resolve the operator names properly when
deparsing internally generated partition qual expressions.
---
 src/backend/catalog/partition.c   | 12 ++-
 src/test/regress/expected/create_table.out|  2 +-
 src/test/regress/expected/partition_prune.out | 45 +++
 src/test/regress/sql/partition_prune.sql  | 18 +++
 4 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index fcf763..cf71ba174e 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -1605,7 +1605,17 @@ get_partition_operator(PartitionKey key, int col, StrategyNumber strategy,
 		elog(ERROR, "missing operator %d(%u,%u) in opfamily %u",
 			 strategy, key->partopcintype[col], key->partopcintype[col],
 			 key->partopfamily[col]);
-		*need_relabel = true;
+
+		/*
+		 * For certain type categories, there don't exist pg_amop entries with
+		 * the given type OID as the operator's left and right operand type.
+		 * Instead, the entries have corresponding pseudo-types listed as the
+		 * left and right operand type; for example, anyenum for an enum type.
+		 * For such cases, it's not necessary to add the RelabelType node,
+		 * because other parts of the system won't add one either.
+		 */
+		if (get_typtype(key->partopcintype[col]) != TYPTYPE_PSEUDO)
+			*need_relabel = true;
 	}
 	else
 		*need_relabel = false;
diff --git a/src/test/regress/expected/create_table.out b/src/test/regress/expected/create_table.out
index 39a963888d..7764da3152 100644
--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -880,6 +880,6 @@ CREATE TABLE arrlp12 PARTITION OF arrlp FOR VALUES IN ('{1}', '{2}');
 
 +---+---+--+-+--+--+-
  a  | integer[] |   |  | | extended |  | 
 Partition of: arrlp FOR VALUES IN ('{1}', '{2}')
-Parti

Re: constraint exclusion and nulls in IN (..) clause

2018-03-14 Thread Amit Langote
On 2018/03/14 17:16, Amit Langote wrote:
> On 2018/03/10 13:40, Tom Lane wrote:
>> I wrote:
>>> I think it'd make more sense to see about incorporating that idea in
>>> predicate_implied_by_simple_clause/predicate_refuted_by_simple_clause.
>>
>> After further thought, it seems like the place to deal with this is
>> really operator_predicate_proof(), as in the attached draft patch
>> against HEAD.  This passes the smell test for me, in the sense that
>> it's an arguably correct and general extension of the proof rules,
>> but it could use more testing.
> 
> Thanks for the patch.  I agree it handles the case I presented my patch
> for in a more principled manner.  So, I've marked the CF entry for my
> patch as Rejected.

Oops, sorry I hadn't actually seen the CF entry before hitting send on
this email.  Seeing that Tom intends to attach his patch with this CF
entry, I will leave the entry alone for now.

Thanks,
Amit




Re: inserts into partitioned table may cause crash

2018-03-13 Thread Amit Langote
Fujita-san,

Thanks for the updates and sorry I couldn't reply sooner.

On 2018/03/06 21:26, Etsuro Fujita wrote:
> One thing I notice while working on this is this in ExecInsert/CopyFrom,
> which I moved to ExecPrepareTupleRouting as-is for the former:
>
> /*
>  * If we're capturing transition tuples, we might need to convert from
> the
>  * partition rowtype to parent rowtype.
>  */
>     if (mtstate->mt_transition_capture != NULL)
>     {
>         if (resultRelInfo->ri_TrigDesc &&
>             (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
>              resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
>         {
>             /*
>              * If there are any BEFORE or INSTEAD triggers on the partition,
>              * we'll have to be ready to convert their result back to
>              * tuplestore format.
>              */
>             mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
>             mtstate->mt_transition_capture->tcs_map =
>                 TupConvMapForLeaf(proute, rootRelInfo, leaf_part_index);
>         }
>
> Do we need to consider INSTEAD triggers here?  The partition is either a
> plain table or a foreign table, so I don't think it can have those
> triggers.  Am I missing something?

I think you're right.  We don't need to consider INSTEAD OF triggers here
if we know we're dealing with a partition which cannot have those.

Just to make sure, a word from Thomas would help.

On 2018/03/12 12:25, Etsuro Fujita wrote:
> (2018/03/09 20:18), Etsuro Fujita wrote:
>> Here are updated patches for PG10 and HEAD.
>>
>> Other changes:
>> * Add regression tests based on your test cases shown upthread
> 
> I added a little bit more regression tests and revised comments.  Please
> find attached an updated patch.

Thanks for adding the test.

I was concerned that ExecMaterializeSlot will be called twice now -- first
in ExecPrepareTupleRouting and then in ExecInsert -- instead of only once
in ExecInsert with the existing code, but perhaps it doesn't matter much
because most of the work would be done in the 1st call anyway.

Sorry if this is nitpicking that I should've brought up before, but
doesn't ExecPrepareTupleRouting do all the work that's needed for routing
a tuple, and hence isn't the name a bit misleading?  Maybe
ExecPerformTupleRouting?

Btw, I noticed that the patches place ExecPrepareTupleRouting (both the
declaration and the definition) at different relative locations within
nodeModifyTable.c in the HEAD branch vs. in the 10 branch.  It might be a
good idea to bring them to the same relative location to avoid hassle when
back-patching relevant patches in the future.  I did that in the attached
updated version (v4) of the patch for HEAD, which does not make any other
changes.  Although the patch for PG10 is unchanged, I still changed its
file name to contain v4.

Regards,
Amit
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2656,2668 **** CopyFrom(CopyState cstate)
  		if (cstate->transition_capture != NULL)
  		{
  			if (resultRelInfo->ri_TrigDesc &&
! 				(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
! 				 resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
  			{
  				/*
! 				 * If there are any BEFORE or INSTEAD triggers on the
! 				 * partition, we'll have to be ready to convert their
! 				 * result back to tuplestore format.
  				 */
  				cstate->transition_capture->tcs_original_insert_tuple = NULL;
  				cstate->transition_capture->tcs_map =
--- 2656,2667 ----
  		if (cstate->transition_capture != NULL)
  		{
  			if (resultRelInfo->ri_TrigDesc &&
! 				resultRelInfo->ri_TrigDesc->trig_insert_before_row)
  			{
  				/*
! 				 * If there are any BEFORE triggers on the partition,
! 				 * we'll have to be ready to convert their result back to
! 				 * tuplestore format.
  				 */
  				cstate->transition_capture->tcs_original_insert_tuple = NULL;
  				cstate->transition_capture->tcs_map =
***************
*** 2803,2814 **** CopyFrom(CopyState cstate)
  			 * tuples inserted by an INSERT command.
  			 */
  			processed++;
  
! 

Re: constraint exclusion and nulls in IN (..) clause

2018-03-14 Thread Amit Langote
On 2018/03/10 13:40, Tom Lane wrote:
> I wrote:
>> I think it'd make more sense to see about incorporating that idea in
>> predicate_implied_by_simple_clause/predicate_refuted_by_simple_clause.
> 
> After further thought, it seems like the place to deal with this is
> really operator_predicate_proof(), as in the attached draft patch
> against HEAD.  This passes the smell test for me, in the sense that
> it's an arguably correct and general extension of the proof rules,
> but it could use more testing.

Thanks for the patch.  I agree it handles the case I presented my patch
for in a more principled manner.  So, I've marked the CF entry for my
patch as Rejected.


Looking at the patch, I guess it'd be right to think that the case in
which the following block of code will be executed only arises for
OpExprs generated by predtest.c (that is, when iterating over a
ScalarArrayOpExpr):

     if (clause_const->constisnull)
+    {
+        /* If clause_op isn't strict, we can't prove anything */
+        if (!op_strict(clause_op))
+            return false;
+


That's because any other expr = null would've been reduced to
constant-false at some earlier point.

Then, in the same block I see that there is:

+    /*
+     * For weak implication, it's still possible for the proof to succeed,
+     * if the predicate can also be proven NULL.  In that case we've got
+     * NULL => NULL which is valid for this proof type.
+     */
+    if (pred_const->constisnull && op_strict(pred_op))
+        return true;

which, afaics, will never be able to return true, because
pred_const->constisnull will never be true here.  That's because the
following code that will run before the above code will determine the
result of operator_predicate_proof() in that case:

    if (equal(pred_leftop, clause_leftop))
    {
        if (equal(pred_rightop, clause_rightop))
        {
            /* We have x op1 y and x op2 y */
            return operator_same_subexprs_proof(pred_op, clause_op,
                                                refute_it);

I wonder if the result would be valid in that case, because the proof
cache should only be looked up for non-null sub-expressions, no?

> TBH, the change in the existing regression test case in inherit.sql
> makes me itch.  We've got
> 
> create table list_parted (
>   a   varchar
> ) partition by list (a);
> ...
> create table part_null_xy partition of list_parted for values in (null, 'xy');
> ...
> explain (costs off) select * from list_parted where a = 'ab' or a in (null, 
> 'cd');
> 
> Now, the fact that "null" is included in this query's IN clause is a
> complete no-op, because the IN is using a strict equality operator.
> So it's nice that the planner can see that and realize that there's
> no point in scanning part_null_xy ... but this means that the syntax
> that's been chosen for list partitioning is seriously misleading.
> "in (null, 'xy')" in the CREATE TABLE command has nothing to do with
> the semantics of that identical clause in any other context, and indeed
> it seems chosen in a way to confuse even (or especially?) seasoned
> experts.

Hmm, yeah, it is confusing.  Actually, what it really internally becomes
is the following expression:

  (a IS NULL) OR (a = 'xy')

or if the CREATE TABLE had "in (null, 'xy', 'yz')", the expression will become:

  (a IS NULL) OR (a = ANY (ARRAY['xy'::text, 'yz'::text]))

which, as you say, is not the same thing as how the original expression is
interpreted elsewhere.
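
One way to see that internal form is psql's \d+ on the partition, whose
footer shows the partition constraint (display details vary by version):

\d+ part_null_xy
...
Partition constraint: ((a IS NULL) OR ((a)::text = 'xy'::text))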

> I suppose it's too late to do anything about that now, but it sure
> seems like NULL should've been handled some other way.

Maybe we should write something in the documentation about that?

Thanks,
Amit




Re: inserts into partitioned table may cause crash

2018-03-14 Thread Amit Langote
On 2018/03/14 17:35, Etsuro Fujita wrote:
> (2018/03/14 17:25), Etsuro Fujita wrote:
>> (2018/03/14 14:54), Amit Langote wrote:
>>> Sorry that this may be nitpicking that I should've brought up before, but
>>> doesn't ExecPrepareTupleRouting do all the work that's needed for routing
>>> a tuple and hence isn't the name a bit misleading? Maybe,
>>> ExecPerformTupleRouting?
>>
>> Actually, I thought of that name as a candidate, too. But I used
>> ExecPrepareTupleRouting because I didn't think it would actually perform
>> all the work; because it wouldn't do the main work of routing, ie, route
>> an inserted tuple to the target partition, which is ExecInsert()'s job.
>> I agree that it would do all the work *needed for routing*, though.
> 
> One thing to add: having said that, I don't have any strong opinion about
> that.  How about leaving that for the committer?

Sure.  I agree with your point that "routing" isn't finished in that
function if we also consider actual insertion of tuple into the selected
partition a part of it.

Thanks,
Amit





Re: [HACKERS] path toward faster partition pruning

2018-03-14 Thread Amit Langote
On 2018/03/14 8:26, Alvaro Herrera wrote:
> By the way, I checked whether patch 0002 (additional tests) had an
> effect on coverage, and couldn't detect any changes in terms of
> lines/functions.  Were you able to find any bugs in your code thanks to
> the new tests that would not have been covered by existing tests?

All tests except those for hash partitioning got added as bugs were found
in the patch and fixed.  As you may know, constraint exclusion doesn't
help with pruning hash partitions, so those tests don't exercise any
existing functionality but are there for the *new* code.
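
To spell out why, for anyone following along: the implicit constraint of
a hash partition is an opaque function call, something like

  satisfies_hash_partition('hp'::regclass, 4, 0, a)

(a sketch, not actual test output), and predtest.c has no way to refute
that from a qual like a = 1.  So constraint-exclusion-based pruning never
eliminated hash partitions; only the new pruning code, which computes the
hash of the comparison value itself, can.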

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-03-15 Thread Amit Langote
On 2018/03/14 20:50, David Rowley wrote:
> On 14 March 2018 at 00:37, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> Attached is a significantly revised version of the patch, although I admit
>> it could still use some work with regard to comments and other cleanup.
> 
> Thanks for making all those changes. There's been quite a bit of churn!

Thank you David and Jesper for the reviews.  I'm halfway done addressing
the comments and will submit an updated patch by tomorrow.

Thanks,
Amit




Re: constraint exclusion and nulls in IN (..) clause

2018-03-06 Thread Amit Langote
On 2018/03/06 18:46, Emre Hasegeli wrote:
>> Patch teaches it to ignore nulls when it's known that the operator being
>> used is strict.  It is harmless and has the benefit that constraint
>> exclusion gives an answer that is consistent with what actually running
>> such a qual against a table's rows would do.
> 
> Yes, I understood that.  I just meant that I don't know if it is the
> best way to skip NULLs inside "next_fn".  Maybe the caller of the
> "next_fn"s should skip them.  Anyway, the committer can judge this
> better.

I think putting that inside those next_fn functions tends to centralize
the logic and makes it run only for the intended cases.

>> Yeah.  Rearranged the code to fix that.
> 
> This version looks correct to me.
> 
>> +   state->next = (state->next != NULL) ? lnext(state->next) : NULL;
>> +   node = (state->next != NULL) ? lfirst(state->next) : NULL;
> 
> I think it is unnecessary to check for (state->next != NULL) two
> times.  We can put those in a single if.

Hmm, state->next refers to two different pointer values on line 1 and line
2.  It may end up being set to NULL on line 1.  Am I missing something?
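
To spell it out, the two lines are equivalent to this (with the
intermediate value given a hypothetical name):

ListCell *newnext = (state->next != NULL) ? lnext(state->next) : NULL;

state->next = newnext;
node = (newnext != NULL) ? lfirst(newnext) : NULL;

The second test is against the value computed on line 1, which can be
NULL even when the original state->next wasn't, so the two tests can't
simply be folded into a single if.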

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-03-07 Thread Amit Langote
Hi.

On 2018/03/05 17:38, Amit Langote wrote:
> I'll
> post an update in a couple of days to report on how that works out.

I'm still working on this and getting most of the tests to pass with the
new code, but not all of them yet.

Thanks,
Amit




Re: Commit 4dba331cb3 broke ATTACH PARTITION behaviour.

2018-04-03 Thread Amit Langote
On 2018/04/03 14:45, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Mon, 2 Apr 2018 16:11:12 -0300, Alvaro Herrera  
> wrote:
>> Why do we need AccessExclusiveLock on all children of a relation that we
>> want to scan to search for rows not satisfying the constraint?  I think
>> it should be enough to get ShareLock, which prevents INSERT, no?  I have
>> a feeling I'm missing something here, but I don't know what, and all
>> tests pass with that change.

Thinking on this a bit, I see no problem with locking the children with
just ShareLock.  It was just paranoia that if we're going to lock the
table being attached with AEL, we must lock its children (if any) with
AEL, too.

> Mmm. I'm not sure if there's a lock-upgrade case, but the
> following sentence is left at the end of one of the modified
> comments. ATRewriteTables is called with AEL after that (which
> finally has no effect in this case).
> 
> |   But we cannot risk a deadlock by taking
> | * a weaker lock now and the stronger one only when needed.
> 
> I don't actually find places where the children's lock can be
> raised, but this seems to just follow the lock-parent-first
> principle.

No lock upgrade happens as of now.  The comment was added by commit
972b6ec20bf [1], which removed the code that could cause such a deadlock.
The comment fragment is simply trying to explain why we don't postpone the
locking of children to a later time, say, to the point where we actually
know that they need to be scanned.  Previously, the code next to the
comment used to lock the children using AccessShareLock, because at that
point we only needed to check whether the table being attached was one of
those children; they were later locked with AEL if it turned out that they
needed to be scanned to check the partition constraint.

> By the way check_default_allows_bound() (CREATE TABLE case)
> contains a similar usage of find_all_inheritors(default_rel,
> AEL).

Good catch.  Its job is more or less same as
ValidatePartitionConstraints(), except all the children (if any) are
scanned right away instead of queuing it like in the AT case.

>> Also: the change proposed to remove the get_default_oid_from_partdesc()
>> call actually fixes the bug, but to me it was not at all obvious why.
> 
> CommandCounterIncrement updates the content of a relcache entry
> via invalidation. It can be surprising for callers of a function
> like StorePartitionBound.
> 
> CommandCounterIncrement
>  
>RelationCacheInvalidateEntry
>  RelationFlushRelation
>RelationClearRelation

Because of the CCI() after storing the default partition OID into the
system catalog, RelationClearRelation() would change what
rel->rd_partdesc points to, where 'rel' is ATExecAttachPartition()'s
reference to the relcache entry of the parent that it passed to
StorePartitionBound().

So, whereas rel->rd_partdesc wouldn't contain the default partition
before StorePartitionBound() was called, it would after.
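
In code terms, the hazard is roughly this (a simplified sketch, not the
actual function body):

/* conclusions drawn from the parent's current PartitionDesc ... */
defaultPartOid =
    get_default_oid_from_partdesc(RelationGetPartitionDesc(rel));

StorePartitionBound(attachrel, rel, cmd->bound);    /* does CCI() internally */

/*
 * ... are stale here: relcache invalidation has rebuilt rel->rd_partdesc,
 * which now includes the default partition.
 */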

>> To figure out why, I first had to realize that
>> ValidatePartitionConstraints was lying to me, both in name and in
>> comments: it doesn't actually validate anything, it merely queues
>> entries so that alter table's phase 3 would do the validation.  I found
>> this extremely confusing, so I fixed the comments to match reality, and
>> later decided to rename the function also.
> 
> It is reasonable. Removing exccessive extension of lower-level
> partitions is also reasonable. The previous code extended
> inheritances in different manner for levels at odd and even
> depth.

I like the new code including the naming, but I notice this changes the
order in which we do the locking now.  There are still sites in the code
where the locking order is breadth-first, that is, as determined by
find_all_inheritors(), as this function would too previously.

Also note that besides the breadth-first -> depth-first change, this also
changes the locking order of partitions for a given partitioned table.
The OIDs in partdesc->oids[] are canonically ordered (that is, in the
order of their partition bounds), whereas find_inheritance_children(),
which is called by find_all_inheritors(), would lock them in the order in
which the individual OIDs were found in the system catalog.

Not sure if there is anything to be alarmed of here, but in all previous
discussions, this has been a thing to avoid.

>> At that point I was able to understand what the problem was: when
>> attaching the default partition, we were setting the constraint to be
>> validated for that partition to the correctly computed partition
>> constraint; and later in the same function we would set the constraint
>> to "the immature constraint" because we were now seeing that the default
>> partition OID was not invalid.  So it was rather a bug in
>> ValidatePartitionConstraints, in that it was accepting to set the
>> expression to validate when another expression had already been set!  I
>> added an assert to protect 

Re: [HACKERS] Runtime Partition Pruning

2018-04-04 Thread Amit Langote
Hi David.

On 2018/03/31 22:52, David Rowley wrote:
> The attached patchset is based on Amit's v45 faster partition pruning [1].
> 
> I've made a few changes since the v14 version. Since Amit's v45 patch
> now creates the partition pruning details in a data structure that can
> be copied from the planner over to the executor, it means this patch
> is now able to do much of the setup work for run-time pruning in the
> planner. Doing this now allows us to determine if run-time pruning is
> even possible at plan time, rather than the executor attempting it and
> sometimes wasting effort when it failed to find Params matching the
> partition key.
>
> Another change from the last version is that I've separated out the
> handling of exec Params and external Params.  The new patch now will
> perform a pruning step at executor startup if some external params
> match the partition key.  This is very beneficial to a PREPAREd OLTP
> type query against a partitioned table as it means we can skip
> sub-plan initialisation for all non-matching partitions.

This is quite nice.

> Initialising
> Append/MergeAppend/ModifyTable nodes with fewer subnodes than were
> originally planned did require a small change in explain.c in some
> code that was assuming the subnode arrays were the same length as the
> subplan list. I also ended up adding an Int property to EXPLAIN to show
> the number of subnodes that have been removed during execution.
> Unfortunately, there is a small corner case with this: in the case
> where all subnodes are removed, it leaves EXPLAIN with no subnodes to
> search for outer side Vars in. I didn't really see any sane way to
> handle this in EXPLAIN, so I think the best fix for this is what I've
> done, and that's just to always initialise at least 1 subnode, even
> when none match the external Params. Cost-wise, I don't think this is
> such a big deal as the cost savings here are coming from saving on
> initialising tens or hundreds of subnodes, not 1.

I have wondered about the possibility of setting a valid (non-dummy)
targetlist in Append and MergeAppend if they're created for a partitioned
table.  Currently they're overwritten by a dummy one using
set_dummy_tlist_references() in set_plan_refs() citing the following reason:

 * set_dummy_tlist_references
 *Replace the targetlist of an upper-level plan node with a simple
 *list of OUTER_VAR references to its child.
 *
 * This is used for plan types like Sort and Append that don't evaluate
 * their targetlists.  Although the executor doesn't care at all what's in
 * the tlist, EXPLAIN needs it to be realistic.

In fact, when I noticed this EXPLAIN behavior, I wondered whether it's
something we should have discussed when d3cc37f1d801a [1] went in.

> To put the new patch to the test, I tried pgbench -S -M prepared -s
> 100 with and without having modified pgbench_accounts to separate into
> 10 RANGE partitions of equal size.
> 
> A non-partitioned table was getting 12503 TPS.
> With partitioned tables, the old version of this patch was getting: 5470 TPS.
> With partitioned tables, the attached version gets 11247 TPS.
> For perspective, today's master with a partitioned table gets 4719 TPS.

Quite nice!

> So you can see it's a pretty good performance boost by skipping
> initialisation of the 9 non-matching subplans.  It's not hard to
> imagine the gains getting more significant with a larger number of
> partitions. Ideally, the performance of a partitioned table would be
> faster than the non-partitioned case, but please remember this query
> is a single row PK lookup SELECT, so is a very fast query in both
> cases. Partitioning cases should improve more as the table grows and
> indexes struggle to fit in RAM, or when many rows are being taken from
> the partition and being aggregated.
> 
> Unfortunately, the story is not quite as good for the non -M prepared
> version of the benchmark. This shows that planning time of partitioned
> table queries could still use some improvements. Amit's v45 patch
> certainly makes a dent in the planner slow performance here, but it's
> still nothing like as fast as the non-partitioned case. More work is
> required there in the future.

Certainly.  Things like the issue that we both replied to yesterday should
be addressed for starters [2].

> This patchset also contains a patch to improve the performance of
> inheritance planning of UPDATE/DELETE queries. This patch also
> implements run-time pruning for UPDATE/DELETE too. This also has a
> significant performance improvement for planning of UPDATE/DELETE
> operations on partitioned tables with a large number of partitions,
> performance is as follows:
> 
> Values in transactions per second.
> 
> Partitions = 1
> Unpatched: 7323.3
> Patched: 6573.2 (-10.24%)
> 
> Partitions = 2
> Unpatched: 6784.8
> Patched: 6377.1 (-6.01%)
> 
> Partitions = 4
> Unpatched: 5903.0
> Patched: 6106.8 (3.45%)
> 
> Partitions = 8
> Unpatched: 4582.0
> Patched: 5579.9 (21.78%)

Re: [HACKERS] path toward faster partition pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 10:31 AM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> On 7 April 2018 at 12:43, David Rowley <david.row...@2ndquadrant.com> wrote:
>> On 7 April 2018 at 12:35, Amit Langote <amitlangot...@gmail.com> wrote:
>>> So this same failure occurs on (noting the architecture):
>>>
>>> Seems to be because the hashing function used in partitioning gives a
>>> different answer on these machines for a given set of partition key
>>> values than on others.
>>
>> They all look like bigendian CPUs.
>
> I looked at all the regression test diffs for each of the servers you
> mentioned and I verified that the diffs match on each of the 7
> servers.
>
> Maybe the best solution is to pull those tests out of
> partition_prune.sql then create partition_prune_hash and just have an
> alternative .out file with the partitions which match on bigendian
> machines.
>
> We could also keep them in the same file, but that's a much bigger
> alternative file to maintain and more likely to get broken if someone
> forgets to update it.
>
> What do you think?

Yeah, that's an idea.

Is it alright though that the same data may end up in different hash
partitions depending on the architecture?  IIRC, that's the way we
decided to go when using hash partitioning, but it would've been
clearer if there was already some evidence in the regression tests that
that's what we've chosen, such as some existing tests for tuple
routing.
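
Something like this is what I mean (a sketch, not an actual test in the
suite):

create table hp (a int) partition by hash (a);
create table hp0 partition of hp for values with (modulus 2, remainder 0);
create table hp1 partition of hp for values with (modulus 2, remainder 1);
insert into hp values (1), (2), (3), (4);
select tableoid::regclass, a from hp order by a;

The tableoid output makes it visible which partition each row was routed
to, which would have documented that the mapping is
architecture-dependent.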

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

2018-04-06 Thread Amit Langote
Thank you Alvaro for rest of the cleanup and committing.

On Sat, Apr 7, 2018 at 5:28 AM, Alvaro Herrera  wrote:
> So I pushed this 25 minutes ago, and already there's a couple of
> buildfarm members complaining:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2019%3A55%3A07
>
> Both show exactly the same diff in test partition_prune:
>
> *** 
> /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/../pgsql/src/test/regress/expected/partition_prune.out
>  Fri Apr  6 15:55:08 2018
> --- 
> /home/pgbuildfarm/buildroot-termite/HEAD/pgsql.build/src/test/regress/results/partition_prune.out
>Fri Apr  6 16:01:40 2018
> ***
> *** 1348,1357 
>   --++-
>hp0  ||
>hp0  |  1 |
> !  hp0  |  1 | xxx
>hp3  | 10 | yyy
> !  hp1  || xxx
> !  hp2  | 10 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> --- 1348,1357 
>   --++-
>hp0  ||
>hp0  |  1 |
> !  hp0  | 10 | xxx
> !  hp3  || xxx
>hp3  | 10 | yyy
> !  hp2  |  1 | xxx
>   (6 rows)
>
>   -- partial keys won't prune, nor would non-equality conditions
> ***
> *** 1460,1466 
>  QUERY PLAN
>   -
>Append
> !->  Seq Scan on hp0
>Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1460,1466 
>  QUERY PLAN
>   -
>Append
> !->  Seq Scan on hp2
>Filter: ((a = 1) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***
> *** 1468,1474 
>QUERY PLAN
>   -
>Append
> !->  Seq Scan on hp1
>Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1468,1474 
>QUERY PLAN
>   -
>Append
> !->  Seq Scan on hp3
>Filter: ((a IS NULL) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***
> *** 1476,1482 
>   QUERY PLAN
>   --
>Append
> !->  Seq Scan on hp2
>Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> --- 1476,1482 
>   QUERY PLAN
>   --
>Append
> !->  Seq Scan on hp0
>Filter: ((a = 10) AND (b = 'xxx'::text))
>   (3 rows)
>
> ***
> *** 1494,1504 
>Append
>  ->  Seq Scan on hp0
>Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 
> 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> -->  Seq Scan on hp2
> -  Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 
> 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>  ->  Seq Scan on hp3
>Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 
> 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> ! (7 rows)
>
>   -- hash partitiong pruning doesn't occur with <> operator clauses
>   explain (costs off) select * from hp where a <> 1 and b <> 'xxx';
> --- 1494,1502 
>Append
>  ->  Seq Scan on hp0
>Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 
> 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
>  ->  Seq Scan on hp3
>Filter: (((a = 10) AND (b = 'yyy'::text)) OR ((a = 10) AND (b = 
> 'xxx'::text)) OR ((a IS NULL) AND (b IS NULL)))
> ! (5 rows)

So this same failure occurs on (noting the architecture):

ppc64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=quokka&dt=2018-04-06%2020%3A09%3A52

ia64:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=anole&dt=2018-04-06%2022%3A32%3A24

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tern&dt=2018-04-06%2022%3A58%3A13

ppc64 (POWER7):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2018-04-06%2023%3A02%3A13

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2018-04-06%2023%3A05%3A08

powerpc:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=locust&dt=2018-04-06%2023%3A13%3A23

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite&dt=2018-04-06%2023%3A40%3A07

Seems to be because the hashing function used in partitioning gives a
different answer on these machines for a given set of partition key
values than on others.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
 wrote:
> Everything else looks fine from my point of view.

Me too, although I still think having struct names PartitionPruning
and PartitionRelPruning is going to be a bit confusing.  We should
think about renaming the latter to something else.  I suggested
PartitionPruningDispatch(Data), but David doesn't seem to like it.
Maybe, PartitionPruneState, because it parallels the
PartitionPruneInfo that comes from the planner for every partitioned
table in the tree.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:58 PM, David Rowley
 wrote:
> Probably if we need to explain more there about how pruning works then
> it should be a fixup patch to 9fdb675fc, no?

Yes, I just replied and working on a patch.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2018-04-07 13:26:51 +0900, Amit Langote wrote:
>> On Sat, Apr 7, 2018 at 11:26 AM, David Rowley
>> <david.row...@2ndquadrant.com> wrote:
>> > Everything else looks fine from my point of view.
>>
>> Me too, although I still think having struct names PartitionPruning
>> and PartitionRelPruning is going to be a bit confusing.  We should
>> think about renaming the latter to something else.  I suggested
>> PartitionPruningDispatch(Data), but David doesn't seem to like it.
>> Maybe, PartitionPruneState, because it parallels the
>> PartitionPruneInfo that comes from the planner for every partitioned
>> table in the tree.
>
> I've not followed this thread/feature at all, but I don't find the
> comments atop partprune.c even remotely sufficient. Unless there's a
> README or such hidden somewhere?

Sorry there isn't a README and I agree partprune.c's header comment
could be improved quite a bit.

Just to be clear, that's the fault of the patch that was already
committed earlier today (9fdb675fc "Faster partition pruning"), not
this patch, which just extends partition.c's functionality to
implement additional planner and executor support for runtime pruning.

I'm drafting a patch that expands the partprune.c comment and will post shortly.

Thanks,
Amit



Re: [HACKERS] path toward faster partition pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:09 PM, Andres Freund  wrote:
> Hi,
>
> On 2018-04-07 15:49:54 +1200, David Rowley wrote:
>> Right, I suggest we wait and see if all members go green again as a
>> result of 40e42e1024c, and if they're happy then we could maybe leave
>> it as is with the 2 alternatives output files.
>
> At least the first previously borked animal came back green (termite).
>
>
>> I don't particularly think it matters which hash partition a tuple
>> goes into, as long as the hash function spreads the values out enough
>> and most importantly, the pruning code looks for the tuple in the
>> partition that it was actually inserted into in the first place.
>> Obviously, we also want to ensure we never do anything which would
>> change the matching partition in either minor or major version
>> upgrades too.
>
> +1

+1

Given that the difference only appeared on animals that David pointed
out have big-endian architecture, it seems we'd only need two output
files.  It does seem plausible that the extended hashing functions that
were added to support partitioning are affected by endianness.

Thank you David for creating the patches and Andres for committing it.
Buildfarm seems to be turning green where it had gone red due to the
hashing differences.

> I've also attempted to fix rhinoceros's failure I remarked upon a couple
> hours ago in
> https://postgr.es/m/20180406210330.wmqw42wqgiick...@alap3.anarazel.de

Thanks Andres.

Regards,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-06 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:58 PM, Amit Langote <amitlangot...@gmail.com> wrote:
> On Sat, Apr 7, 2018 at 1:31 PM, Andres Freund <and...@anarazel.de> wrote:
>> I've not followed this thread/feature at all, but I don't find the
>> comments atop partprune.c even remotely sufficient. Unless there's a
>> README or such hidden somewhere?
>
> Sorry there isn't a README and I agree partprune.c's header comment
> could be improved quite a bit.
>
> Just to be clear, that's the fault of the patch that was already
> committed earlier today (9fdb675fc "Faster partition pruning"), not
> this patch, which just extends partition.c's functionality to
> implement additional planner and executor support for runtime pruning.
>
> I'm drafting a patch that expands the partprune.c comment and will post 
> shortly.

See if the attached makes it any better.

Now I know we don't have the runtime pruning in yet, but since the
proposed patch would extend its functionality I have included its
description in the comment.

Thanks,
Amit


expand-partprune-header-comment.patch
Description: Binary data


Re: [HACKERS] path toward faster partition pruning

2018-04-07 Thread Amit Langote
On Sat, Apr 7, 2018 at 1:39 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Amit Langote <amitlangot...@gmail.com> writes:
>> Given that the difference only appeared on animals that David pointed
>> out have big-endian architecture, it seems we'd only need two output
>> files.
>
> Dunno, I'm wondering whether 32 vs 64 bit will make a difference.

There was one 32-bit animal in the failing set, which apparently
produces the same hashes as the others (allegedly due to the endianness
difference).

powerpc 32-bit userspace on ppc64 host:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite=2018-04-06%2023%3A40%3A07

...and it has turned green since the alternative outputs fix went in.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=termite=2018-04-07%2004%3A06%3A09

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread Amit Langote
Hi David.

On 2018/04/05 22:41, David Rowley wrote:
>> * make_partition_pruneinfo has a parameter resultRelations that's not used
>> anywhere
> 
> It gets used in 0005.
> 
> I guess I could only add it in 0005, but make_partition_pruneinfo is
> only used in 0003, so you could say the same about that entire
> function.
> 
> Do you think I should delay adding that parameter until the 0005 patch?

Yes, I think so.

>> * In make_partition_pruneinfo()
>>
>> allsubnodeindex = palloc(sizeof(int) * root->simple_rel_array_size);
>> allsubpartindex = palloc(sizeof(int) * root->simple_rel_array_size);
>>
>> I think these arrays need to have root->simple_rel_array_size + 1
>> elements, because they're subscripted using RT indexes which start at 1.
> 
> RT indexes are always 1-based. See setup_simple_rel_arrays. It already
> sets the array size to list_length(rtable) + 1.

Oh, I missed that simple_rel_array_size itself is set to consider 1-based
RT indexes.

relnode.c:73
root->simple_rel_array_size = list_length(root->parse->rtable) + 1;

>> * Also in make_partition_pruneinfo()
>>
>> /* Initialize to -1 to indicate the rel was not found */
>> for (i = 0; i < root->simple_rel_array_size; i++)
>> {
>> allsubnodeindex[i] = -1;
>> allsubpartindex[i] = -1;
>> }
>>
>> Maybe, allocate the arrays above mentioned using palloc0 and don't do this
>> initialization.  Instead make the indexes that are stored in these start
>> with 1 and consider 0 as invalid entries.
> 
> 0 is a valid subplan index. It is possible to make this happen, but
> I'd need to subtract 1 everywhere I used the map. That does not seem
> very nice. Seems more likely to result in bugs where we might forget
> to do the - 1.

You can subtract 1 right here in make_partition_pruneinfo before setting
the values in PartitionPruneInfo's subplan_indexes and parent_indexes.
I'm only proposing to make make_partition_pruneinfo() a bit faster by not
looping over both local map arrays to set them to -1.

IOW, I'm not saying that we emit PartitionPruneInfo nodes that contain
1-based indexes.
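
To be explicit, the convention I'm suggesting looks like this
(illustrative fragment, not patch-ready code):

/* like palloc0: every slot starts out as 0, meaning "not set" */
allsubnodeindex = palloc0(sizeof(int) * root->simple_rel_array_size);

/* store 1-based while building the map */
allsubnodeindex[rti] = subplanindex + 1;

/* subtract 1 when emitting the 0-based value into the PartitionPruneInfo */
if (allsubnodeindex[rti] > 0)
    pinfo->subnodeindex[i] = allsubnodeindex[rti] - 1;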

> Did you want this because you'd rather have the palloc0() than the for
> loop setting the array elements to -1? Or is there another reason?

Yeah, that's it.

>> * Instead of nesting PartitionedRelPruning inside another, just store them
>> in a global flat array in the PartitionPruning, like PartitionTupleRouting
>> does for PartitionDispatch of individual partitioned tables in the tree.
>>
>> typedef struct PartitionedRelPruning
>> {
>> int nparts;
>> int*subnodeindex;
>> struct PartitionedRelPruning **subpartprune;
> 
> There is a flat array in PartitionPruning. subpartprune contains
> pointers into that array. I want to have this pointer array so I can
> directly reference the flat array in PartitionPruning.
> 
> Maybe I've misunderstood what you mean here.

I think we can save some space by not storing the pointers here.  Instead
of passing the pointer itself in the recursive calls to
find_subplans_for_extparams_recurse, et al, just pass the entire array and
an offset to use for the given recursive invocation.

If you look at ExecFindPartition used for tuple routing, you may see that
there is no recursion at all.  Similarly, find_subplans_for_extparams_recurse,
et al might be able to avoid recursion if similar tricks are used.
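
For reference, the trick in ExecFindPartition-style code looks roughly
like this (heavily simplified; the helper name is made up):

/* walk down the dispatch array iteratively instead of recursing */
parent = pd[0];
for (;;)
{
    int     idx = get_matching_bound_offset(parent, values, isnull);

    if (parent->indexes[idx] >= 0)
        break;      /* >= 0 means a leaf partition: done, no recursion */

    /* negative value encodes the position of the sub-PartitionDispatch */
    parent = pd[-parent->indexes[idx] - 1];
}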

Finally, about having two different functions for different sets of
Params: can't we have just one, named find_subplans_for_params_recurse(),
and make it behave appropriately based on the value of some parameter?  I
can't help but notice the duplication of code.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-05 Thread Amit Langote
Hi.

On 2018/04/06 7:35, Alvaro Herrera wrote:
> It seems pretty clear that putting get_matching_partitions() in
> catalog/partition.c is totally the wrong thing; it belongs wholly in
> partprune. I think the reason you put it there is that it requires
> access to a lot of internals that are static in partition.c.  In the
> attached not yet cleaned version of the patch, I have moved a whole lot
> of what you added to partition.c to partprune.c; and for the functions
> and struct declarations that were required to make it work, I created
> catalog/partition_internal.h.

Yes, I really wanted most of the new code that this patch adds to land
in the planner, especially after Robert's comments here:

https://www.postgresql.org/message-id/CA%2BTgmoabi-29Vs8H0xkjtYB%3DcU%2BGVCrNwPz7okpa3KsoLmdEUQ%40mail.gmail.com

It would've been nice if we'd gotten the "reorganizing partitioning code"
thread resolved sooner.

> I changed a lot of code also, but cosmetic changes only.
> 
> I'll clean this up a bit more now, and try to commit shortly (or early
> tomorrow); wanted to share current status now in case I have to rush
> out.

Some comments on the code reorganizing part of the patch:

* Did you intentionally not put PartitionBoundInfoData and its accessor
macros in partition_internal.h?  partprune.c would not need to include
partition.h if we did that.

* Also, I wonder why you left PartitionPruneContext in partition.h.  Isn't
it better taken out to partprune.h?

I have done that in the attached.

* Why isn't gen_partprune_steps() in partprune.h?  I see only
prune_append_rel_partitions() exported out of partprune.c, but the runtime
patch needs gen_partprune_steps() to be called from createplan.c.

* I don't see get_matching_partitions() exported either.  Runtime pruning
patch needs that too.

Maybe you've thought something about these two items though.

Thanks,
Amit
diff --git a/src/backend/catalog/partition.c b/src/backend/catalog/partition.c
index d5e91b111d..f89c99f544 100644
--- a/src/backend/catalog/partition.c
+++ b/src/backend/catalog/partition.c
@@ -24,7 +24,6 @@
 #include "catalog/indexing.h"
 #include "catalog/objectaddress.h"
 #include "catalog/partition.h"
-#include "catalog/partition_internal.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_inherits.h"
 #include "catalog/pg_inherits_fn.h"
diff --git a/src/backend/optimizer/util/partprune.c b/src/backend/optimizer/util/partprune.c
index b0638d5aa6..93553b5d13 100644
--- a/src/backend/optimizer/util/partprune.c
+++ b/src/backend/optimizer/util/partprune.c
@@ -19,8 +19,6 @@
 
 #include "access/hash.h"
 #include "access/nbtree.h"
-#include "catalog/partition.h"
-#include "catalog/partition_internal.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_opfamily.h"
 #include "catalog/pg_type.h"
@@ -35,6 +33,7 @@
 #include "parser/parse_coerce.h"
 #include "parser/parsetree.h"
 #include "rewrite/rewriteManip.h"
+#include "utils/array.h"
 #include "utils/lsyscache.h"
 
 /*
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 0bcaa36165..1c17553917 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -13,7 +13,7 @@
 #ifndef PARTITION_H
 #define PARTITION_H
 
-#include "fmgr.h"
+#include "catalog/partition_internal.h"
 #include "executor/tuptable.h"
 #include "nodes/execnodes.h"
 #include "parser/parse_node.h"
@@ -23,65 +23,6 @@
 #define HASH_PARTITION_SEED UINT64CONST(0x7A5B22367996DCFD)
 
 /*
- * PartitionBoundInfoData encapsulates a set of partition bounds. It is
- * usually associated with partitioned tables as part of its partition
- * descriptor, but may also be used to represent a virtual partitioned
- * table such as a partitioned joinrel within the planner.
- *
- * A list partition datum that is known to be NULL is never put into the
- * datums array. Instead, it is tracked using the null_index field.
- *
- * In the case of range partitioning, ndatums will typically be far less than
- * 2 * nparts, because a partition's upper bound and the next partition's lower
- * bound are the same in most common cases, and we only store one of them (the
- * upper bound).  In case of hash partitioning, ndatums will be same as the
- * number of partitions.
- *
- * For range and list partitioned tables, datums is an array of datum-tuples
- * with key->partnatts datums each.  For hash partitioned tables, it is an array
- * of datum-tuples with 2 datums, modulus and remainder, corresponding to a
- * given partition.
- *
- * The datums in datums array are arranged in increasing order as defined by
- * functions qsort_partition_rbound_cmp(), qsort_partition_list_value_cmp() and
- * qsort_partition_hbound_cmp() for range, list and hash partitioned tables
- * respectively. For range and list partitions this simply means that the
- * datums in the datums array are arranged in increasing order as defined by
- * the partition key's operator classes and collations.
- *
- * In the case of list 

Re: [HACKERS] Runtime Partition Pruning

2018-04-05 Thread Amit Langote
Hi David,

On 2018/04/06 12:27, David Rowley wrote:
> (sending my reply in parts for concurrency)
> 
> On 6 April 2018 at 14:39, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> I think we can save some space here by not having the pointers stored
>> here.  Instead of passing the pointer itself in the recursive calls to
>> find_subplans_for_extparams_recurse, et al, just pass the entire array and
>> an offset to use for the given recursive invocation.
> 
> hmm, where would those offsets be stored?  I don't want to have to do
> any linear searching to determine the offset, which is why I just
> stored the pointer to the flat array. It seems very efficient to me to
> do this. Remember that this pruning can occur per tuple in cases like
> parameterized nested loops.
> 
> Are you worried about memory consumption? It's one pointer per
> partition. I imagine there's lots more allocated for DML on a
> partitioned table as it needs to store maps to map attribute numbers.
> 
> Or are you thinking the saving of storing an array of 32-bit int
> values is better than the array of probably 64-bit pointers? So
> requires half the space?

Yeah, just copy it from the PartitionPruneInfo like you're doing for
subnodeindex:

   memcpy(partrelprune->subpartindex, pinfo->subpartindex,
  sizeof(int) * pinfo->nparts);

Instead I see ExecSetupPartitionPruning is now doing this:

  /*
   * Setup the PartitionedRelPruning's subpartprune so that we can
   * quickly find sub-PartitionedRelPruning details for any
   * sub-partitioned tables that this partitioned table contains.
   * We need to be able to find these quickly during our recursive
   * search to find all matching subnodes.
   */
   for (j = 0; j < pinfo->nparts; j++)
   {
   int subpartidx = pinfo->subpartindex[j];

   Assert(subpartidx < list_length(partitionpruneinfo));

   if (subpartidx >= 0)
   partrelprune->subpartprune[j] = &partrelprunes[subpartidx];
   else
   partrelprune->subpartprune[j] = NULL;
   }


With that in place, pass the index/offset instead of the pointer to the
next recursive invocation of find_subplans_*, along with the array
containing all PartitionedRelPruning's.

So, where you have in each of find_subplans_*:

  if (partrelprune->subnodeindex[i] >= 0)
  *validsubplans = bms_add_member(*validsubplans,
  partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartprune[i] != NULL)
  find_subplans_for_allparams_recurse(partrelprune->subpartprune[i],
  validsubplans);

I'm proposing that you do:

  if (partrelprune->subnodeindex[i] >= 0)
  *validsubplans = bms_add_member(*validsubplans,
  partrelprune->subnodeindex[i]);
  else if (partrelprune->subpartindex[i] >= 0)
  find_subplans_for_allparams_recurse(all_partrelprunes,
  partrelprune->subpartindex[i],
  validsubplans);

And at the top of each of find_subplans_*:

  PartitionedRelPruning *partrelprune = all_partrelprunes[offset];

where the signature is:

  static void
  find_subplans_for_allparams_recurse(
 PartitionedRelPruning **all_partrelprune,
 int offset,
 Bitmapset **validsubplans)

The all_partrelprune above refers to the flat array in PartitionPruning.
On the first call from ExecFindMatchingSubPlans, et al, you'd pass 0 for
offset to start pruning with the root parent's PartitionedRelPruning.  All
the values contained in subnodeindex and subpartindex are indexes into the
global (whole-tree) array anyway, and that fact would be more apparent if
we used this code structure, imho.

>> If you look at ExecFindPartition used for tuple routing, you may see that
>> there is no recursion at all.  Similarly find_subplans_for_extparams_recurse,
>> et al might be able to avoid recursion if similar tricks are used.
> 
> That seems pretty different. That's looking for a single node in a
> tree, so it is just following a single path from the root; it never has
> to go back up a level and look down any other paths.
> 
> What we need for the find_subplans_for_extparams_recurse is to find
> all nodes in the entire tree which match the given clause. Doing this
> without recursion would require some sort of stack so we can go back
> up a level and search again down another branch.  There are ways
> around this without using recursion, sure, but I don't think any of
> them will be quite as convenient and simple. The best I can think of
> is to palloc some stack manually and use some depth_level to track
> which element to use.  An actual stack seems more simple. I can't

Re: Partitioned tables and covering indexes

2018-04-10 Thread Amit Langote
On 2018/04/10 16:07, Jaime Casanova wrote:
> Hi,
> 
> Trying covering indexes on partitioned tables i get this error
> """
> postgres=# create index on t1_part (i) include (t);
> ERROR:  cache lookup failed for opclass 0
> """
> 
> To reproduce:
> 
> create table t1_part (i int, t text) partition by hash (i);
> create table t1_part_0 partition of t1_part for values with (modulus
> 2, remainder 0);
> create table t1_part_1 partition of t1_part for values with (modulus
> 2, remainder 1);
> insert into t1_part values (1, repeat('abcdefquerty', 20));
> 
> create index on t1_part (i) include (t);

It seems that the bug is caused by the original IndexStmt that
DefineIndex receives being overwritten when processing the INCLUDE
columns.  Also, the original copy of it should be used when recursing to
define the index on partitions.

Does the attached fix look correct?  Haven't checked the fix with ATTACH
PARTITION though.

Maybe add this to the open items list?

Thanks,
Amit
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 860a60d109..109d48bb2f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -366,6 +366,7 @@ DefineIndex(Oid relationId,
	LOCKMODE	lockmode;
	Snapshot	snapshot;
int i;
+   IndexStmt  *orig_stmt = copyObject(stmt);
 
	if (list_intersection(stmt->indexParams, stmt->indexIncludingParams) != NIL)
ereport(ERROR,
@@ -886,8 +887,8 @@ DefineIndex(Oid relationId,
memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
 
parentDesc = CreateTupleDescCopy(RelationGetDescr(rel));
-   opfamOids = palloc(sizeof(Oid) * numberOfAttributes);
-   for (i = 0; i < numberOfAttributes; i++)
+   opfamOids = palloc(sizeof(Oid) * numberOfKeyAttributes);
+   for (i = 0; i < numberOfKeyAttributes; i++)
				opfamOids[i] = get_opclass_family(classObjectId[i]);
 
heap_close(rel, NoLock);
@@ -987,11 +988,11 @@ DefineIndex(Oid relationId,
 */
if (!found)
{
-			IndexStmt  *childStmt = copyObject(stmt);
+			IndexStmt  *childStmt = copyObject(orig_stmt);
 			bool		found_whole_row;
 
 			childStmt->whereClause =
-				map_variable_attnos(stmt->whereClause, 1, 0,
+				map_variable_attnos(orig_stmt->whereClause, 1, 0,
 									attmap, maplen,
 									InvalidOid, &found_whole_row);
if (found_whole_row)


Re: [HACKERS] path toward faster partition pruning

2018-04-10 Thread Amit Langote
On 2018/04/10 13:27, Ashutosh Bapat wrote:
> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas  wrote:
>> On Fri, Apr 6, 2018 at 11:41 PM, Tom Lane  wrote:
>>> David Rowley  writes:
 Sounds like you're saying that if we have too many alternative files
 then there's a chance that one could pass by luck.
>>>
>>> Yeah, exactly: it passed, but did it pass for the right reason?
>>>
>>> If there's just two expected-files, it's likely not a big problem,
>>> but if you have a bunch it's something to worry about.
>>>
>>> I'm also wondering how come we had hash partitioning before and
>>> did not have this sort of problem.  Is it just that we added a
>>> new test that's more sensitive to the details of the hashing
>>> (if so, could it be made less so)?  Or is there actually more
>>> platform dependence now than before (and if so, why is that)?
>>
>> The existing hash partitioning tests did have some dependencies on the
>> hash function, but they took care not to use the built-in hash
>> functions.  Instead they did stuff like this:
>>
>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>> CREATE TABLE mchash (a int, b text, c jsonb)
>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> I think that this approach should also be used for the new tests.
>> Variant expected output files are a pain to maintain, and you
>> basically just have to take whatever output you get as the right
>> answer, because nobody knows what output a certain built-in hash
>> function should produce for a given input except by running the code.
>> If you do the kind of thing shown above, though, then you can easily
>> see by inspection that you're getting the right answer.

Thanks for the idea.  I think it makes sense, and I also agree that the
alternate-outputs approach is not perfectly reliable or maintainable.

> +1.

Attached find a patch that rewrites the hash partition pruning tests that
way.  It creates two hash operator classes, one for the int4 type and
another for the text type, and uses them to create the hash partitioned
table used in the tests, as done in the existing tests in hash_part.sql.
Since that (hopefully) makes the tests always return the same result, I no
longer see the need to keep them in a separate partition_prune_hash.sql.
The reasoning behind having the separate file was to keep the alternative
output file small, as David explained in [1].

However, I noticed that there is a bug in RelationBuildPartitionKey that
causes a crash when using a SQL function as a partition support function,
as the revised tests do, so please refer to and apply the patches I posted
here before running the revised tests:

https://www.postgresql.org/message-id/3041e853-b1dd-a0c6-ff21-7cc5633bffd0%40lab.ntt.co.jp

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
From c1508fc715a7783108f626c67c76fcc1f2303719 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 10 Apr 2018 16:06:33 +0900
Subject: [PATCH v1] Rewrite hash partition pruning tests to use custom opclass

Relying on platform-provided hashing functions makes tests unreliable
as shown by buildfarm recently.

This adds adjusted tests to partition_prune.sql itself and hence
partition_prune_hash.sql is deleted along with two expected output
files.

Discussion: 
https://postgr.es/m/CA%2BTgmoZ0D5kJbt8eKXtvVdvTcGGWn6ehWCRSZbWytD-uzH92mQ%40mail.gmail.com
---
 src/test/regress/expected/partition_prune.out  | 202 -
 src/test/regress/expected/partition_prune_hash.out | 189 ---
 .../regress/expected/partition_prune_hash_1.out| 187 ---
 src/test/regress/parallel_schedule |   2 +-
 src/test/regress/serial_schedule   |   1 -
 src/test/regress/sql/partition_prune.sql   |  59 +-
 src/test/regress/sql/partition_prune_hash.sql  |  41 -
 7 files changed, 259 insertions(+), 422 deletions(-)
 delete mode 100644 src/test/regress/expected/partition_prune_hash.out
 delete mode 100644 src/test/regress/expected/partition_prune_hash_1.out
 delete mode 100644 src/test/regress/sql/partition_prune_hash.sql

diff --git a/src/test/regress/expected/partition_prune.out 
b/src/test/regress/expected/partition_prune.out
index df3fca025e..935e7dc79b 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1330,7 +1330,207 @@ explain (costs off) select * from rparted_by_int2 where 
a > 100;
  Filter: (a > '100'::bigint)
 (3 rows)
 
-drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp, 
coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;

crash with sql language partition support function

2018-04-10 Thread Amit Langote
Hi.

I noticed that RelationBuildPartitionKey() is not doing the right thing
with respect to which context it passes to fmgr.c to set the (cached)
partition support function's FmgrInfo's fn_mcxt.

I noticed a crash while trying to change the partition pruning tests to
use a manually created hash operator class, per [1].

CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);

CREATE OR REPLACE FUNCTION hashtext_length(text, int8) RETURNS int8 AS
$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE OPERATOR CLASS test_text_ops FOR TYPE text USING HASH AS
OPERATOR 1 = , FUNCTION 2 hashtext_length(text, int8);

create table hp (a int, b text) partition by hash (a test_int4_ops, b
test_text_ops);
create table hp0 partition of hp for values with (modulus 4, remainder 0);
create table hp3 partition of hp for values with (modulus 4, remainder 3);
create table hp1 partition of hp for values with (modulus 4, remainder 1);
create table hp2 partition of hp for values with (modulus 4, remainder 2);

insert into hp values (1, 'abcde');
INSERT 0 1
Time: 13.604 ms
postgres=# insert into hp values (null, 'abcde');
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Attached fixes it.  It teaches RelationBuildPartitionKey() to use
fmgr_info_cxt and pass rd_partkeycxt to it.

Since this bug also exists in the released PG 10 branch, I also created a
patch for that.  It's slightly different from the one for PG 11dev,
because there were some recent changes to how the memory context is
manipulated in RelationBuildPartitionKey.  That's
v1-PG10-0001-Fix-a-memory-context-bug-in-RelationBuildPartitio.patch.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoZ0D5kJbt8eKXtvVdvTcGGWn6ehWCRSZbWytD-uzH92mQ%40mail.gmail.com
From 81d8337760be175b37b5ffe84df06903173787c5 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 10 Apr 2018 15:38:19 +0900
Subject: [PATCH v1] Fix a memory context bug in RelationBuildPartitionKey

We should pass rd_partkeycxt to set fn_mcxt, not CurrentMemoryContext.
---
 src/backend/utils/cache/relcache.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index a69b078f91..be7d5b7eef 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -849,6 +849,10 @@ RelationBuildPartitionKey(Relation relation)
if (!HeapTupleIsValid(tuple))
return;
 
+	partkeycxt = AllocSetContextCreate(CacheMemoryContext,
+									   RelationGetRelationName(relation),
+									   ALLOCSET_SMALL_SIZES);
+
key = (PartitionKey) palloc0(sizeof(PartitionKeyData));
 
/* Fixed-length attributes */
@@ -951,7 +955,7 @@ RelationBuildPartitionKey(Relation relation)
 										 BTORDER_PROC, opclassform->opcintype,
 										 opclassform->opcintype,
 										 opclassform->opcfamily);
 
-		fmgr_info(funcid, &key->partsupfunc[i]);
+		fmgr_info_cxt(funcid, &key->partsupfunc[i], partkeycxt);
 
/* Collation */
key->partcollation[i] = collation->values[i];
@@ -985,9 +989,6 @@ RelationBuildPartitionKey(Relation relation)
ReleaseSysCache(tuple);
 
/* Success --- now copy to the cache memory */
-	partkeycxt = AllocSetContextCreate(CacheMemoryContext,
-									   RelationGetRelationName(relation),
-									   ALLOCSET_SMALL_SIZES);
relation->rd_partkeycxt = partkeycxt;
oldcxt = MemoryContextSwitchTo(relation->rd_partkeycxt);
relation->rd_partkey = copy_partition_key(key);
-- 
2.11.0

From 57056b1ca511ef549ee3f28a8e17d2bdfbf757a5 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 10 Apr 2018 15:38:19 +0900
Subject: [PATCH v1] Fix a memory context bug in RelationBuildPartitionKey

We should pass rd_partkeycxt to set fn_mcxt, not CurrentMemoryContext.
---
 src/backend/utils/cache/relcache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/cache/relcache.c 
b/src/backend/utils/cache/relcache.c
index e81c4691ec..2a74601fc1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1034,7 +1034,7 @@ RelationBuildPartitionKey(Relation relation)
procnum,
  

Re: crash with sql language partition support function

2018-04-10 Thread Amit Langote
I have added this to the Older Bugs section of the open items page.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Older_Bugs

Thanks,
Amit




Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
Horiguchi-san,

Thanks for working on this.

On 2018/04/11 13:20, Kyotaro HORIGUCHI wrote:
> At Wed, 11 Apr 2018 11:27:17 +0900, Amit Langote wrote:
>> On 2018/04/11 10:44, Tom Lane wrote:
>>> Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> writes:
>>>> At least partition bound *must* be a constant. Any expression
>>>> that can be reduced to a constant at parse time ought to be
>>>> accepted but must not be accepted if not.
>>>
>>> My point is that *any* expression can be reduced to a constant,
>>> we just have to do so.
> 
> Agreed in that sense. What was in my mind was something like
> column reference, random() falls into reducible category of
> course.
> 
> # I regard the change in gram.y is regarded as acceptable as a
> # direction, so I'll continue to working on this.

I haven't reviewed the grammar changes in detail yet...

>> Currently transformPartitionBoundValue() applies eval_const_expressions()
>> by way of calling expression_planner().  However passing to it, say, an
>> expression representing random() is unable to reduce it to a Const because
>> simplify_function/evaluate_function won't compute a mutable function like
>> random().  So, that currently results in an error like this (note that
>> this is after applying Horiguchi-san's latest patch that enables
>> specifying random() as a partition bound in the first place):
>>
>> create table foo_part partition of foo for values in ((random())::int);
>> ERROR:  specified value cannot be cast to type integer for column "a"
>> LINE 1: ...table foo_random partition of foo for values in ((random()):...
>>  ^
>> DETAIL:  The cast requires a non-immutable conversion.
>> HINT:  Try putting the literal value in single quotes.
>>
>> The error is output after the following if check in
>> transformPartitionBoundValue fails:
>>
>> /* Fail if we don't have a constant (i.e., non-immutable coercion) */
>> if (!IsA(value, Const))
>>
>> I think what Tom is proposing here, instead of bailing out upon
>> eval_const_expressions() failing to simplify the input expression to a
>> Const, is to *invoke the executor* to evaluate the expression, like the
>> optimizer does in evaluate_expr, and cook up a Const with whatever comes
>> out to store it into the catalog (that is, in relpartbound).
> 
> Yes. In the attached I used evaluate_expr by making it non-static
> function. a_expr used instead of partbound_datum is changed to
> u_expr, which is the same with range bounds.
> 
>> =# create table c1 partition of p for values in (random() * 100);
>> CREATE TABLE
>> =# \d c1
> ...
>> Partition of: p FOR VALUES IN (97)

I looked at the non-gram.y portions of the patch for now as I was also
playing with this.  Some comments on your patch:

* You missed adding a break here for the EXPR_KIND_PARTITION_EXPRESSION case

 case EXPR_KIND_PARTITION_EXPRESSION:
 err = _("window functions are not allowed in partition key
expressions");
+case EXPR_KIND_PARTITION_BOUNDS:
+err = _("window functions are not allowed in partition bounds");
 break;

So, the following is the wrong error message that you probably failed to
notice:

--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -308,7 +308,7 @@ CREATE TABLE partitioned (
 a int,
 b int
 ) PARTITION BY RANGE ((avg(a) OVER (PARTITION BY b)));
-ERROR:  window functions are not allowed in partition key expressions
+ERROR:  window functions are not allowed in partition bounds

* I think the new ParseExprKind you added should omit the last "S", that
is, name it EXPR_KIND_PARTITION_BOUND, because these are expressions to
represent individual bound values.  And so adjust the comment to say "bound".

 +EXPR_KIND_PARTITION_BOUNDS, /* partition bounds value */

* When looking at the changes to transformPartitionBoundValue, I noticed
that there is no new argument Oid colTypColl:

 static Const *
-transformPartitionBoundValue(ParseState *pstate, A_Const *con,
+transformPartitionBoundValue(ParseState *pstate, Node *val,
 							 const char *colName, Oid colType, int32 colTypmod)

That's because you seem to pass the expression's own type, typmod, and
typcoll to the newly added call to evaluate_expr.  I wonder if we should
really pass the partition key's type, typmod, and collation here.  We
already receive the first two from the caller.

* In the changes to create_table.out

@@ -450,13 +450,9 @@ CREATE TABLE part_1 PARTITION OF list_parted FOR
VALUES IN ('1');
 CREATE TABLE part_2 PARTITION OF list_parted FOR

Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
On 2018/04/11 13:39, David Rowley wrote:
> On 11 April 2018 at 05:22, Tom Lane  wrote:
>> David Rowley  writes:
>>> On 11 April 2018 at 03:34, Tom Lane  wrote:
 Well, that just begs the question: why do these expressions need to
 be immutable?  What we really want, I think, is to evaluate them
 and reduce them to constants.  After that, it hardly matters whether
 the original expression was volatile.  I see absolutely no moral
 difference between "for values in (random())" and cases like
 this, which works today:
>>
>>> I'd personally be pretty surprised if this worked.
>>
>> Well, my point is that we're *already* behaving that way in some cases,
>> simply because of the existence of macro-like inputs for some of these
>> datatypes.  I'm not sure that users are going to perceive a big difference
>> between 'now'::timestamptz and now(), for example.  If we take one but
>> not the other, I don't think anybody will see that as a feature.
> 
> To me, it seems a bit inconsistent to treat 'now'::timestamp and now()
> the same way.
> 
> I found this in the 7.4 release notes [1]:
> 
> "String literals specifying time-varying date/time values, such as
> 'now' or 'today' will no longer work as expected in column default
> expressions; they now cause the time of the table creation to be the
> default, not the time of the insertion. Functions such as now(),
> current_timestamp, or current_date should be used instead.
> 
> In previous releases, there was special code so that strings such as
> 'now' were interpreted at INSERT time and not at table creation time,
> but this work around didn't cover all cases. Release 7.4 now requires
> that defaults be defined properly using functions such as now() or
> current_timestamp. These will work in all situations."
> 
> So isn't 'now' being different from now() in DDL something users
> should be quite used to by now?
> 
> I've got to admit, I'm somewhat less concerned about evaluating
> volatile functions in this case because you don't seem that concerned,
> but if you happen to be wrong, then it's going to be a much harder
> thing to fix.  Really, is anyone going to complain if we don't
> evaluate these and reject them with an error instead? It seems like a
> safer option to me, also less work, and we're probably less likely to
> regret it.

As someone said upthread, we should just document that we *always*
evaluate the expression specified for a partition bound during create
table, and not at some other time.  That seems easier than figuring out
what to say in the error message; saying "cannot use non-immutable
expression for partition bound" is likely to confuse a user even more by
introducing ambiguity about when partition bounds are evaluated.  Most
users would expect it to be create table time anyway.

> We also need to decide what of this we can backpatch to PG10 to fix
> [2].  Ideally what goes into PG10 and PG11 would be the same, so
> perhaps that's another reason to keep it more simple.

Backpatch all of it?  The newly introduced syntax and evaluation semantics
do not break inputs that PG 10 allows.  But I may be missing something.

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-10 Thread Amit Langote
Thanks for the comment.

On 2018/04/10 21:11, Ashutosh Bapat wrote:
> On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> Apart from that confusion, looking at the patch:
>>
>> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
>> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
>> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
>> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
>> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>>
>>
>> Why coalesce here? Maybe I've not thought of something, but coalesce
>> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing anything even if you added a default.
> 
> I think Amit Langote wanted to write coalesce($1, $2), $2 being the
> seed for hash function. See how hash operator class functions are
> defined in sql/insert.sql.

Actually, I referenced the functions and operator classes defined in
hash_part.sql, not insert.sql.  Although, as you point out, I didn't think
very hard about the significance of $2 passed to coalesce in those
functions.  I will fix that and add it back, along with some other changes
that make them almost identical to the definitions in hash_part.sql.

> May be we should just use the same
> functions or even the same tables.

Because hash_part.sql and partition_prune.sql tests run in parallel, I've
decided to rename the functions, operator classes, and the tables in
partition_prune.sql.  It seems like a good idea in any case.  Also, since
the existing pruning tests were written with that table, I decided not to
change that.

Will post an updated patch after addressing David's comment.

Regards,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-11 Thread Amit Langote
Thanks for the review.

On 2018/04/10 21:02, David Rowley wrote:
> On 10 April 2018 at 20:56, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> On 2018/04/10 13:27, Ashutosh Bapat wrote:
>>> On Mon, Apr 9, 2018 at 8:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>>> CREATE OR REPLACE FUNCTION hashint4_noop(int4, int8) RETURNS int8 AS
>>>> $$SELECT coalesce($1,0)::int8$$ LANGUAGE sql IMMUTABLE;
>>>> CREATE OPERATOR CLASS test_int4_ops FOR TYPE int4 USING HASH AS
>>>> OPERATOR 1 = , FUNCTION 2 hashint4_noop(int4, int8);
>>>> CREATE TABLE mchash (a int, b text, c jsonb)
>>>>   PARTITION BY HASH (a test_int4_ops, b test_text_ops);
>>
>> Thanks for the idea.  I think it makes sense and also agree that alternate
>> outputs approach is not perfectly reliable and maintainable.
>>
>>> +1.
>>
>> Attached find a patch that rewrites hash partition pruning tests that
>> away.  It creates two hash operator classes, one for int4 and another for
>> text type and uses them to create hash partitioned table to be used in the
>> tests, like done in the existing tests in hash_part.sql.  Since that makes
>> tests (hopefully) reliably return the same result always, I no longer see
>> the need to keep them in a separate partition_prune_hash.sql.  The
>> reasoning behind having the separate file was to keep the alternative
>> output file small as David explained in [1].
>> [1]
>> https://www.postgresql.org/message-id/CAKJS1f-SON_hAekqoV4_WQwJBtJ_rvvSe68jRNhuYcXqQ8PoQg%40mail.gmail.com
> 
> I had a quick look, but I'm still confused about why a function like
> hash_uint32_extended() is susceptible to varying results depending on
> CPU endianness but hash_combine64 is not.

It might as well be the combination of both that's sensitive to
endianness.  I too am not sure exactly which part.  They're both used in
succession in compute_hash_value:

/*
 * Compute hash for each datum value by calling respective
 * datatype-specific hash functions of each partition key
 * attribute.
 */
hash = FunctionCall2(&partsupfunc[i], values[i], seed);

/* Form a single 64-bit hash value */
rowHash = hash_combine64(rowHash, DatumGetUInt64(hash));

> Apart from that confusion, looking at the patch:
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
> 
> 
> Why coalesce here? Maybe I've not thought of something, but coalesce
> only seems useful to me if there's > 1 argument. Plus the function is
> strict, so not sure it's really doing anything even if you added a default.

After reading Ashutosh's comment, I realized I didn't really mean to add
the STRICT to those function definitions.  As these are not operators, but
support (hash) procedures, it's insignificant to the pruning code whether
they are STRICT or not, unlike clause operators where it is.

Also, I've adopted the coalesce-based hashing function from hash_part.sql,
albeit with unnecessary tweaks.  I've not read anywhere about why the
coalesce was used in the first place, but it's insignificant for our
purpose here anyway.

> I know this one was there before, but I only just noticed it:
> 
> +-- pruning should work if non-null values are provided for all the keys
> +explain (costs off) select * from hp where a is null and b is null;
> 
> The comment is a bit misleading given the first test below it is
> testing for nulls. Maybe it can be changed to
> 
> +-- pruning should work if values or is null clauses are provided for
> all partition keys.

I have adjusted the comments.

Updated patch attached.

Thanks,
Amit
From 5a6d00d4d9d6aa8bb84dc9699646ee5c4fa77719 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Tue, 10 Apr 2018 16:06:33 +0900
Subject: [PATCH v2] Rewrite hash partition pruning tests to use custom opclass

Relying on platform-provided hashing functions makes tests unreliable
as shown by buildfarm recently.

This adds adjusted tests to partition_prune.sql itself and hence
partition_prune_hash.sql is deleted along with two expected output
files.

Discussion: 
https://postgr.es/m/CA%2BTgmoZ0D5kJbt8eKXtvVdvTcGGWn6ehWCRSZbWytD-uzH92mQ%40mail.gmail.com
---
 src/test/regress/expected/partition_prune.out  | 201 +
 src/test/regress/expected/partition_prune_hash.out | 189 ---
 .../regress/expect

Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-05 Thread Amit Langote
Fuiita-san,

On 2018/04/05 15:56, Etsuro Fujita wrote:
> (2018/04/05 15:37), Amit Langote wrote:
>> I noticed that the 2nd patch (foreign-routing-fdwapi-5.patch) fails to
>> apply to copy.c:
> 
> I forgot to mention this: the second patch is created on top of the first
> patch (postgres-fdw-refactoring-5.patch) and the patch in [1] as before.

Ah, sorry I hadn't noticed that in your previous email.

Might be a good idea to attach the bug-fix patch here as well, and perhaps
add numbers to the file names like:

0001_postgres-fdw-refactoring-5.patch
0002_BUGFIX-copy-from-check-constraint-fix.patch
0003_foreign-routing-fdwapi-5.patch


Just one minor comment:

I wonder why you decided not to have the CheckValidResultRel() call and
the statement that sets ri_PartitionReadyForRouting inside the newly added
ExecInitRoutingInfo itself.  If ExecInitRoutingInfo performs the last
steps necessary for a ResultRelInfo (and hence the partition) to be ready
for routing, why not finish everything there?  So the changes to
ExecPrepareTupleRouting, which look like this in the patch:

+if (!partrel->ri_PartitionReadyForRouting)
+{
+CheckValidResultRel(partrel, CMD_INSERT);
+
+/* Set up information needed for routing tuples to the partition */
+ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
+
+partrel->ri_PartitionReadyForRouting = true;
+}

will become:

+if (!partrel->ri_PartitionReadyForRouting)
+ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);


As I see no other issues, I will mark this as Ready for Committer.

Thanks,
Amit




Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-05 Thread Amit Langote
On 2018/04/05 16:31, Amit Langote wrote:
> Fuiita-san,

Oops, sorry about misspelling your name here, Fujita-san.

- Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-05 Thread Amit Langote
Hi.

On 2018/04/05 0:45, Jesper Pedersen wrote:
> Hi,
> 
> On 04/04/2018 09:29 AM, David Rowley wrote:
>> Thanks for updating. I've made a pass over v49 and I didn't find very
>> much wrong with it.
>>
>> The only real bug I found was a missing IsA(rinfo->clause, Const) in
>> the pseudoconstant check inside
>> generate_partition_pruning_steps_internal.

Fixed.

>> Most of the changes are comment fixes with a few stylistic changes
>> thrown which are pretty much all there just to try to shrink the code
>> a line or two or reduce indentation.
>>
>> I feel pretty familiar with this code now and assuming the attached is
>> included I'm happy for someone else, hopefully, a committer to take a
>> look at it.

Thank you, your changes look good to me.

>> I'll leave the following notes:
>>
>> 1. Still not sure about RelOptInfo->has_default_part. This flag is
>> only looked at in generate_partition_pruning_steps. The RelOptInfo and
>> the boundinfo is available to look at, it's just that the
>> partition_bound_has_default macro is defined in partition.c rather
>> than partition.h.

Hmm, it might not be such a bad idea to bring PartitionBoundInfoData out
into partition.h.  If we do that, we won't need the has_default_part that
the patch adds to RelOptInfo.

In the attached v50 set, 0002 does that.

>> 2. Don't really like the new isopne variable name. It's not very
>> simple to decode, perhaps something like is_not_eq is better?

isopne does sound a bit unintelligible.  I propose op_is_ne so that it
sounds consistent with the preceding member of the struct that's called
opno.  I want to keep "ne" and not start calling it not_eq, as a few other
places use the string "ne" to refer to a similar thing, like:

/* inequality */
Datum
range_ne(PG_FUNCTION_ARGS)

Datum
timestamptz_ne_date(PG_FUNCTION_ARGS)

Since the field is local to partprune.c, I guess that it's fine as the
comment where it's defined tells what it is.

>> 3. The part of the code I'm least familiar with is
>> get_steps_using_prefix_recurse(). I admit to not having had time to
>> fully understand that and consider ways to break it.

The purpose of that code is to generate *all* needed steps to be combined
using COMBINE_INTERSECT such that the pruning will occur using the most
restrictive set of clauses in cases where the same key is referenced in
multiple restriction clauses containing non-equality operators.  So, for a
range partitioned table on (a, b):

For a query like

explain select * from foo where a <= 1 and a <= 3 and b < 5 and b <= 10

Pruning steps generated to be combined with an enclosing INTERSECT step
will be as follows:

<= (1, 10)
<  (1, 5)
<= (3, 10)
<  (3, 5)
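
To see that in action, here is a quick sketch one could run (the table
definition is invented here just for illustration):

create table foo (a int, b int) partition by range (a, b);
create table foo1 partition of foo for values from (0, 0) to (2, 5);
create table foo2 partition of foo for values from (2, 5) to (4, 10);
-- the INTERSECT of the four steps is driven by the most restrictive
-- clauses (a <= 1 and b < 5), so only foo1 should be scanned
explain (costs off)
select * from foo where a <= 1 and a <= 3 and b < 5 and b <= 10;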

>> Marking as ready for committer.

Thank you!

> Passes check-world, and CommitFest app has been updated to reflect the
> current patch set. Trivial changes attached.

Merged these changes.  Thanks again Jesper.

Attached v50.

Thanks,
Amit




Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-05 Thread Amit Langote
Fujita-san,

On 2018/04/05 15:02, Etsuro Fujita wrote:
> (2018/04/04 19:31), Etsuro Fujita wrote:
>> Attached is an updated version of the patch set:
>> * As before, patch foreign-routing-fdwapi-4.patch is created on top of
>> patch postgres-fdw-refactoring-4.patch and the bug-fix patch [1].
> 
> I did a bit of cleanup and comment-rewording to patch
> foreign-routing-fdwapi-4.patch.  Attached is a new version of the patch
> set.  I renamed the postgres_fdw refactoring patch but no changes to that
> patch.

I noticed that the 2nd patch (foreign-routing-fdwapi-5.patch) fails to
apply to copy.c:

Hunk #9 FAILED at 2719.
Hunk #10 succeeded at 2752 (offset -1 lines).
Hunk #11 succeeded at 2795 (offset -1 lines).
Hunk #12 succeeded at 2843 (offset -1 lines).
1 out of 12 hunks FAILED -- saving rejects to file
src/backend/commands/copy.c.rej

cat src/backend/commands/copy.c.rej
*** src/backend/commands/copy.c
--- src/backend/commands/copy.c
***************
*** 2719,2727 ****
  
  					  resultRelInfo->ri_TrigDesc->trig_insert_before_row))
  					check_partition_constr = false;
  
! 				/* Check the constraints of the tuple */
! 				if (resultRelInfo->ri_RelationDesc->rd_att->constr ||
! 					check_partition_constr)
  					ExecConstraints(resultRelInfo, slot, estate, true);
  
  			if (useHeapMultiInsert)
--- 2730,2742 ----
  
  					  resultRelInfo->ri_TrigDesc->trig_insert_before_row))
  					check_partition_constr = false;
  
! 				/*
! 				 * If the target is a plain table, check the constraints of
! 				 * the tuple.
! 				 */
! 				if (resultRelInfo->ri_FdwRoutine == NULL &&
! 					(resultRelInfo->ri_RelationDesc->rd_att->constr ||
! 					 check_partition_constr))
  					ExecConstraints(resultRelInfo, slot, estate, true);
  
  			if (useHeapMultiInsert)

Can you check?

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-11 Thread Amit Langote
Hi David.

Thanks for the review.

On 2018/04/11 17:59, David Rowley wrote:
> On 11 April 2018 at 18:04, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>> Updated patch attached.
> 
> Thanks for the updated patch.
> 
> The only thing I'm not sure about is the chances you've made to the
> COALESCE function.
> 
> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
> +$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
> +$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;
> 
> Why does one default to the seed and the other to an empty string?
> Shouldn't they both do the same thing? If you were to copy the
> hash_part.sql you'd just coalesce($1, 0) and coalesce($1, ''), any
> special reason not to do that?

Oops, so I hadn't actually restored it to the way it is in hash_part.sql.

Also, Ashutosh was talking about the custom hashing function used in
insert.sql, not hash_part.sql, which I based my revision upon.

Fixed it now.

> Also just wondering if it's worth adding some verification that we've
> actually eliminated the correct partitions by backing the tests up
> with a call to satisfies_hash_partition.
> 
> I've attached a delta patch that applies to your v2 which does this.
> Do you think it's worth doing?

We can check by inspection that individual values are in the appropriate
partitions, which is the point of having the inserts and the select just
above the actual pruning-related tests.  So, I'm not sure that adding
satisfies_hash_partition to each pruning test adds much.
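
That said, what David's delta proposes would amount to something like the
following (my own sketch, not his actual patch, using the hp table from
the revised tests):

select tableoid::regclass as part, a, b,
       satisfies_hash_partition('hp'::regclass, 4, 0, a, b) as in_hp0
from hp;
-- rows stored in hp0 should show in_hp0 = true, all others false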

Attached revised patch.

Thanks,
Amit
From 4685448a7eb2eaf5feceea2206d136c135b2dea7 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Tue, 10 Apr 2018 16:06:33 +0900
Subject: [PATCH v3] Rewrite hash partition pruning tests to use custom opclass

Relying on platform-provided hashing functions makes tests unreliable
as shown by buildfarm recently.

This adds adjusted tests to partition_prune.sql itself and hence
partition_prune_hash.sql is deleted along with two expected output
files.

Discussion: 
https://postgr.es/m/CA%2BTgmoZ0D5kJbt8eKXtvVdvTcGGWn6ehWCRSZbWytD-uzH92mQ%40mail.gmail.com
---
 src/test/regress/expected/partition_prune.out  | 237 +
 src/test/regress/expected/partition_prune_hash.out | 189 
 .../regress/expected/partition_prune_hash_1.out| 187 
 src/test/regress/parallel_schedule |   2 +-
 src/test/regress/serial_schedule   |   1 -
 src/test/regress/sql/partition_prune.sql   |  70 +-
 src/test/regress/sql/partition_prune_hash.sql  |  41 
 7 files changed, 307 insertions(+), 420 deletions(-)
 delete mode 100644 src/test/regress/expected/partition_prune_hash.out
 delete mode 100644 src/test/regress/expected/partition_prune_hash_1.out
 delete mode 100644 src/test/regress/sql/partition_prune_hash.sql

diff --git a/src/test/regress/expected/partition_prune.out 
b/src/test/regress/expected/partition_prune.out
index df3fca025e..d13389b9c2 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -1332,6 +1332,243 @@ explain (costs off) select * from rparted_by_int2 where 
a > 100;
 
 drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp, 
coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;
 --
+-- Test Partition pruning for HASH partitioning
+-- We roll our own operator classes to use for tests, because depending on the
+-- platform-provided hashing functions makes tests unreliable
+--
+CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
+$$SELECT coalesce($1, 0)::int8$$ LANGUAGE sql IMMUTABLE;
+CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
+CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
+$$SELECT length(coalesce($1, ''))::int8$$ LANGUAGE sql IMMUTABLE;
+CREATE OPERATOR CLASS pp_test_text_ops FOR TYPE text USING HASH AS
+OPERATOR 1 = , FUNCTION 2 pp_hashtext_length(text, int8);
+create table hp (a int, b text) partition by hash (a pp_test_int4_ops, b pp_test_text_ops);
+create table hp0 partition of hp for values with (modulus 4, remainder 0);
+create table hp3 partition of hp for values with (modulus 4, remainder 3);
+create table hp1 partition of hp for values with (modulus 4, remainder 1);
+create table hp2 partition of hp for values with (modulus 4, remainder 2);
+insert into hp values (null, null);
+insert into hp values (1, null);
+insert into hp values (1, 'xxx');
+insert into hp values (null, 'xxx');
+insert into hp values (2, 

Re: partitioning code reorganization

2018-04-14 Thread Amit Langote
Hi.

Thanks for taking care of a few things I left, like those PartitionKey
accessors in rel.h.

On Sat, Apr 14, 2018 at 8:51 PM, Alvaro Herrera  wrote:
> Here's a final version.
>
> The one thing I don't like about this is having put
> PartitionRangeDatumKind in partdefs.h, which forces us to #include that
> file in parsenodes.h.  I had to do this in order to avoid #including
> parsenodes.h in partbounds.h.  Now maybe that is not so bad, since that
> file isn't *that* widely used anyway; it wouldn't cause any unnecessary
> bleeding of parsenodes.h into any other headers.  So maybe I'll put the
> enum back in parsenodes.  Any opinions on that?

I'm fine with keeping it where it was, that is, parsenodes.h.  I can
see that parsenodes.h is pretty heavily included in other headers
anyway.

Also, +1 to moving compute_hash_value() and satisfies_hash_partition()
to partbounds.c.

Thanks,
Amit



Re: partitioning code reorganization

2018-04-14 Thread Amit Langote
On Sat, Apr 14, 2018 at 11:48 PM, Amit Langote <amitlangot...@gmail.com> wrote:
> Hi.
>
> Thanks for taking care of few things I left like those PartitionKey
> accessors in rel.h.

Forgot to mention -- there are some files that still include
catalog/partition.h but no longer need to.  Find attached a delta patch
that applies on top of your v6.

Thanks,
Amit


v6_delta-adjust-partition-h-include.patch
Description: Binary data


Re: ON CONFLICT DO UPDATE for partitioned tables

2018-04-09 Thread Amit Langote
On 2018/03/27 13:27, Amit Langote wrote:
> On 2018/03/26 23:20, Alvaro Herrera wrote:
>> The one thing I wasn't terribly in love with is the four calls to
>> map_partition_varattnos(), creating the attribute map four times ... but
>> we already have it in the TupleConversionMap, no?  Looks like we could
>> save a bunch of work there.
> 
> Hmm, actually we can't use that map, assuming you're talking about the
> following map:
> 
>   TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
> 
> We can only use that to tell if we need converting expressions (as we
> currently do), but it cannot be used to actually convert the expressions.
> The map in question is for use by do_convert_tuple(), not to map varattnos
> in Vars using map_variable_attnos().
> 
> But it's definitely a bit undesirable to have various
> map_partition_varattnos() calls within ExecInitPartitionInfo() to
> initialize the same information (the map) multiple times.

I will try to think of doing something about this later this week.

>> And a final item is: can we have a whole-row expression in the clauses?
>> We currently don't handle those either, not even to throw an error.
>> [figures a test case] ... and now that I test it, it does crash!
>>
>> create table part (a int primary key, b text) partition by range (a);
>> create table part1 (b text, a int not null);
>> alter table part attach partition part1 for values from (1) to (1000);
>> insert into part values (1, 'two') on conflict (a)
>>   do update set b = format('%s (was %s)', excluded.b, part.b)
>>   where part.* *<> (1, text 'two');
>>
>> I think this means we should definitely handle found_whole_row.  (If you
>> create part1 in the normal way, it works as you'd expect.)
> 
> I agree.  That means we simply remove the Assert after the
> map_partition_varattnos call.
> 
>> I'm going to close a few other things first, then come back to this.
> 
> Attached find a patch to fix the whole-row expression issue.  I added your
> test to insert_conflict.sql.

Adding this to the open items list.

Thanks,
Amit




Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
On 2018/04/09 22:59, Alvaro Herrera wrote:
> Hello,
> 
> Amit Langote wrote:
> 
>> I have reproduced this and found that the problem is that
>> perform_pruning_combine_step forgets to *copy* the bitmapset of the first
>> step in the handling of an COMBINE_INTERSECT step.
> 
> Pushed, thanks Amit and Andreas!

Thanks!

Regards,
Amit




Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
Thanks for the comment.

On 2018/04/09 23:22, Tom Lane wrote:
> Amit Langote <langote_amit...@lab.ntt.co.jp> writes:
>> I noticed that the newly added pruning does not work if the partition key
>> is of one of the types that have a corresponding pseudo-type.
> 
> While I don't recall the details due to acute caffeine shortage,
> there are specific coding patterns that are used in the planner
> (e.g. in indxpath.c) to ensure that the code works with pseudotype
> opclasses as well as simple ones.  The existence of this bug
> indicates that those conventions were not followed in the pruning
> code.  I wonder whether this patch makes the pruning code look
> more like the pre-existing code, or even less like it.

Ah, I think I got it after staring at the (btree) index code for a bit.

What the pruning code got wrong is that it compares the expression type
(the type of the constant argument that will be compared with partition
bound datums when pruning) with the partopcintype to determine whether we
should look up the cross-type comparison/hashing procedure, whereas what
it should be compared with is the clause operator's oprighttype.  ISTM
that if op_in_opfamily() passed for the operator, that's the correct thing
to do.

Once I changed the code to do it that way, no special consideration is
needed to handle partition keys of types that have corresponding
pseudo-types.
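
To make that concrete, here is roughly the kind of case the updated tests
cover (sketched from memory rather than copied from the patch):

create type colors as enum ('red', 'green', 'blue');
create table ep (c colors) partition by list (c);
create table ep1 partition of ep for values in ('red');
create table ep2 partition of ep for values in ('green', 'blue');
-- the '=' operator here is enum_eq(anyenum, anyenum), so its oprighttype
-- (anyenum) matches partopcintype, even though the constant's own type
-- is the concrete enum type colors; with the fix, only ep1 is scanned
explain (costs off) select * from ep where c = 'red';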

Attached please find the updated patch.

Thanks,
Amit
From 1c136a321153cca04c0842807c2cd166e79f2556 Mon Sep 17 00:00:00 2001
From: amit <amitlangot...@gmail.com>
Date: Fri, 8 Dec 2017 19:09:31 +0900
Subject: [PATCH v2] Fix pruning code to determine comparison function
 correctly

It's unreliable to determine one using the constant expression's
type directly (for example, it doesn't work correctly when the
expression contains an array, enum, etc.).  Instead, use righttype
of the operator, the one that supposedly passed the op_in_opfamily
test using the partitioning opfamily.
---
 src/backend/partitioning/partprune.c  | 29 +---
 src/test/regress/expected/partition_prune.out | 98 +++
 src/test/regress/sql/partition_prune.sql  | 44 +++-
 3 files changed, 161 insertions(+), 10 deletions(-)

diff --git a/src/backend/partitioning/partprune.c 
b/src/backend/partitioning/partprune.c
index 7666c6c412..2655d2caa2 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1426,10 +1426,12 @@ match_clause_to_partition_key(RelOptInfo *rel,
OpExpr *opclause = (OpExpr *) clause;
Expr   *leftop,
   *rightop;
-   Oid commutator = InvalidOid,
+   Oid op_lefttype,
+   op_righttype,
+   commutator = InvalidOid,
negator = InvalidOid;
Oid cmpfn;
-   Oid exprtype;
+   int op_strategy;
		bool		is_opne_listp = false;
PartClauseInfo *partclause;
 
@@ -1517,10 +1519,20 @@ match_clause_to_partition_key(RelOptInfo *rel,
return PARTCLAUSE_UNSUPPORTED;
}
 
-		/* Check if we're going to need a cross-type comparison function. */
-		exprtype = exprType((Node *) expr);
-		if (exprtype != part_scheme->partopcintype[partkeyidx])
+		/*
+		 * Check if we're going to need a cross-type comparison function to
+		 * use during pruning.
+		 */
+		get_op_opfamily_properties(OidIsValid(negator)
+								   ? negator : opclause->opno,
+								   partopfamily, false,
+								   &op_strategy, &op_lefttype, &op_righttype);
+   /* Use the cached one if no cross-type comparison. */
+   if (op_righttype == part_scheme->partopcintype[partkeyidx])
+   cmpfn = part_scheme->partsupfunc[partkeyidx].fn_oid;
+   else
{
+   /* Otherwise, look the correct one up in the catalog. */
switch (part_scheme->strategy)
{
case PARTITION_STRATEGY_LIST:
@@ -1528,13 +1540,14 @@ match_clause_to_partition_key(RelOptInfo *rel,
cmpfn =

get_opfamily_proc(part_scheme->partopfamily[partkeyidx],

  part_scheme->partopcintype[partkeyidx],
-

Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-10 Thread Amit Langote
On 2018/04/10 13:55, Michael Paquier wrote:
> On Mon, Apr 09, 2018 at 10:59:48AM -0300, Alvaro Herrera wrote:
>> Amit Langote wrote:
>>> I have reproduced this and found that the problem is that
>>> perform_pruning_combine_step forgets to *copy* the bitmapset of the first
>>> step in the handling of an COMBINE_INTERSECT step.
>>
>> Pushed, thanks Amit and Andreas!
> 
> Álvaro, didn't you forget to actually push the patch?

I too thought the same yesterday, as I couldn't see the commit even about
35 minutes after Alvaro had sent the email to -hackers (quoted above).
But I found the commit moments later:
* Add missed bms_copy() in perform_pruning_combine_step *
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7ba6ee815dc90d4fab7226d343bf72aa28c9aa5c

The mail to pgsql-committers may also have been delayed, but I did see it
yesterday.

https://www.postgresql.org/message-id/E1f5XK3-0004Em-QZ%40gemulon.postgresql.org

Thanks,
Amit




pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
Hi.

I noticed that the newly added pruning does not work if the partition key
is of one of the types that have a corresponding pseudo-type.

-- array type list partition key
create table arrpart (a int[]) partition by list (a);
create table arrpart1 partition of arrpart for values in ('{1}');
create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
explain (costs off) select * from arrpart where a = '{1}';
   QUERY PLAN

 Append
   ->  Seq Scan on arrpart1
 Filter: (a = '{1}'::integer[])
   ->  Seq Scan on arrpart2
 Filter: (a = '{1}'::integer[])
(5 rows)

For pruning, we normally rely on the type's operator class information in
the system catalogs being up-to-date, and give up on pruning if it isn't.
For example, if the pg_amproc entry for a given type and AM type (btree,
hash, etc.) has not been populated, we may fail to prune using a clause
that contains an expression of the said type.  While this is the theory
for the normal cases, we should make an exception for types that have
corresponding pseudo-types.  For those types, we never have pg_amproc
entries with the "real" type listed.  Instead, the pg_amproc entries
contain the corresponding pseudo-type.  For example, there aren't
pg_amproc entries with int4[] (really, its OID) as amproclefttype and/or
amprocrighttype; instead, anyarray is listed there.
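
One can see this directly in the catalogs, for example (an illustrative
query, not part of the patch):

-- hash/btree support procedures for arrays are registered under the
-- anyarray pseudo-type, not under concrete types such as int4[]
select amproc, amproclefttype::regtype, amprocrighttype::regtype
from pg_amproc
where amproclefttype = 'anyarray'::regtype;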

Attached find a patch that tries to fix that and adds relevant tests.

Thanks,
Amit
From c7945da855973b606b5aa012295e2c0ae93c39c5 Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 8 Dec 2017 19:09:31 +0900
Subject: [PATCH v1] Fix pruning when partition key of array, enum, record type

We never have pg_amproc catalog entries with "real" array, enum, record,
range types as leftop and rightop types.  Instead, AM procedures
manipulating such types have entries with the corresponding "pseudo-types"
listed as leftop and rightop types.  For example, for enums, all
procedures entries are marked with anyenum as their leftop and rightop
types.  So, if we pass "real" type OIDs to get_opfamily_member() or
get_opfamily_proc(), we get back an InvalidOid for these type categories.
Whereas we'd normally give up on performing pruning in that case, don't
do that in this case.
---
 src/backend/partitioning/partprune.c  |  4 +-
 src/test/regress/expected/partition_prune.out | 98 +++
 src/test/regress/sql/partition_prune.sql  | 44 +++-
 3 files changed, 144 insertions(+), 2 deletions(-)

diff --git a/src/backend/partitioning/partprune.c 
b/src/backend/partitioning/partprune.c
index 417e1fee81..bd1b99102d 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -1519,7 +1519,9 @@ match_clause_to_partition_key(RelOptInfo *rel,
 
 		/* Check if we're going to need a cross-type comparison function. */
exprtype = exprType((Node *) expr);
-   if (exprtype != part_scheme->partopcintype[partkeyidx])
+   if (exprtype != part_scheme->partopcintype[partkeyidx] &&
+   get_typtype(part_scheme->partopcintype[partkeyidx]) !=
+   TYPTYPE_PSEUDO)
{
switch (part_scheme->strategy)
{
diff --git a/src/test/regress/expected/partition_prune.out 
b/src/test/regress/expected/partition_prune.out
index df3fca025e..69e679d930 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -2399,3 +2399,101 @@ select * from boolp where a = (select value from 
boolvalues where not value);
 
 drop table boolp;
 reset enable_indexonlyscan;
+--
+-- check that pruning works properly when the partition key is of array, enum,
+-- or record type
+--
+-- array type list partition key
+create table arrpart (a int[]) partition by list (a);
+create table arrpart1 partition of arrpart for values in ('{1}');
+create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
+explain (costs off) select * from arrpart where a = '{1}';
+   QUERY PLAN   
+
+ Append
+   ->  Seq Scan on arrpart1
+ Filter: (a = '{1}'::integer[])
+(3 rows)
+
+explain (costs off) select * from arrpart where a = '{1, 2}';
+QUERY PLAN
+--
+ Result
+   One-Time Filter: false
+(2 rows)
+
+explain (costs off) select * from arrpart where a in ('{4, 5}', '{1}');
+  QUERY PLAN  
+--
+ Append
+   ->  Seq Scan on arrpart1
+ Filter: ((a = '{4,5}'::integer[]) OR (a = '{1}'::integer[]))
+   ->  Seq Scan on arrpart2
+ Filter: ((a = '{4,5}'::integer[]) OR (a = '{1}'::integer[]))
+(5 rows)
+
+drop table arrpart;
+-- enum type list partition key

Re: pruning disabled for array, enum, record, range type partition keys

2018-04-09 Thread Amit Langote
On 2018/04/09 19:14, Amit Langote wrote:
> Hi.
> 
> I noticed that the newly added pruning does not work if the partition key
> is of one of the types that have a corresponding pseudo-type.
> 
> -- array type list partition key
> create table arrpart (a int[]) partition by list (a);
> create table arrpart1 partition of arrpart for values in ('{1}');
> create table arrpart2 partition of arrpart for values in ('{2, 3}', '{4, 5}');
> explain (costs off) select * from arrpart where a = '{1}';
>QUERY PLAN
> 
>  Append
>->  Seq Scan on arrpart1
>  Filter: (a = '{1}'::integer[])
>->  Seq Scan on arrpart2
>  Filter: (a = '{1}'::integer[])
> (5 rows)
> 
> For pruning, we normally rely on the type's operator class information in
> the system catalogs being up-to-date; if it isn't, we give up on pruning.
> For example, if the pg_amproc entry for a given type and AM (btree, hash,
> etc.) has not been populated, we may fail to prune using a clause that
> contains an expression of that type.  While that is the right behavior in
> the normal cases, we should make an exception for types that have a
> corresponding pseudo-type.  For those, we never have pg_amproc entries with
> the "real" type listed.  Instead, the pg_amproc entries contain the
> corresponding pseudo-type.  For example, there are no pg_amproc entries
> with int4[] (really, its OID) as amproclefttype and/or amprocrighttype;
> anyarray is listed there instead.
> 
> Attached is a patch that tries to fix this and adds relevant tests.

Added to the open items list.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Thanks,
Amit




Re: [HACKERS] path toward faster partition pruning

2018-04-09 Thread Amit Langote
Hi David.

On 2018/04/09 12:48, David Rowley wrote:
> While looking at the docs in [1], I saw that we still mention:
> 
> 4. Ensure that the constraint_exclusion configuration parameter is not
> disabled in postgresql.conf. If it is, queries will not be optimized
> as desired.
> 
> This is no longer true. The attached patch removed it.
> 
> [1] https://www.postgresql.org/docs/10/static/ddl-partitioning.htm
Thanks.  I was aware of the changes that would need to be made, but you
beat me to writing the patch itself.

About the patch:

While users no longer need to enable constraint_exclusion for select
queries, one still needs it for update/delete queries, because the new
pruning logic is invoked only for the former.  Alas...
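
To illustrate with a hypothetical session (table names are made up, not
taken from the docs):

create table tpart (a int) partition by list (a);
create table tpart1 partition of tpart for values in (1);
create table tpart2 partition of tpart for values in (2);
set constraint_exclusion = off;
explain (costs off) select * from tpart where a = 1;  -- still prunes tpart2
explain (costs off) delete from tpart where a = 1;    -- scans both partitions
set constraint_exclusion = partition;                 -- back to the default
explain (costs off) delete from tpart where a = 1;    -- prunes tpart2 again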

Also, further down on that page, there is a 5.10.4 Partitioning and
Constraint Exclusion sub-section.  I think it would also need some tweaks
due to new developments.

I updated your patch to fix that.  Please take a look.

Thanks,
Amit
From a4fe924936fe623ff95e6aa050b8fd7d22dbbb84 Mon Sep 17 00:00:00 2001
From: "dgrow...@gmail.com" 
Date: Mon, 9 Apr 2018 15:43:32 +1200
Subject: [PATCH v2] Remove mention of constraint_exclusion in partitioning
 docs

As of 9fdb675fc5d2, this GUC no longer has an effect on partition pruning.
---
 doc/src/sgml/ddl.sgml | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index feb2ab7792..eed8753e24 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -3194,13 +3194,14 @@ CREATE INDEX ON measurement (logdate);
   
  
 
-  <listitem>
-   <para>
-    Ensure that the <xref linkend="guc-constraint-exclusion">
-    configuration parameter is not disabled in <filename>postgresql.conf</filename>.
-    If it is, queries will not be optimized as desired.
-   </para>
-  </listitem>
+ <listitem>
+  <para>
+   Ensure that the <xref linkend="guc-constraint-exclusion">
+   configuration parameter is not disabled in <filename>postgresql.conf</filename>.
+   While enabling it is not required for select queries, not doing so will
+   result in update and delete queries not being optimized as desired.
+  </para>
+ </listitem>
 

 
@@ -3767,10 +3768,12 @@ ANALYZE measurement;

 

-    <firstterm>Constraint exclusion</firstterm> is a query optimization technique
-    that improves performance for partitioned tables defined in the
-    fashion described above (both declaratively partitioned tables and those
-    implemented using inheritance).  As an example:
+    <firstterm>Constraint exclusion</firstterm> is a query optimization
+    technique that improves performance for partitioned tables defined in the
+    fashion described above.  While it is used only for update and delete
+    queries in the case of declaratively partitioned tables, it is used for all
+    queries in the case of table partitioning implemented using inheritance.
+    As an example:
 
 
 SET constraint_exclusion = on;
-- 
2.11.0



Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
On 2018/04/09 17:50, Amit Langote wrote:
> Attached fixes that.  I see that Michael Paquier has added this to the
> open items list.  Thanks, Michael.
> 
> https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Oops, it was Tom who added that.  Thank you!

Regards,
Amit




Re: [HACKERS] MERGE SQL Statement for PG11

2018-04-06 Thread Amit Langote
On 2018/04/06 5:00, Andres Freund wrote:
> On 2018-04-05 11:31:48 +0530, Pavan Deolasee wrote:
>>> +   /*
>>> +* If there are not WHEN MATCHED actions, we are done.
>>> +*/
>>> +   if (mergeMatchedActionStates == NIL)
>>> +   return true;
>>>
>>> Maybe I'm confused, but why is mergeMatchedActionStates attached to
>>> per-partition info? How can this differ in number between partitions,
>>> requiring us to re-check it below fetching the partition info above?
>>>
>>>
>> Because each partition may have a columns in different order, dropped
>> attributes etc. So we need to give treatment to the quals/targetlists. See
>> ON CONFLICT DO UPDATE for similar code.
> 
> But the count wouldn't change, no? So we return before building the
> partition info if there's no MATCHED action?

Yeah, I think we should return at the top if there are no matched actions.

With the current code, even if there aren't any matched actions, we do
ExecFindPartitionByOid() and ExecInitPartitionInfo(), only to then see from
the partition's resultRelInfo that the action states list is empty.

I think there should be an Assert where the check for an empty action
states list currently is, and the check itself should be moved to the top
of the function, as done in the attached.

Thanks,
Amit
diff --git a/src/backend/executor/execMerge.c b/src/backend/executor/execMerge.c
index d39ddd3034..2ed041644f 100644
--- a/src/backend/executor/execMerge.c
+++ b/src/backend/executor/execMerge.c
@@ -174,6 +174,12 @@ ExecMergeMatched(ModifyTableState *mtstate, EState *estate,
ListCell   *l;
TupleTableSlot *saved_slot = slot;
 
+   /*
+	 * If there are no WHEN MATCHED actions, we are done.
+*/
+   if (resultRelInfo->ri_mergeState->matchedActionStates == NIL)
+   return true;
+
if (mtstate->mt_partition_tuple_routing)
{
Datum   datum;
@@ -220,12 +226,7 @@ ExecMergeMatched(ModifyTableState *mtstate, EState *estate,
 */
mergeMatchedActionStates =
resultRelInfo->ri_mergeState->matchedActionStates;
-
-   /*
-* If there are not WHEN MATCHED actions, we are done.
-*/
-   if (mergeMatchedActionStates == NIL)
-   return true;
+   Assert(mergeMatchedActionStates != NIL);
 
/*
 * Make tuple and any needed join variables available to ExecQual and


Re: [sqlsmith] Failed assertion on pfree() via perform_pruning_combine_step

2018-04-09 Thread Amit Langote
Hi Andreas.

On 2018/04/08 3:33, Andreas Seltenreich wrote:
> Hi,
> 
> testing with master at 039eb6e92f yielded another query triggering an
> assertion.

Thanks for the report.

> Backtrace and query against the regression database below.
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x7f25474cf42a in __GI_abort () at abort.c:89
> #2  0x556c14b75bb3 in ExceptionalCondition (conditionName=conditionName@entry=0x556c14d11510 "!(((context) != ((void *)0) && ((((const Node*)((context)))->type) == T_AllocSetContext) || ((((const Node*)((context)))->type) == T_SlabContext) || ((((const Node*)((context)))->type) == T_Generatio"..., errorType=errorType@entry=0x556c14bcac95 "BadArgument", fileName=fileName@entry=0x556c14d11480 "../../../../src/include/utils/memutils.h", lineNumber=lineNumber@entry=129) at assert.c:54
> #3  0x556c14ba28cb in GetMemoryChunkContext (pointer=<optimized out>) at ../../../../src/include/utils/memutils.h:129
> #4  pfree (pointer=<optimized out>) at mcxt.c:1033
> #5  0x556c1495fc01 in bms_int_members (a=<optimized out>, b=<optimized out>) at bitmapset.c:917
> #6  0x556c149d910a in perform_pruning_combine_step (context=0x7ffe80889b20, step_results=0x7f253e3efcc0, cstep=0x7f253e3f13a0) at partprune.c:2697
> #7  get_matching_partitions (context=0x7ffe80889b20, pruning_steps=<optimized out>) at partprune.c:317
> #8  0x556c149d9596 in prune_append_rel_partitions (rel=rel@entry=0x7f253e3eb3e8) at partprune.c:262
> #9  0x556c14989e21 in set_append_rel_size (rte=0x7f254819d810, rti=2, rel=0x7f253e3eb3e8, root=0x7f254819d3c8) at allpaths.c:907

[  ]

> select
>   sample_0.dd as c0,
>   subq_1.c3 as c1,
>   subq_1.c0 as c2,
>   subq_1.c2 as c3,
>   subq_1.c3 as c4,
>   sample_0.bb as c5,
>   subq_1.c0 as c6,
>   pg_catalog.pg_current_wal_flush_lsn() as c7,
>   public.func_with_bad_set() as c8,
>   sample_0.bb as c9,
>   sample_0.aa as c10,
>   sample_0.dd as c11
> from
>   public.d as sample_0 tablesample bernoulli (2.8) ,
>   lateral (select
>   subq_0.c1 as c0,
>   sample_0.aa as c1,
>   subq_0.c0 as c2,
>   sample_0.cc as c3,
>   subq_0.c0 as c4,
>   subq_0.c1 as c5
>   from
>   (select
> sample_1.a as c0,
> (select s from public.reloptions_test limit 1 offset 2)
>as c1
>   from
> public.pagg_tab_ml as sample_1 tablesample system (3.6)
>   where select c from public.test_tbl3 limit 1 offset 2)
>  <= cast(null as test_type3))
> or (((select n from testxmlschema.test2 limit 1 offset 1)
><= true)
>   or (sample_0.bb is not NULL)))
>   and ((true)
> or ((cast(null as varbit) >= (select varbitcol from 
> public.brintest limit 1 offset 3)
>   )
>   and ((select macaddrcol from public.brintest limit 1 offset 
> 6)
><> cast(null as macaddr)
> or ((sample_1.a is NULL)
>   and ((sample_1.c is not NULL)
> or (sample_1.c is NULL as subq_0
>   where (select salary from public.rtest_emp limit 1 offset 3)
>  = (select pg_catalog.min(newsal) from public.rtest_emplog)
> 
>   limit 13) as subq_1
> where sample_0.aa is NULL
> limit 140;

I have reproduced this and found that the problem is that
perform_pruning_combine_step forgets to *copy* the bitmapset of the first
step's result when handling a COMBINE_INTERSECT step.  The result therefore
aliases that step's bound_offsets set, which the subsequent
bms_int_members() call then modifies, and may pfree, in place.
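
For context, clauses that are ANDed together each get their own pruning
steps, whose results are then intersected in perform_pruning_combine_step.
A minimal sketch of a clause shape that should reach that path (my own
example, not a reproducer for the crash above):

create table ipart (a int) partition by list (a);
create table ipart1 partition of ipart for values in (1, 2);
create table ipart2 partition of ipart for values in (3, 4);
-- the OR arms produce a union combine step, whose result is then
-- intersected with the result of the step for "a < 4"
explain (costs off) select * from ipart where (a = 1 or a = 3) and a < 4;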

Attached fixes that.  I see that Michael Paquier has added this to the
open items list.  Thanks, Michael.

https://wiki.postgresql.org/wiki/PostgreSQL_11_Open_Items#Open_Issues

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 417e1fee81..0f0080928a 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -2923,7 +2923,8 @@ perform_pruning_combine_step(PartitionPruneContext *context,
if (firststep)
{
/* Copy step's result the first time. */
-				result->bound_offsets = step_result->bound_offsets;
+				result->bound_offsets =
+					bms_copy(step_result->bound_offsets);
 				result->scan_null = step_result->scan_null;
 				result->scan_default = step_result->scan_default;
firststep = false;


Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-04-07 Thread Amit Langote
On Sun, Apr 8, 2018 at 5:37 AM, Andres Freund  wrote:
> On 2018-04-06 19:25:20 -0400, Robert Haas wrote:
>> On Thu, Apr 5, 2018 at 6:21 AM, Etsuro Fujita
>>  wrote:
>> > Attached is an updated version of the patch set plus the patch in [1]. 
>> > Patch
>> > 0003_foreign-routing-fdwapi-6.patch can be applied on top of patch
>> > 0001_postgres-fdw-refactoring-6.patch and
>> > 0002_copy-from-check-constraint-fix.patch.
>>
>> Committed.  Let me say that you do nice work.
>
> The CF entry for this is still marked as 'ready for committer'. Does anything 
> remain?
>
> https://commitfest.postgresql.org/17/1184/

Nothing remains, so marked as committed.

Thanks,
Amit



Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 1:57, Robert Haas wrote:
>> It might be possible to do something better in each module by keeping
>> an array indexed by RTI which have each entry NULL initially then on
>> first relation_open set the element in the array to that pointer.
> 
> I'm not sure that makes a lot of sense in the planner, but in the
> executor it might be a good idea.   See also
> https://www.postgresql.org/message-id/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
> for related discussion.  I think that a coding pattern where we rely
> on relation_open(..., NoLock) is inherently dangerous -- it's too easy
> to be wrong about whether the lock is sure to have been taken.  It
> would be much better to open the relation once and hold onto the
> pointer, not just for performance reasons, but for robustness.

About the specific relation_open(.., NoLock) under question, I think there
might be a way to address this by opening the tables with the appropriate
lock mode in partitioned_rels list in ExecLockNonleafAppendTables and
keeping the Relation pointers around until ExecEndNode.  Then instead of
ExecSetupPartitionPruneState doing relation_open/close(.., NoLock), it
just reuses the one that's passed by the caller.

Attached a PoC patch.  David, thoughts?

Thanks,
Amit
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+							 Relation *partitioned_rels)
 {
PartitionPruningData *prunedata;
PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
pprune->subpart_map = pinfo->subpart_map;
 
/*
-		 * Grab some info from the table's relcache; lock was already obtained
-		 * by ExecLockNonLeafAppendTables.
+		 * ExecLockNonLeafAppendTables already opened the relation for us.
 */
-   rel = relation_open(pinfo->reloid, NoLock);
+   Assert(partitioned_rels[i] != NULL);
+   rel = partitioned_rels[i];
+   Assert(RelationGetRelid(rel) == pinfo->reloid);
 
partkey = RelationGetPartitionKey(rel);
partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 	prunestate->execparams = bms_add_members(prunestate->execparams,
 											 pinfo->execparams);
 
-   relation_close(rel, NoLock);
-
i++;
}
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+   EState *estate,
+   List *partitioned_rels)
 {
PlannedStmt *stmt = estate->es_plannedstmt;
ListCell   *lc;
+   int i;
 
+   /*
+* If we're going to be performing pruning, allocate space for Relation
+* pointers to be used later when setting up partition pruning state in
+* ExecSetupPartitionPruneState.
+*/
+   if (IsA(planstate, AppendState))
+   {
+   AppendState *appendstate = (AppendState *) planstate;
+   Append *node = (Append *) planstate->plan;
+
+   if (node->part_prune_infos != NIL)
+   {
+   Assert(list_length(node->part_prune_infos) ==
+  list_length(partitioned_rels));
+   appendstate->partitioned_rels = (Relation *)
+   

Re: [HACKERS] Runtime Partition Pruning

2018-04-12 Thread Amit Langote
On 2018/04/13 14:38, Amit Langote wrote:
> About the specific relation_open(.., NoLock) under question, I think there
> might be a way to address this by opening the tables with the appropriate
> lock mode in partitioned_rels list in ExecLockNonleafAppendTables

That may have sounded a bit confusing:

I meant to say: "by opening the tables in partitioned_rels list with the
appropriate lock mode in ExecLockNonleafAppendTables"

> Attached a PoC patch.
There were a couple of unnecessary hunks, which I removed in the attached.

Thanks,
Amit
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 11139f743d..e3b3911945 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1244,7 +1244,8 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
  * PartitionPruneInfo.
  */
 PartitionPruneState *
-ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
+ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo,
+							 Relation *partitioned_rels)
 {
PartitionPruningData *prunedata;
PartitionPruneState *prunestate;
@@ -1303,10 +1304,11 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
pprune->subpart_map = pinfo->subpart_map;
 
/*
-		 * Grab some info from the table's relcache; lock was already obtained
-		 * by ExecLockNonLeafAppendTables.
+		 * ExecLockNonLeafAppendTables already opened the relation for us.
 */
-   rel = relation_open(pinfo->reloid, NoLock);
+   Assert(partitioned_rels[i] != NULL);
+   rel = partitioned_rels[i];
+   Assert(RelationGetRelid(rel) == pinfo->reloid);
 
partkey = RelationGetPartitionKey(rel);
partdesc = RelationGetPartitionDesc(rel);
@@ -1336,8 +1338,6 @@ ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
 	prunestate->execparams = bms_add_members(prunestate->execparams,
 											 pinfo->execparams);
 
-   relation_close(rel, NoLock);
-
i++;
}
 
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae730..58a0961eb1 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -858,22 +858,49 @@ ShutdownExprContext(ExprContext *econtext, bool isCommit)
 /*
  * ExecLockNonLeafAppendTables
  *
- * Locks, if necessary, the tables indicated by the RT indexes contained in
- * the partitioned_rels list.  These are the non-leaf tables in the partition
- * tree controlled by a given Append or MergeAppend node.
+ * Opens and/or locks, if necessary, the tables indicated by the RT indexes
+ * contained in the partitioned_rels list.  These are the non-leaf tables in
+ * the partition tree controlled by a given Append or MergeAppend node.
  */
 void
-ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
+ExecLockNonLeafAppendTables(PlanState *planstate,
+   EState *estate,
+   List *partitioned_rels)
 {
PlannedStmt *stmt = estate->es_plannedstmt;
ListCell   *lc;
+   int i;
 
+   /*
+* If we're going to be performing pruning, allocate space for Relation
+* pointers to be used later when setting up partition pruning state in
+* ExecSetupPartitionPruneState.
+*/
+   if (IsA(planstate, AppendState))
+   {
+   AppendState *appendstate = (AppendState *) planstate;
+   Append *node = (Append *) planstate->plan;
+
+   if (node->part_prune_infos != NIL)
+   {
+   Assert(list_length(node->part_prune_infos) ==
+  list_length(partitioned_rels));
+			appendstate->partitioned_rels = (Relation *)
+				palloc(sizeof(Relation) *
+					   list_length(node->part_prune_infos));
+			appendstate->num_partitioned_rels =
+				list_length(node->part_prune_infos);
+   }
+   }
+
+   i = 0;
foreach(lc, partitioned_rels)
{
ListCell   *l;
Index   rti = lfirst_int(lc);
		bool		is_result_rel = false;
		Oid			relid = getrelid(rti, estate->es_range_table);
+   int  
