Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-26 Thread Tom Lane
Gregory Stark st...@enterprisedb.com writes:
 Here's a copy of the merge-append patch that I sent months ago merged up to
 head. I haven't really added any additional functionality since then.

I looked at the planner part of this a little bit.  I think that it's
confusing an append that produces an ordered result with an append
that's doing a merge.  Those concepts need to be kept separate, because
as soon as we have a real concept of partitioned tables it will be
possible to have known-ordered output from a simple append of indexscans
on the partition key.  So you need an explicit flag to say we'll do
a merge, not just rely on whether the path has pathkeys.

As your comments note, the current approach to deciding which ordered
paths to generate is pretty unworkable --- it won't scale nicely at all
for large numbers of child tables.  One random idea is to take some
specific one of the children (the largest, probably) as the leader
and consider only ordered-appends generating the same pathkeys as are
available for the leader.  I approve of the fact that the code will
consider force-sorting children that are missing a way to match the
pathkeys, but presumably we don't want to do that on any but small
tables, so this seems like a possibly usable approach.

Speaking of sorting, it's not entirely clear to me how the patch ensures
that all the child plans produce the necessary sort keys as output
columns, and especially not how it ensures that they all get produced in
the *same* output columns.  This might accidentally manage to work
because of the throwaway call to make_sort_from_pathkeys(), but at the
very least that's a misleading comment.

 The other pending question is the same I had back when I originally submitted
 it. I don't really understand what's going on with eclasses and what
 invariants we're aiming to maintain with them.

For non-equivalence-class qualifications, we translate the parent rel's
quals to match the child's columns using adjust_appendrel_attrs().
(This isn't necessarily a no-op because child rels might have different
physical numbers for inherited columns.)  The child-eclass stuff is just
a mechanism to be able to generate suitably translated quals for the
cases where quals are being deduced from eclasses instead of presented
directly.  I don't remember offhand whether there are any special
considerations for the associated pathkeys, but it seems possible that
it would Just Work --- especially if the patch did anything useful for
you at all ;-).  In any case, I'm amazed that it's not failing
regression tests all over the place with those critical tests in
make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs.  Perhaps
we need some more regression tests...

In the same vein, the hack to short circuit the append stuff for
a single child node is simply wrong, because it doesn't allow for column
number variances.  Please remove it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-25 Thread Jaime Casanova
On Sun, Jul 5, 2009 at 7:23 PM, Greg Starkst...@mit.edu wrote:

 Here's a copy of the merge-append patch that I sent months ago merged up to
 head. I haven't really added any additional functionality since then.

 Can you provide some more details about the objective of this patch?  Or a
 link to previous discussion?


i was trying to test this one but i can't find a query that produces a
diferent plan than in 8.4.0, attached my current test just in case...
what kind of query is this intended to help?

something, maybe style dependant, that i don't like is the definition
of  LAPPEND_PATH_FLATTEN_APPENDPATHS macro at some point in the middle
of the file i prefer they be defined at the top (just my
preference)... or there is a reason for doing it there?

another thing a don't like is those #ifdef FIXME surrounding already
existing if, why are those? and if they need to be fixed why there
isn't a comment explaining what the fix is or what it should behave?

-- 
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157
drop database if exists mergeappend;
create database mergeappend;

\c mergeappend

create table tab1 (col1 int);
create table tab1_part1  (check (col1 between1 and 1000)) inherits (tab1);
create table tab1_part2  (check (col1 between 1001 and 2000)) inherits (tab1);
create table tab1_part3  (check (col1 between 2001 and 3000)) inherits (tab1);
create table tab1_part4  (check (col1 between 3001 and 4000)) inherits (tab1);
create table tab1_part5  (check (col1 between 4001 and 5000)) inherits (tab1);
create table tab1_part6  (check (col1 between 5001 and 6000)) inherits (tab1);
create table tab1_part7  (check (col1 between 6001 and 7000)) inherits (tab1);
create table tab1_part8  (check (col1 between 7001 and 8000)) inherits (tab1);
create table tab1_part9  (check (col1 between 8001 and 9000)) inherits (tab1);
create table tab1_part10 (check (col1 between 9001 and )) inherits (tab1);

insert into tab1_part1  select generate_series(   1, 1000); 
insert into tab1_part2  select generate_series(1001, 2000); 
insert into tab1_part3  select generate_series(2001, 3000); 
insert into tab1_part4  select generate_series(3001, 4000); 
insert into tab1_part5  select generate_series(4001, 5000); 
insert into tab1_part6  select generate_series(5001, 6000); 
insert into tab1_part7  select generate_series(6001, 7000); 
insert into tab1_part8  select generate_series(7001, 8000); 
insert into tab1_part9  select generate_series(8001, 9000); 
insert into tab1_part10 select generate_series(9001, ); 

create index idx1_tab1_part1  on tab1_part1(col1);
create index idx1_tab1_part2  on tab1_part2(col1);
create index idx1_tab1_part3  on tab1_part3(col1);
create index idx1_tab1_part4  on tab1_part4(col1);
create index idx1_tab1_part5  on tab1_part5(col1);
create index idx1_tab1_part6  on tab1_part6(col1);
create index idx1_tab1_part7  on tab1_part7(col1);
create index idx1_tab1_part8  on tab1_part8(col1);
create index idx1_tab1_part9  on tab1_part9(col1);
create index idx1_tab1_part10 on tab1_part10(col1);

analyze;
set enable_sort to off;

explain analyze 
select a.* from tab1 a, tab1 b
 where a.col1 = b.col1
--   and ((a.col1 =  997 and a.col1 = 1000) or 
--(a.col1 = 4999 and a.col1 = 5000) or
--(a.col1 = 5995 and a.col1 = 6000))
 order by a.col1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-25 Thread Greg Stark
On Sat, Jul 25, 2009 at 8:12 PM, Jaime
Casanovajcasa...@systemguards.com.ec wrote:
 i was trying to test this one but i can't find a query that produces a
 diferent plan than in 8.4.0, attached my current test just in case...
 what kind of query is this intended to help?

You may have to disable enable_seqscan to get simple examples like this to work:

select * from partitioned_table order by indexed_column;

more complex examples would trigger it naturally such as:

select * from partitioned_table where active order by indexed_column
(with an index on indexed_column where active)

or

select * from partitioned_table where indexed_column between x and y
order by indexed_column


 something, maybe style dependant, that i don't like is the definition
 of  LAPPEND_PATH_FLATTEN_APPENDPATHS macro at some point in the middle
 of the file i prefer they be defined at the top (just my
 preference)... or there is a reason for doing it there?

well it's only used in that one function, it's just some code which is
repeated three times and would obscure what's going on if it were
inlined.

 another thing a don't like is those #ifdef FIXME surrounding already
 existing if, why are those? and if they need to be fixed why there
 isn't a comment explaining what the fix is or what it should behave?

Yeah, if I knew how to fix them then this patch wouldn't be stuck
waiting for feedback... :(


-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-25 Thread Jaime Casanova
On Sat, Jul 25, 2009 at 2:26 PM, Greg Starkst...@mit.edu wrote:

 more complex examples would trigger it naturally such as:

 select * from partitioned_table where active order by indexed_column
 (with an index on indexed_column where active)

 or

 select * from partitioned_table where indexed_column between x and y
 order by indexed_column


look at the example, i had three OR'ed conditions on col1 wich is
indexed (in the script those where commented but i tried it first) and
i get the same plan than in 8.4

but you're right with enable_seqscan to off i get a better plan,
attaching explain analyze in 8.4 and 8.5 (with seqscan to on and off)


 another thing a don't like is those #ifdef FIXME surrounding already
 existing if, why are those? and if they need to be fixed why there
 isn't a comment explaining what the fix is or what it should behave?

 Yeah, if I knew how to fix them then this patch wouldn't be stuck
 waiting for feedback... :(


and what's the problem with those if? as someone says before feel free
to speak slowly and draw pictures ;)

-- 
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157


explain_analyze_84.out.gz
Description: GNU Zip compressed data


explain_analyze_85dev_seqscan_off.out.gz
Description: GNU Zip compressed data


explain_analyze_85dev_seqscan_on.out.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-14 Thread Heikki Linnakangas
I've added this to the July commitfest.

Gregory Stark wrote:
 Here's a copy of the merge-append patch that I sent months ago merged up to
 head. I haven't really added any additional functionality since then.
 
 Heikki suggested I separate the Append and MergeAppend nodes into two executor
 nodes. I had that half done in my tree but looking it over it leads to a lot
 of duplicated code and a strange effect that there's on Path node but two
 Executor nodes which seems strange. I'm not sure which way to go here but at
 least for now I'm leaving it this way since it's less code to write. If we
 want it the other way to commit then I'll do it.
 
 The other pending question is the same I had back when I originally submitted
 it. I don't really understand what's going on with eclasses and what
 invariants we're aiming to maintain with them. I don't see a problem tossing
 all the child relation attributes into the same eclass even though they're not
 strictly speaking equivalent. No join above the append path is going to see
 the child attributes anyways. But that might be shortsighted as I'm not really
 sure what the consequences are and what other uses we have envisioned for
 eclasses in the future.
 
 
 
 
 
 
 
 
 
 


-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-07 Thread Teodor Sigaev
Can you provide some more details about the objective of this patch?  Or 
a link to previous discussion?


Suppose, Greg's patch could be modified to support order OR index scans:
SELECT ... WHERE (c  10 AND c  20) OR (c  100 AND C  110) ORDER BY c DESC

with plan:
Result
   - Append
   - Backward index scan (c  100 AND C  110)
   - Backward index scan (c  10 AND C  20)

I suggested a similar patch two years ago: 
http://archives.postgresql.org/message-id/45742c51.9020...@sigaev.ru (4-th 
point) and subsequent discussion 
http://archives.postgresql.org/pgsql-patches/2006-12/msg00029.php


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Merge Append Patch merged up to 85devel

2009-07-05 Thread Gregory Stark

Here's a copy of the merge-append patch that I sent months ago merged up to
head. I haven't really added any additional functionality since then.

Heikki suggested I separate the Append and MergeAppend nodes into two executor
nodes. I had that half done in my tree but looking it over it leads to a lot
of duplicated code and a strange effect that there's on Path node but two
Executor nodes which seems strange. I'm not sure which way to go here but at
least for now I'm leaving it this way since it's less code to write. If we
want it the other way to commit then I'll do it.

The other pending question is the same I had back when I originally submitted
it. I don't really understand what's going on with eclasses and what
invariants we're aiming to maintain with them. I don't see a problem tossing
all the child relation attributes into the same eclass even though they're not
strictly speaking equivalent. No join above the append path is going to see
the child attributes anyways. But that might be shortsighted as I'm not really
sure what the consequences are and what other uses we have envisioned for
eclasses in the future.

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 0938e94..cf0f3a1 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,9 +59,25 @@
 
 #include executor/execdebug.h
 #include executor/nodeAppend.h
+#include utils/lsyscache.h
+#include access/nbtree.h
+
+/* It gets quite confusing having a heap array (indexed by integers) which
+ * contains integers which index into the slots array. These typedefs try to
+ * clear it up but without making simple inline accessing functions they don't
+ * actually produce any warnings on mistakes */
+
+typedef int SlotNumber;
+typedef int HeapPosition;
+
+#define WHICHPLAN_PLANS_UNINITIALIZED (-1)
 
 static bool exec_append_initialize_next(AppendState *appendstate);
 
+static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2);
+
+static void heap_siftup_slot(AppendState *node);
+static void heap_insert_slot(AppendState *node, SlotNumber new_slot);
 
 /* 
  *		exec_append_initialize_next
@@ -177,12 +193,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
 		appendstate-as_firstplan = tplan;
 		appendstate-as_lastplan = tplan;
+		appendstate-as_is_ordered = false;
 	}
 	else
 	{
 		/* normal case, scan all subplans */
 		appendstate-as_firstplan = 0;
 		appendstate-as_lastplan = nplans - 1;
+		appendstate-as_is_ordered = node-isOrdered;
 	}
 
 	/*
@@ -224,11 +242,50 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ExecAssignResultTypeFromTL(appendstate-ps);
 	appendstate-ps.ps_ProjInfo = NULL;
 
-	/*
-	 * return the result from the first subplan's initialization
-	 */
-	appendstate-as_whichplan = appendstate-as_firstplan;
-	exec_append_initialize_next(appendstate);
+	if (!appendstate-as_is_ordered) 
+	{
+		/*
+		 * return the result from the first subplan's initialization
+		 */
+		appendstate-as_whichplan = appendstate-as_firstplan;
+		exec_append_initialize_next(appendstate);
+	} else {
+		/* set up scan keys and initialize *all* the subnodes */
+		int i;
+
+		appendstate-as_nkeys = node-numCols;
+		appendstate-as_scankeys = palloc(sizeof(ScanKeyData) *  node-numCols);
+		appendstate-as_slots = palloc(sizeof(TupleTableSlot *) * nplans);
+		appendstate-as_heap = palloc(sizeof(int) * nplans);
+		appendstate-as_heap_size = 0;
+
+		for (i=0; i  nplans; i++)
+		{
+			appendstate-as_whichplan = i;
+			exec_append_initialize_next(appendstate);
+		}
+
+		appendstate-as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED;
+
+		for (i=0; i  node-numCols; i++)
+		{
+			Oid sortFunction;
+			bool reverse;
+
+			get_compare_function_for_ordering_op(node-sortOperators[i],
+ sortFunction, reverse);
+
+			ScanKeyInit(appendstate-as_scankeys[i],
+		node-sortColIdx[i],
+		InvalidStrategy,
+		sortFunction,
+		(Datum)0);
+			if (reverse)
+appendstate-as_scankeys[i].sk_flags |= SK_BT_DESC;
+			if (node-nullsFirst[i])
+appendstate-as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST;
+		}
+	}
 
 	return appendstate;
 }
@@ -253,47 +310,168 @@ ExecCountSlotsAppend(Append *node)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	for (;;)
-	{
-		PlanState  *subnode;
-		TupleTableSlot *result;
+	if (!node-as_is_ordered)
+		for (;;)
+		{
+			PlanState  *subnode;
+			TupleTableSlot *result;
 
-		/*
-		 * figure out which subplan we are currently processing
-		 */
-		subnode = node-appendplans[node-as_whichplan];
+			/*
+			 * figure out which subplan we are currently processing
+			 */
+			subnode = node-appendplans[node-as_whichplan];
 
-		/*
-		 * get a tuple from the subplan
-		 */
-		result = ExecProcNode(subnode);
+			/*
+			 * get a tuple from the subplan
+			 */
+			result = ExecProcNode(subnode);
+
+			if (!TupIsNull(result))
+			{
+/*
+ * If the subplan gave 

[HACKERS] Merge Append Patch merged up to 85devel

2009-07-05 Thread Gregory Stark

Here's a copy of the merge-append patch that I sent months ago merged up to
head. I haven't really added any additional functionality since then.

Heikki suggested I separate the Append and MergeAppend nodes into two executor
nodes. I had that half done in my tree but looking it over it leads to a lot
of duplicated code and a strange effect that there's on Path node but two
Executor nodes which seems strange. I'm not sure which way to go here but at
least for now I'm leaving it this way since it's less code to write. If we
want it the other way to commit then I'll do it.

The other pending question is the same I had back when I originally submitted
it. I don't really understand what's going on with eclasses and what
invariants we're aiming to maintain with them. I don't see a problem tossing
all the child relation attributes into the same eclass even though they're not
strictly speaking equivalent. No join above the append path is going to see
the child attributes anyways. But that might be shortsighted as I'm not really
sure what the consequences are and what other uses we have envisioned for
eclasses in the future.

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 0938e94..cf0f3a1 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,9 +59,25 @@
 
 #include executor/execdebug.h
 #include executor/nodeAppend.h
+#include utils/lsyscache.h
+#include access/nbtree.h
+
+/* It gets quite confusing having a heap array (indexed by integers) which
+ * contains integers which index into the slots array. These typedefs try to
+ * clear it up but without making simple inline accessing functions they don't
+ * actually produce any warnings on mistakes */
+
+typedef int SlotNumber;
+typedef int HeapPosition;
+
+#define WHICHPLAN_PLANS_UNINITIALIZED (-1)
 
 static bool exec_append_initialize_next(AppendState *appendstate);
 
+static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2);
+
+static void heap_siftup_slot(AppendState *node);
+static void heap_insert_slot(AppendState *node, SlotNumber new_slot);
 
 /* 
  *		exec_append_initialize_next
@@ -177,12 +193,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 
 		appendstate-as_firstplan = tplan;
 		appendstate-as_lastplan = tplan;
+		appendstate-as_is_ordered = false;
 	}
 	else
 	{
 		/* normal case, scan all subplans */
 		appendstate-as_firstplan = 0;
 		appendstate-as_lastplan = nplans - 1;
+		appendstate-as_is_ordered = node-isOrdered;
 	}
 
 	/*
@@ -224,11 +242,50 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ExecAssignResultTypeFromTL(appendstate-ps);
 	appendstate-ps.ps_ProjInfo = NULL;
 
-	/*
-	 * return the result from the first subplan's initialization
-	 */
-	appendstate-as_whichplan = appendstate-as_firstplan;
-	exec_append_initialize_next(appendstate);
+	if (!appendstate-as_is_ordered) 
+	{
+		/*
+		 * return the result from the first subplan's initialization
+		 */
+		appendstate-as_whichplan = appendstate-as_firstplan;
+		exec_append_initialize_next(appendstate);
+	} else {
+		/* set up scan keys and initialize *all* the subnodes */
+		int i;
+
+		appendstate-as_nkeys = node-numCols;
+		appendstate-as_scankeys = palloc(sizeof(ScanKeyData) *  node-numCols);
+		appendstate-as_slots = palloc(sizeof(TupleTableSlot *) * nplans);
+		appendstate-as_heap = palloc(sizeof(int) * nplans);
+		appendstate-as_heap_size = 0;
+
+		for (i=0; i  nplans; i++)
+		{
+			appendstate-as_whichplan = i;
+			exec_append_initialize_next(appendstate);
+		}
+
+		appendstate-as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED;
+
+		for (i=0; i  node-numCols; i++)
+		{
+			Oid sortFunction;
+			bool reverse;
+
+			get_compare_function_for_ordering_op(node-sortOperators[i],
+ sortFunction, reverse);
+
+			ScanKeyInit(appendstate-as_scankeys[i],
+		node-sortColIdx[i],
+		InvalidStrategy,
+		sortFunction,
+		(Datum)0);
+			if (reverse)
+appendstate-as_scankeys[i].sk_flags |= SK_BT_DESC;
+			if (node-nullsFirst[i])
+appendstate-as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST;
+		}
+	}
 
 	return appendstate;
 }
@@ -253,47 +310,168 @@ ExecCountSlotsAppend(Append *node)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	for (;;)
-	{
-		PlanState  *subnode;
-		TupleTableSlot *result;
+	if (!node-as_is_ordered)
+		for (;;)
+		{
+			PlanState  *subnode;
+			TupleTableSlot *result;
 
-		/*
-		 * figure out which subplan we are currently processing
-		 */
-		subnode = node-appendplans[node-as_whichplan];
+			/*
+			 * figure out which subplan we are currently processing
+			 */
+			subnode = node-appendplans[node-as_whichplan];
 
-		/*
-		 * get a tuple from the subplan
-		 */
-		result = ExecProcNode(subnode);
+			/*
+			 * get a tuple from the subplan
+			 */
+			result = ExecProcNode(subnode);
+
+			if (!TupIsNull(result))
+			{
+/*
+ * If the subplan gave 

Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-05 Thread Robert Haas

On Jul 5, 2009, at 10:02 AM, Gregory Stark st...@mit.edu wrote:
Here's a copy of the merge-append patch that I sent months ago  
merged up to

head. I haven't really added any additional functionality since then.

Heikki suggested I separate the Append and MergeAppend nodes into  
two executor
nodes. I had that half done in my tree but looking it over it leads  
to a lot
of duplicated code and a strange effect that there's on Path node  
but two
Executor nodes which seems strange. I'm not sure which way to go  
here but at
least for now I'm leaving it this way since it's less code to write.  
If we

want it the other way to commit then I'll do it.

The other pending question is the same I had back when I originally  
submitted

it. I don't really understand what's going on with eclasses and what
invariants we're aiming to maintain with them. I don't see a problem  
tossing
all the child relation attributes into the same eclass even though  
they're not
strictly speaking equivalent. No join above the append path is  
going to see
the child attributes anyways. But that might be shortsighted as I'm  
not really
sure what the consequences are and what other uses we have  
envisioned for

eclasses in the future.


Can you provide some more details about the objective of this patch?   
Or a link to previous discussion?


Thanks,

...Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Append Patch merged up to 85devel

2009-07-05 Thread Greg Stark
On Sun, Jul 5, 2009 at 10:34 PM, Robert Haasrobertmh...@gmail.com wrote:
 On Jul 5, 2009, at 10:02 AM, Gregory Stark st...@mit.edu wrote:

 Here's a copy of the merge-append patch that I sent months ago merged up to
 head. I haven't really added any additional functionality since then.

 Can you provide some more details about the objective of this patch?  Or a
 link to previous discussion?


It's one piece of the puzzle of how to deal with partitioned tables
more completely.

The basic problem is that currently the planner assumes all Append
nodes produce unordered output. That means there's no way for the
planner to use any indexes on partitions to produce a desired
ordering. That means it's impossible to do a merge join or to satisfy
an ORDER BY clause without introducing a sort even if you have
matching indexes on every partition.

Lots of common cases arise with partitioned tables where this kicks
in. Many of them will be eliminated as our partitioning support gets
more intelligent and is able to recognize how the partitioning key
relates to the requested order or collapse singleton partition scans
in cases where the parent table currently interferes. However there
will always be cases where those mechanisms to simplify the plan
sufficiently and we end up with an Append node of unordered partitions
and we need this as a back-stop to avoid awful plans.

The merge-append node works by teaching the Append node what sort
order it's aiming to produce. The append node then keeps a heap of
slots and returns the tuple from the top child plan replacing it in
the heap with the next plan.

This produces plans like this:

  QUERY PLAN

 Result  (cost=0.20..489.00 rows=9600 width=4)
   -  Append  (cost=0.20..489.00 rows=9600 width=4)
 -  Index Scan using p_pkey on p  (cost=0.00..80.25 rows=2400 width=4)
 -  Index Scan using p1_pkey on p1 p  (cost=0.00..80.25
rows=2400 width=4)
 -  Index Scan using p2_pkey on p2 p  (cost=0.00..80.25
rows=2400 width=4)
 -  Index Scan using p3_pkey on p3 p  (cost=0.00..80.25
rows=2400 width=4)
(6 rows)

Instead of plans like this which is the best we can do today:

QUERY PLAN
--
 Sort  (cost=770.98..794.98 rows=9600 width=4)
   Sort Key: public.p.i
   -  Result  (cost=0.00..136.00 rows=9600 width=4)
 -  Append  (cost=0.00..136.00 rows=9600 width=4)
   -  Seq Scan on p  (cost=0.00..34.00 rows=2400 width=4)
   -  Seq Scan on p1 p  (cost=0.00..34.00 rows=2400 width=4)
   -  Seq Scan on p2 p  (cost=0.00..34.00 rows=2400 width=4)
   -  Seq Scan on p3 p  (cost=0.00..34.00 rows=2400 width=4)
(8 rows)


-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers