Re: [HACKERS] Merge Append Patch merged up to 85devel
Gregory Stark st...@enterprisedb.com writes: Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. I looked at the planner part of this a little bit. I think that it's confusing an append that produces an ordered result with an append that's doing a merge. Those concepts need to be kept separate, because as soon as we have a real concept of partitioned tables it will be possible to have known-ordered output from a simple append of indexscans on the partition key. So you need an explicit flag to say we'll do a merge, not just rely on whether the path has pathkeys. As your comments note, the current approach to deciding which ordered paths to generate is pretty unworkable --- it won't scale nicely at all for large numbers of child tables. One random idea is to take some specific one of the children (the largest, probably) as the leader and consider only ordered-appends generating the same pathkeys as are available for the leader. I approve of the fact that the code will consider force-sorting children that are missing a way to match the pathkeys, but presumably we don't want to do that on any but small tables, so this seems like a possibly usable approach. Speaking of sorting, it's not entirely clear to me how the patch ensures that all the child plans produce the necessary sort keys as output columns, and especially not how it ensures that they all get produced in the *same* output columns. This might accidentally manage to work because of the throwaway call to make_sort_from_pathkeys(), but at the very least that's a misleading comment. The other pending question is the same I had back when I originally submitted it. I don't really understand what's going on with eclasses and what invariants we're aiming to maintain with them. For non-equivalence-class qualifications, we translate the parent rel's quals to match the child's columns using adjust_appendrel_attrs(). (This isn't necessarily a no-op because child rels might have different physical numbers for inherited columns.) The child-eclass stuff is just a mechanism to be able to generate suitably translated quals for the cases where quals are being deduced from eclasses instead of presented directly. I don't remember offhand whether there are any special considerations for the associated pathkeys, but it seems possible that it would Just Work --- especially if the patch did anything useful for you at all ;-). In any case, I'm amazed that it's not failing regression tests all over the place with those critical tests in make_sort_from_pathkeys lobotomized by random #ifdef FIXMEs. Perhaps we need some more regression tests... In the same vein, the hack to short circuit the append stuff for a single child node is simply wrong, because it doesn't allow for column number variances. Please remove it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
On Sun, Jul 5, 2009 at 7:23 PM, Greg Starkst...@mit.edu wrote: Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Can you provide some more details about the objective of this patch? Or a link to previous discussion? i was trying to test this one but i can't find a query that produces a diferent plan than in 8.4.0, attached my current test just in case... what kind of query is this intended to help? something, maybe style dependant, that i don't like is the definition of LAPPEND_PATH_FLATTEN_APPENDPATHS macro at some point in the middle of the file i prefer they be defined at the top (just my preference)... or there is a reason for doing it there? another thing a don't like is those #ifdef FIXME surrounding already existing if, why are those? and if they need to be fixed why there isn't a comment explaining what the fix is or what it should behave? -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157 drop database if exists mergeappend; create database mergeappend; \c mergeappend create table tab1 (col1 int); create table tab1_part1 (check (col1 between1 and 1000)) inherits (tab1); create table tab1_part2 (check (col1 between 1001 and 2000)) inherits (tab1); create table tab1_part3 (check (col1 between 2001 and 3000)) inherits (tab1); create table tab1_part4 (check (col1 between 3001 and 4000)) inherits (tab1); create table tab1_part5 (check (col1 between 4001 and 5000)) inherits (tab1); create table tab1_part6 (check (col1 between 5001 and 6000)) inherits (tab1); create table tab1_part7 (check (col1 between 6001 and 7000)) inherits (tab1); create table tab1_part8 (check (col1 between 7001 and 8000)) inherits (tab1); create table tab1_part9 (check (col1 between 8001 and 9000)) inherits (tab1); create table tab1_part10 (check (col1 between 9001 and )) inherits (tab1); insert into tab1_part1 select generate_series( 1, 1000); insert into tab1_part2 select generate_series(1001, 2000); insert into tab1_part3 select generate_series(2001, 3000); insert into tab1_part4 select generate_series(3001, 4000); insert into tab1_part5 select generate_series(4001, 5000); insert into tab1_part6 select generate_series(5001, 6000); insert into tab1_part7 select generate_series(6001, 7000); insert into tab1_part8 select generate_series(7001, 8000); insert into tab1_part9 select generate_series(8001, 9000); insert into tab1_part10 select generate_series(9001, ); create index idx1_tab1_part1 on tab1_part1(col1); create index idx1_tab1_part2 on tab1_part2(col1); create index idx1_tab1_part3 on tab1_part3(col1); create index idx1_tab1_part4 on tab1_part4(col1); create index idx1_tab1_part5 on tab1_part5(col1); create index idx1_tab1_part6 on tab1_part6(col1); create index idx1_tab1_part7 on tab1_part7(col1); create index idx1_tab1_part8 on tab1_part8(col1); create index idx1_tab1_part9 on tab1_part9(col1); create index idx1_tab1_part10 on tab1_part10(col1); analyze; set enable_sort to off; explain analyze select a.* from tab1 a, tab1 b where a.col1 = b.col1 -- and ((a.col1 = 997 and a.col1 = 1000) or --(a.col1 = 4999 and a.col1 = 5000) or --(a.col1 = 5995 and a.col1 = 6000)) order by a.col1 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
On Sat, Jul 25, 2009 at 8:12 PM, Jaime Casanovajcasa...@systemguards.com.ec wrote: i was trying to test this one but i can't find a query that produces a diferent plan than in 8.4.0, attached my current test just in case... what kind of query is this intended to help? You may have to disable enable_seqscan to get simple examples like this to work: select * from partitioned_table order by indexed_column; more complex examples would trigger it naturally such as: select * from partitioned_table where active order by indexed_column (with an index on indexed_column where active) or select * from partitioned_table where indexed_column between x and y order by indexed_column something, maybe style dependant, that i don't like is the definition of LAPPEND_PATH_FLATTEN_APPENDPATHS macro at some point in the middle of the file i prefer they be defined at the top (just my preference)... or there is a reason for doing it there? well it's only used in that one function, it's just some code which is repeated three times and would obscure what's going on if it were inlined. another thing a don't like is those #ifdef FIXME surrounding already existing if, why are those? and if they need to be fixed why there isn't a comment explaining what the fix is or what it should behave? Yeah, if I knew how to fix them then this patch wouldn't be stuck waiting for feedback... :( -- greg http://mit.edu/~gsstark/resume.pdf -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
On Sat, Jul 25, 2009 at 2:26 PM, Greg Starkst...@mit.edu wrote: more complex examples would trigger it naturally such as: select * from partitioned_table where active order by indexed_column (with an index on indexed_column where active) or select * from partitioned_table where indexed_column between x and y order by indexed_column look at the example, i had three OR'ed conditions on col1 wich is indexed (in the script those where commented but i tried it first) and i get the same plan than in 8.4 but you're right with enable_seqscan to off i get a better plan, attaching explain analyze in 8.4 and 8.5 (with seqscan to on and off) another thing a don't like is those #ifdef FIXME surrounding already existing if, why are those? and if they need to be fixed why there isn't a comment explaining what the fix is or what it should behave? Yeah, if I knew how to fix them then this patch wouldn't be stuck waiting for feedback... :( and what's the problem with those if? as someone says before feel free to speak slowly and draw pictures ;) -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157 explain_analyze_84.out.gz Description: GNU Zip compressed data explain_analyze_85dev_seqscan_off.out.gz Description: GNU Zip compressed data explain_analyze_85dev_seqscan_on.out.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
I've added this to the July commitfest. Gregory Stark wrote: Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Heikki suggested I separate the Append and MergeAppend nodes into two executor nodes. I had that half done in my tree but looking it over it leads to a lot of duplicated code and a strange effect that there's on Path node but two Executor nodes which seems strange. I'm not sure which way to go here but at least for now I'm leaving it this way since it's less code to write. If we want it the other way to commit then I'll do it. The other pending question is the same I had back when I originally submitted it. I don't really understand what's going on with eclasses and what invariants we're aiming to maintain with them. I don't see a problem tossing all the child relation attributes into the same eclass even though they're not strictly speaking equivalent. No join above the append path is going to see the child attributes anyways. But that might be shortsighted as I'm not really sure what the consequences are and what other uses we have envisioned for eclasses in the future. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
Can you provide some more details about the objective of this patch? Or a link to previous discussion? Suppose, Greg's patch could be modified to support order OR index scans: SELECT ... WHERE (c 10 AND c 20) OR (c 100 AND C 110) ORDER BY c DESC with plan: Result - Append - Backward index scan (c 100 AND C 110) - Backward index scan (c 10 AND C 20) I suggested a similar patch two years ago: http://archives.postgresql.org/message-id/45742c51.9020...@sigaev.ru (4-th point) and subsequent discussion http://archives.postgresql.org/pgsql-patches/2006-12/msg00029.php -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Merge Append Patch merged up to 85devel
Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Heikki suggested I separate the Append and MergeAppend nodes into two executor nodes. I had that half done in my tree but looking it over it leads to a lot of duplicated code and a strange effect that there's on Path node but two Executor nodes which seems strange. I'm not sure which way to go here but at least for now I'm leaving it this way since it's less code to write. If we want it the other way to commit then I'll do it. The other pending question is the same I had back when I originally submitted it. I don't really understand what's going on with eclasses and what invariants we're aiming to maintain with them. I don't see a problem tossing all the child relation attributes into the same eclass even though they're not strictly speaking equivalent. No join above the append path is going to see the child attributes anyways. But that might be shortsighted as I'm not really sure what the consequences are and what other uses we have envisioned for eclasses in the future. diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 0938e94..cf0f3a1 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,9 +59,25 @@ #include executor/execdebug.h #include executor/nodeAppend.h +#include utils/lsyscache.h +#include access/nbtree.h + +/* It gets quite confusing having a heap array (indexed by integers) which + * contains integers which index into the slots array. These typedefs try to + * clear it up but without making simple inline accessing functions they don't + * actually produce any warnings on mistakes */ + +typedef int SlotNumber; +typedef int HeapPosition; + +#define WHICHPLAN_PLANS_UNINITIALIZED (-1) static bool exec_append_initialize_next(AppendState *appendstate); +static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2); + +static void heap_siftup_slot(AppendState *node); +static void heap_insert_slot(AppendState *node, SlotNumber new_slot); /* * exec_append_initialize_next @@ -177,12 +193,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate-as_firstplan = tplan; appendstate-as_lastplan = tplan; + appendstate-as_is_ordered = false; } else { /* normal case, scan all subplans */ appendstate-as_firstplan = 0; appendstate-as_lastplan = nplans - 1; + appendstate-as_is_ordered = node-isOrdered; } /* @@ -224,11 +242,50 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ExecAssignResultTypeFromTL(appendstate-ps); appendstate-ps.ps_ProjInfo = NULL; - /* - * return the result from the first subplan's initialization - */ - appendstate-as_whichplan = appendstate-as_firstplan; - exec_append_initialize_next(appendstate); + if (!appendstate-as_is_ordered) + { + /* + * return the result from the first subplan's initialization + */ + appendstate-as_whichplan = appendstate-as_firstplan; + exec_append_initialize_next(appendstate); + } else { + /* set up scan keys and initialize *all* the subnodes */ + int i; + + appendstate-as_nkeys = node-numCols; + appendstate-as_scankeys = palloc(sizeof(ScanKeyData) * node-numCols); + appendstate-as_slots = palloc(sizeof(TupleTableSlot *) * nplans); + appendstate-as_heap = palloc(sizeof(int) * nplans); + appendstate-as_heap_size = 0; + + for (i=0; i nplans; i++) + { + appendstate-as_whichplan = i; + exec_append_initialize_next(appendstate); + } + + appendstate-as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED; + + for (i=0; i node-numCols; i++) + { + Oid sortFunction; + bool reverse; + + get_compare_function_for_ordering_op(node-sortOperators[i], + sortFunction, reverse); + + ScanKeyInit(appendstate-as_scankeys[i], + node-sortColIdx[i], + InvalidStrategy, + sortFunction, + (Datum)0); + if (reverse) +appendstate-as_scankeys[i].sk_flags |= SK_BT_DESC; + if (node-nullsFirst[i]) +appendstate-as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST; + } + } return appendstate; } @@ -253,47 +310,168 @@ ExecCountSlotsAppend(Append *node) TupleTableSlot * ExecAppend(AppendState *node) { - for (;;) - { - PlanState *subnode; - TupleTableSlot *result; + if (!node-as_is_ordered) + for (;;) + { + PlanState *subnode; + TupleTableSlot *result; - /* - * figure out which subplan we are currently processing - */ - subnode = node-appendplans[node-as_whichplan]; + /* + * figure out which subplan we are currently processing + */ + subnode = node-appendplans[node-as_whichplan]; - /* - * get a tuple from the subplan - */ - result = ExecProcNode(subnode); + /* + * get a tuple from the subplan + */ + result = ExecProcNode(subnode); + + if (!TupIsNull(result)) + { +/* + * If the subplan gave
[HACKERS] Merge Append Patch merged up to 85devel
Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Heikki suggested I separate the Append and MergeAppend nodes into two executor nodes. I had that half done in my tree but looking it over it leads to a lot of duplicated code and a strange effect that there's on Path node but two Executor nodes which seems strange. I'm not sure which way to go here but at least for now I'm leaving it this way since it's less code to write. If we want it the other way to commit then I'll do it. The other pending question is the same I had back when I originally submitted it. I don't really understand what's going on with eclasses and what invariants we're aiming to maintain with them. I don't see a problem tossing all the child relation attributes into the same eclass even though they're not strictly speaking equivalent. No join above the append path is going to see the child attributes anyways. But that might be shortsighted as I'm not really sure what the consequences are and what other uses we have envisioned for eclasses in the future. diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 0938e94..cf0f3a1 100644 --- a/src/backend/executor/nodeAppend.c +++ b/src/backend/executor/nodeAppend.c @@ -59,9 +59,25 @@ #include executor/execdebug.h #include executor/nodeAppend.h +#include utils/lsyscache.h +#include access/nbtree.h + +/* It gets quite confusing having a heap array (indexed by integers) which + * contains integers which index into the slots array. These typedefs try to + * clear it up but without making simple inline accessing functions they don't + * actually produce any warnings on mistakes */ + +typedef int SlotNumber; +typedef int HeapPosition; + +#define WHICHPLAN_PLANS_UNINITIALIZED (-1) static bool exec_append_initialize_next(AppendState *appendstate); +static int heap_compare_slots(AppendState *node, SlotNumber slot1, SlotNumber slot2); + +static void heap_siftup_slot(AppendState *node); +static void heap_insert_slot(AppendState *node, SlotNumber new_slot); /* * exec_append_initialize_next @@ -177,12 +193,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags) appendstate-as_firstplan = tplan; appendstate-as_lastplan = tplan; + appendstate-as_is_ordered = false; } else { /* normal case, scan all subplans */ appendstate-as_firstplan = 0; appendstate-as_lastplan = nplans - 1; + appendstate-as_is_ordered = node-isOrdered; } /* @@ -224,11 +242,50 @@ ExecInitAppend(Append *node, EState *estate, int eflags) ExecAssignResultTypeFromTL(appendstate-ps); appendstate-ps.ps_ProjInfo = NULL; - /* - * return the result from the first subplan's initialization - */ - appendstate-as_whichplan = appendstate-as_firstplan; - exec_append_initialize_next(appendstate); + if (!appendstate-as_is_ordered) + { + /* + * return the result from the first subplan's initialization + */ + appendstate-as_whichplan = appendstate-as_firstplan; + exec_append_initialize_next(appendstate); + } else { + /* set up scan keys and initialize *all* the subnodes */ + int i; + + appendstate-as_nkeys = node-numCols; + appendstate-as_scankeys = palloc(sizeof(ScanKeyData) * node-numCols); + appendstate-as_slots = palloc(sizeof(TupleTableSlot *) * nplans); + appendstate-as_heap = palloc(sizeof(int) * nplans); + appendstate-as_heap_size = 0; + + for (i=0; i nplans; i++) + { + appendstate-as_whichplan = i; + exec_append_initialize_next(appendstate); + } + + appendstate-as_whichplan = WHICHPLAN_PLANS_UNINITIALIZED; + + for (i=0; i node-numCols; i++) + { + Oid sortFunction; + bool reverse; + + get_compare_function_for_ordering_op(node-sortOperators[i], + sortFunction, reverse); + + ScanKeyInit(appendstate-as_scankeys[i], + node-sortColIdx[i], + InvalidStrategy, + sortFunction, + (Datum)0); + if (reverse) +appendstate-as_scankeys[i].sk_flags |= SK_BT_DESC; + if (node-nullsFirst[i]) +appendstate-as_scankeys[i].sk_flags |= SK_BT_NULLS_FIRST; + } + } return appendstate; } @@ -253,47 +310,168 @@ ExecCountSlotsAppend(Append *node) TupleTableSlot * ExecAppend(AppendState *node) { - for (;;) - { - PlanState *subnode; - TupleTableSlot *result; + if (!node-as_is_ordered) + for (;;) + { + PlanState *subnode; + TupleTableSlot *result; - /* - * figure out which subplan we are currently processing - */ - subnode = node-appendplans[node-as_whichplan]; + /* + * figure out which subplan we are currently processing + */ + subnode = node-appendplans[node-as_whichplan]; - /* - * get a tuple from the subplan - */ - result = ExecProcNode(subnode); + /* + * get a tuple from the subplan + */ + result = ExecProcNode(subnode); + + if (!TupIsNull(result)) + { +/* + * If the subplan gave
Re: [HACKERS] Merge Append Patch merged up to 85devel
On Jul 5, 2009, at 10:02 AM, Gregory Stark st...@mit.edu wrote: Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Heikki suggested I separate the Append and MergeAppend nodes into two executor nodes. I had that half done in my tree but looking it over it leads to a lot of duplicated code and a strange effect that there's on Path node but two Executor nodes which seems strange. I'm not sure which way to go here but at least for now I'm leaving it this way since it's less code to write. If we want it the other way to commit then I'll do it. The other pending question is the same I had back when I originally submitted it. I don't really understand what's going on with eclasses and what invariants we're aiming to maintain with them. I don't see a problem tossing all the child relation attributes into the same eclass even though they're not strictly speaking equivalent. No join above the append path is going to see the child attributes anyways. But that might be shortsighted as I'm not really sure what the consequences are and what other uses we have envisioned for eclasses in the future. Can you provide some more details about the objective of this patch? Or a link to previous discussion? Thanks, ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Merge Append Patch merged up to 85devel
On Sun, Jul 5, 2009 at 10:34 PM, Robert Haasrobertmh...@gmail.com wrote: On Jul 5, 2009, at 10:02 AM, Gregory Stark st...@mit.edu wrote: Here's a copy of the merge-append patch that I sent months ago merged up to head. I haven't really added any additional functionality since then. Can you provide some more details about the objective of this patch? Or a link to previous discussion? It's one piece of the puzzle of how to deal with partitioned tables more completely. The basic problem is that currently the planner assumes all Append nodes produce unordered output. That means there's no way for the planner to use any indexes on partitions to produce a desired ordering. That means it's impossible to do a merge join or to satisfy an ORDER BY clause without introducing a sort even if you have matching indexes on every partition. Lots of common cases arise with partitioned tables where this kicks in. Many of them will be eliminated as our partitioning support gets more intelligent and is able to recognize how the partitioning key relates to the requested order or collapse singleton partition scans in cases where the parent table currently interferes. However there will always be cases where those mechanisms to simplify the plan sufficiently and we end up with an Append node of unordered partitions and we need this as a back-stop to avoid awful plans. The merge-append node works by teaching the Append node what sort order it's aiming to produce. The append node then keeps a heap of slots and returns the tuple from the top child plan replacing it in the heap with the next plan. This produces plans like this: QUERY PLAN Result (cost=0.20..489.00 rows=9600 width=4) - Append (cost=0.20..489.00 rows=9600 width=4) - Index Scan using p_pkey on p (cost=0.00..80.25 rows=2400 width=4) - Index Scan using p1_pkey on p1 p (cost=0.00..80.25 rows=2400 width=4) - Index Scan using p2_pkey on p2 p (cost=0.00..80.25 rows=2400 width=4) - Index Scan using p3_pkey on p3 p (cost=0.00..80.25 rows=2400 width=4) (6 rows) Instead of plans like this which is the best we can do today: QUERY PLAN -- Sort (cost=770.98..794.98 rows=9600 width=4) Sort Key: public.p.i - Result (cost=0.00..136.00 rows=9600 width=4) - Append (cost=0.00..136.00 rows=9600 width=4) - Seq Scan on p (cost=0.00..34.00 rows=2400 width=4) - Seq Scan on p1 p (cost=0.00..34.00 rows=2400 width=4) - Seq Scan on p2 p (cost=0.00..34.00 rows=2400 width=4) - Seq Scan on p3 p (cost=0.00..34.00 rows=2400 width=4) (8 rows) -- greg http://mit.edu/~gsstark/resume.pdf -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers