Re: [HACKERS] upper planner path-ification
On Tue, Jun 23, 2015 at 4:41 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: So, unless we don't find out a solution around planner, 2-phase aggregation is like a curry without rice Simon and I spoke with Tom about this upper planner path-ification problem at PGCon, and he indicated that he intended to work on it soon, and that we should bug him about it if he doesn't. I'm pleased to here this as I seem to have a special talent in that area. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
-Original Message- From: David Rowley [mailto:david.row...@2ndquadrant.com] Sent: Tuesday, June 23, 2015 2:06 PM To: Kaigai Kouhei(海外 浩平) Cc: Robert Haas; pgsql-hackers@postgresql.org; Tom Lane Subject: Re: [HACKERS] upper planner path-ification On 23 June 2015 at 13:55, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Once we support to add aggregation path during path consideration, we need to pay attention morphing of the final target-list according to the intermediate path combination, tentatively chosen. For example, if partial-aggregation makes sense from cost perspective; like SUM(NRows) of partial COUNT(*) AS NRows instead of COUNT(*) on billion rows, planner also has to adjust the final target-list according to the interim paths. In this case, final output shall be SUM(), instead of COUNT(). This sounds very much like what's been discussed here: http://www.postgresql.org/message-id/CA+U5nMJ92azm0Yt8TT=hNxFP=VjFhDqFpaWfmj +66-4zvcg...@mail.gmail.com The basic concept is that we add another function set to aggregates that allow the combination of 2 states. For the case of MIN() and MAX() this will just be the same as the transfn. SUM() is similar for many types, more complex for others. I've quite likely just borrowed SUM(BIGINT)'s transfer functions to allow COUNT()'s to be combined. STDDEV, VARIANCE and relevant can be constructed using nrows, sum(X) and sum(X^2). REGR_*, COVAR_* and relevant can be constructed using nrows, sum(X), sum(Y), sum(X^2), sum(Y^2) and sum(X*Y). Let me introduce a positive side effect of this approach. Because final aggregate function processes values already aggregated partially, the difference between the state value and transition value gets relatively small. It reduces accidental errors around floating-point calculation. :-) More time does need spent inventing the new combining functions that don't currently exist, but that shouldn't matter as it can be done later. Commitfest link to patch here https://commitfest.postgresql.org/5/131/ I see you've signed up to review it! Yes, all of us looks at same direction. Problem is, we have to cross the mountain of the planner enhancement to reach all the valuable: - parallel aggregation - aggregation before join - remote aggregation via FDW So, unless we don't find out a solution around planner, 2-phase aggregation is like a curry without rice Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas Sent: Tuesday, June 23, 2015 10:18 PM To: Kaigai Kouhei(海外 浩平) Cc: David Rowley; pgsql-hackers@postgresql.org; Tom Lane Subject: Re: [HACKERS] upper planner path-ification On Tue, Jun 23, 2015 at 4:41 AM, Kouhei Kaigai kai...@ak.jp.nec.com wrote: So, unless we don't find out a solution around planner, 2-phase aggregation is like a curry without rice Simon and I spoke with Tom about this upper planner path-ification problem at PGCon, and he indicated that he intended to work on it soon, and that we should bug him about it if he doesn't. I'm pleased to here this as I seem to have a special talent in that area. It's fantastic support arm. So, it may be more productive to pay my efforts for investigation of David Rowley's patch, based on the assumption if we can add partial/final-aggregation path node as if scan/join-paths. My interest also stands on his infrastructure. https://cs.uwaterloo.ca/research/tr/1993/46/file.pdf It is a bit old paper but good starting point. Thanks, -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
-Original Message- From: pgsql-hackers-ow...@postgresql.org [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas Sent: Thursday, May 14, 2015 10:39 AM To: pgsql-hackers@postgresql.org; Tom Lane Subject: [HACKERS] upper planner path-ification Hi, I've been pulling over Tom's occasional remarks about redoing grouping_planner - and maybe further layers of the planner - to work with Paths instead of Plans. I've had difficulty locating all of the relevant threads. Here's everything I've found so far, which I'm pretty sure is not everything: http://www.postgresql.org/message-id/17400.1311716...@sss.pgh.pa.us http://www.postgresql.org/message-id/2614.1375730...@sss.pgh.pa.us http://www.postgresql.org/message-id/22721.1385048...@sss.pgh.pa.us http://www.postgresql.org/message-id/BANLkTinDjjFHNOzESG2J2U4GOkqLu69Zqg@mai l.gmail.com http://www.postgresql.org/message-id/8479.1418420...@sss.pgh.pa.us I think there are two separate problems here. First, there's the problem that grouping_planner() is complicated. It's doing cost comparisons, but all in ad-hoc fashion rather than using a consistent mechanic the way add_path() does. Generally, we can plan an aggregate using either (1) a hashed aggregate, (2) a sorted aggregate, or (3) for min or max, an index scan that just grabs the highest or lowest value in lieu of a full table scan. Instead of generating a plan for each of these possibilities, we'd like to generate paths for each one, and then pick one to turn into a plan. AIUI, the hope is that this would simplify the cost calculations, and also make it easier to inject other paths, such as a path where an FDW performs the aggregation step on the remote server. Second, there's the problem that we might like to order aggregates with respect to joins. If we have something like SELECT DISTINCT ON (foo.x) foo.x, foo.y, bar.y FROM foo, bar WHERE foo.x = bar.x, then (a) if foo.x has many duplicates, it will be better to DISTINCT-ify foo and then join to bar afterwards but (b) if foo.x = bar.x is highly selective, it will be better to join to bar first and then DISTINCT-ify the result. Currently, aggregation is always last; that's not great. Hitoshi Harada's proposed strategy of essentially figuring out where the aggregation steps can go and then re-planning for each one is also not great, because each join problem will be a subtree of the one we use for the aggregate-last strategy, and thus we're wasting effort by planning the same subtrees multiple times. Instead, we might imagine letting grouping planner fish out the best paths for the joinrels that represent possible aggregation points, generate aggregation paths for each of those, and then work out what additional rels need to be joined afterwards. That sounds hard, but not as hard as doing something sensible with what we have today. Once we support to add aggregation path during path consideration, we need to pay attention morphing of the final target-list according to the intermediate path combination, tentatively chosen. For example, if partial-aggregation makes sense from cost perspective; like SUM(NRows) of partial COUNT(*) AS NRows instead of COUNT(*) on billion rows, planner also has to adjust the final target-list according to the interim paths. In this case, final output shall be SUM(), instead of COUNT(). I think, this planner enhancement needs a mechanism to inform planner expected final output for each Path. It means, individual Path node can have a replacement rule: which target-entry should have what expression instead, once this path is applied. In above case, PartialAggPath will have an indication; SUM(nrows) shall be applied on the location where COUNT(*) is originally placed. In case of relations join that involves paths with this replacement rules, it may be a bit complicated if an expression takes argument come from both relations; like CORR(inner_rel.X, outer_rel.Y), I'm not sure whether we can have more broken-down partial aggregate. However, relations on either of inner/outer side are responsible to the columns in inner/outer side. Unless all the arguments in aggregate function come from same side, it should not block aggregate push- down as long as estimated cost makes sense. That is a probably makes a problem when we allow to add aggregate path as a part of partial plan tree. Thanks, I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. Thoughts? -- NEC Business Creation Division / PG-Strom Project KaiGai Kohei kai...@ak.jp.nec.com -- Sent
Re: [HACKERS] upper planner path-ification
On 23 June 2015 at 13:55, Kouhei Kaigai kai...@ak.jp.nec.com wrote: Once we support to add aggregation path during path consideration, we need to pay attention morphing of the final target-list according to the intermediate path combination, tentatively chosen. For example, if partial-aggregation makes sense from cost perspective; like SUM(NRows) of partial COUNT(*) AS NRows instead of COUNT(*) on billion rows, planner also has to adjust the final target-list according to the interim paths. In this case, final output shall be SUM(), instead of COUNT(). This sounds very much like what's been discussed here: http://www.postgresql.org/message-id/CA+U5nMJ92azm0Yt8TT=hNxFP=vjfhdqfpawfmj+66-4zvcg...@mail.gmail.com The basic concept is that we add another function set to aggregates that allow the combination of 2 states. For the case of MIN() and MAX() this will just be the same as the transfn. SUM() is similar for many types, more complex for others. I've quite likely just borrowed SUM(BIGINT)'s transfer functions to allow COUNT()'s to be combined. More time does need spent inventing the new combining functions that don't currently exist, but that shouldn't matter as it can be done later. Commitfest link to patch here https://commitfest.postgresql.org/5/131/ I see you've signed up to review it! Regards David Rowley -- David Rowley http://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] upper planner path-ification
Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom Hm. That's a hangover from when query_planner also gave back a Tom Plan (singular) rather than a set of Paths. I don't see any Tom fundamental reason why we couldn't generalize it to be a list of Tom potentially useful output orderings rather than just one. But I'm Tom a bit concerned about the ensuing growth in planning time; is it Tom really all that useful? The planning time growth is a possible concern, yes. The potential gain is eliminating one sort step, in the case when the input has a usable sorted path but grouping_planner happens not to ask for it (when there's more than just a single rollup, the code currently asks for one of the sort orders pretty much arbitrarily since it has no real way to know otherwise). Whether that would justify it... I don't know. Maybe that's one to save for later to see if there's any feedback from actual use. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
At Tue, 19 May 2015 09:04:00 -0400, Robert Haas robertmh...@gmail.com wrote in CA+TgmobAV3_DS1sXA+PFWkjvX1K-VNiAnMMJrzPfD43g=-4...@mail.gmail.com On Tue, May 19, 2015 at 7:19 AM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom Hm. That's a hangover from when query_planner also gave back a Tom Plan (singular) rather than a set of Paths. I don't see any Tom fundamental reason why we couldn't generalize it to be a list of Tom potentially useful output orderings rather than just one. But I'm Tom a bit concerned about the ensuing growth in planning time; is it Tom really all that useful? The planning time growth is a possible concern, yes. The potential gain is eliminating one sort step, in the case when the input has a usable sorted path but grouping_planner happens not to ask for it (when there's more than just a single rollup, the code currently asks for one of the sort orders pretty much arbitrarily since it has no real way to know otherwise). Whether that would justify it... I don't know. Maybe that's one to save for later to see if there's any feedback from actual use. I kind of doubt that the growth in planning time would be anything too unreasonable. We already consider multiple orderings for ordinary base relations, so it's not very obvious why consideration multiple orderings for subqueries would be any worse. If we can arrange to throw away useless orderings early, as we do in other cases, then any extra paths we consider have a reasonable chance of being useful. Though I don't think that the simple path-ification of what is currently done make it grow in any degree, it could rapidly grow if we unconditionally construct extra upper-paths using the previously-abandoned extra paths or make them involved in join considerations. But the growth in planning time could be kept reasonable if we pay attention so that, as we have done so, the additional optimization schemes to have simple and precise bailing-out logic even if they require some complicated calculation or yields more extra paths. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Tue, May 19, 2015 at 7:19 AM, Andrew Gierth and...@tao11.riddles.org.uk wrote: Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom Hm. That's a hangover from when query_planner also gave back a Tom Plan (singular) rather than a set of Paths. I don't see any Tom fundamental reason why we couldn't generalize it to be a list of Tom potentially useful output orderings rather than just one. But I'm Tom a bit concerned about the ensuing growth in planning time; is it Tom really all that useful? The planning time growth is a possible concern, yes. The potential gain is eliminating one sort step, in the case when the input has a usable sorted path but grouping_planner happens not to ask for it (when there's more than just a single rollup, the code currently asks for one of the sort orders pretty much arbitrarily since it has no real way to know otherwise). Whether that would justify it... I don't know. Maybe that's one to save for later to see if there's any feedback from actual use. I kind of doubt that the growth in planning time would be anything too unreasonable. We already consider multiple orderings for ordinary base relations, so it's not very obvious why consideration multiple orderings for subqueries would be any worse. If we can arrange to throw away useless orderings early, as we do in other cases, then any extra paths we consider have a reasonable chance of being useful. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Andrew Gierth and...@tao11.riddles.org.uk writes: Incidentally, the most obvious obstacle to better planning of grouping sets in the sorted cases is not so much how to pick paths in grouping_planner itself, but rather the fact that query_planner wants to be given only one sort order. Is there any prospect for improvement there? Hm. That's a hangover from when query_planner also gave back a Plan (singular) rather than a set of Paths. I don't see any fundamental reason why we couldn't generalize it to be a list of potentially useful output orderings rather than just one. But I'm a bit concerned about the ensuing growth in planning time; is it really all that useful? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Tom == Tom Lane t...@sss.pgh.pa.us writes: Hrm, ok. So for the near future, we should leave it more or less as-is? We don't have a timescale yet, but it's our intention to submit a hashagg support patch for grouping sets as soon as time permits. Tom Well, mumble. I keep saying that I want to tackle path-ification Tom in that area, and I keep not finding the time to actually do it. Tom So I'm hesitant to tell you that you should wait on it. But Tom certainly I think that it'll be a lot easier to get hashagg Tom costing done in that framework than in what currently exists. Incidentally, the most obvious obstacle to better planning of grouping sets in the sorted cases is not so much how to pick paths in grouping_planner itself, but rather the fact that query_planner wants to be given only one sort order. Is there any prospect for improvement there? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Sun, May 17, 2015 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Rather than adding tlists per se to Paths, I've been vaguely toying with a notion of identifying all the interesting subexpressions in a query (expensive functions, aggregates, etc), giving them indexes 1..n, and then marking Paths with bitmapsets showing which interesting subexpressions they can produce values for. This would make things like does this Path compute all the needed aggregates much cheaper to deal with than a raw tlist representation would do. But maybe that's still not the best way. I don't know, but it seems like this might be pulling in the opposite direction from your previously-stated desire to get subquery_planner to output Paths rather than Plans as soon as possible. I'm inclined to think that we need to make a start on this in whatever way is simplest, and then refine it later. Adding tlists to (some?) paths might not be the most beautiful way to get there, but there's a lot of important stuff piling up behind making some kind of a start on this refactoring, and we can't just keep postponing all of that stuff indefinitely. Or if we do, the project as a whole will be worse off for it, I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Robert Haas robertmh...@gmail.com writes: On Sun, May 17, 2015 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Rather than adding tlists per se to Paths, I've been vaguely toying with a notion of identifying all the interesting subexpressions in a query (expensive functions, aggregates, etc), giving them indexes 1..n, and then marking Paths with bitmapsets showing which interesting subexpressions they can produce values for. This would make things like does this Path compute all the needed aggregates much cheaper to deal with than a raw tlist representation would do. But maybe that's still not the best way. I don't know, but it seems like this might be pulling in the opposite direction from your previously-stated desire to get subquery_planner to output Paths rather than Plans as soon as possible. Sorry, I didn't mean to suggest that that necessarily had to happen right away. What we do need right away, though, is *some* design for distinguishing Paths for the different possible upper-level steps. I won't cry if we change it around later, but we have to have something to start with. So for the moment, let's assume that we still rigidly follow the sequence of upper-level steps currently embodied in grouping_planner. (I'm not sure if it even makes sense to consider other orderings of those processing steps, but in any case we don't need to allow it on day zero.) Then, make a dummy RelOptInfo corresponding to the result of each step, and insert links to those in new fields in PlannerInfo. (We create these *before* starting scan/join planning, so that FDWs, custom scans, etc, can inject paths into these RelOptInfos if they want, so as to represent cases like remote aggregation.) Then just use add_path with the appropriate target RelOptInfo when producing different ways to do grouping etc. This is a bit ad-hoc but it would be a place to start. Comments? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On 18 May 2015 at 14:50, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Sun, May 17, 2015 at 12:11 PM, Tom Lane t...@sss.pgh.pa.us wrote: Rather than adding tlists per se to Paths, I've been vaguely toying with a notion of identifying all the interesting subexpressions in a query (expensive functions, aggregates, etc), giving them indexes 1..n, and then marking Paths with bitmapsets showing which interesting subexpressions they can produce values for. This would make things like does this Path compute all the needed aggregates much cheaper to deal with than a raw tlist representation would do. But maybe that's still not the best way. I don't know, but it seems like this might be pulling in the opposite direction from your previously-stated desire to get subquery_planner to output Paths rather than Plans as soon as possible. Sorry, I didn't mean to suggest that that necessarily had to happen right away. What we do need right away, though, is *some* design for distinguishing Paths for the different possible upper-level steps. I won't cry if we change it around later, but we have to have something to start with. So for the moment, let's assume that we still rigidly follow the sequence of upper-level steps currently embodied in grouping_planner. (I'm not sure if it even makes sense to consider other orderings of those processing steps, but in any case we don't need to allow it on day zero.) Then, make a dummy RelOptInfo corresponding to the result of each step, and insert links to those in new fields in PlannerInfo. (We create these *before* starting scan/join planning, so that FDWs, custom scans, etc, can inject paths into these RelOptInfos if they want, so as to represent cases like remote aggregation.) Then just use add_path with the appropriate target RelOptInfo when producing different ways to do grouping etc. This is a bit ad-hoc but it would be a place to start. Comments? My thinking was to push aggregation down to the lowest level possible in the plan, hopefully a single relation. That way we can generate paths for the current grouping_planner options as well as others, such as these * Push down aggregate prior to a join (which might then affect join planning) * Allow parallel queries to follow a scan-aggregate-collectfromslaves-aggregate strategy (hence need for double aggregation semantics) * Allow a lookaside to a Mat View rather than do a scan-aggregate (assume for now these are maintained correctly) * Allow a lookaside to an alternate datastore/mechanism via CustomScan (assume these are maintained correctly) all of which need to be costed against each other and the current strategies (aggregate last). The above proposal sounds like it will do that, but not completely sure. I'm assuming the O(N^2) Mat View planning problem can be solved in part by recognizing that many MVs are just single-table plus aggregates, and that we'd have a small enough number of MVs in play that search would not be a problem in practice. I'm also aware that LIMIT is still very badly optimized, so I'm hoping it helps there also. -- Simon Riggshttp://www.2ndQuadrant.com/ http://www.2ndquadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] upper planner path-ification
Simon Riggs si...@2ndquadrant.com writes: On 18 May 2015 at 14:50, Tom Lane t...@sss.pgh.pa.us wrote: So for the moment, let's assume that we still rigidly follow the sequence of upper-level steps currently embodied in grouping_planner. (I'm not sure if it even makes sense to consider other orderings of those processing steps, but in any case we don't need to allow it on day zero.) Then, make a dummy RelOptInfo corresponding to the result of each step, and insert links to those in new fields in PlannerInfo. (We create these *before* starting scan/join planning, so that FDWs, custom scans, etc, can inject paths into these RelOptInfos if they want, so as to represent cases like remote aggregation.) Then just use add_path with the appropriate target RelOptInfo when producing different ways to do grouping etc. This is a bit ad-hoc but it would be a place to start. My thinking was to push aggregation down to the lowest level possible in the plan, hopefully a single relation. In this sketch, it's basically up to an FDW to recognize when remote aggregation is possible, and put a suitable ForeignScan path into the RelOptInfo that corresponds to aggregate output. It could do that for either a single foreign relation or a foreign join, if it had enough smarts. This is more or less where we are today for foreign queries of all types: the responsibility is totally on the FDW to figure out whether it can optimize the query. I have hopes that we can later provide some infrastructure that would assist FDWs in that, so they're not all reinventing the wheel. But I don't yet have much idea what such infrastructure ought to look like, and it's probably hopeless to try to design it until things have stabilized more. Most of the other stuff you're talking about is entirely off the table until we think about ways that aggregation could be done below the top level, which this sketch isn't pretending to address. I agree with Robert that that's probably best left till later. There's a fair amount of work to do before we even have this much done. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Mon, May 18, 2015 at 2:50 PM, Tom Lane t...@sss.pgh.pa.us wrote: I don't know, but it seems like this might be pulling in the opposite direction from your previously-stated desire to get subquery_planner to output Paths rather than Plans as soon as possible. Sorry, I didn't mean to suggest that that necessarily had to happen right away. What we do need right away, though, is *some* design for distinguishing Paths for the different possible upper-level steps. I won't cry if we change it around later, but we have to have something to start with. That sounds good to me. So for the moment, let's assume that we still rigidly follow the sequence of upper-level steps currently embodied in grouping_planner. (I'm not sure if it even makes sense to consider other orderings of those processing steps, but in any case we don't need to allow it on day zero.) Then, make a dummy RelOptInfo corresponding to the result of each step, and insert links to those in new fields in PlannerInfo. (We create these *before* starting scan/join planning, so that FDWs, custom scans, etc, can inject paths into these RelOptInfos if they want, so as to represent cases like remote aggregation.) Then just use add_path with the appropriate target RelOptInfo when producing different ways to do grouping etc. This is a bit ad-hoc but it would be a place to start. Comments? That sounds reasonable, but I think the key thing is to nail down what the new Path types are going to look like. If we know that, then rearranging grouping_planner becomes a matter of adjusting things piece by piece for the new data structures. Or at least I think it does. I think it would also be an excellent idea to split each of those upper level steps embodied in grouping_planner into its own function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Robert == Robert Haas robertmh...@gmail.com writes: Robert I think grouping_planner() is badly in need of some refactoring Robert just to make it shorter. It's over 1000 lines of code, which Robert IMHO is a fairly ridiculous length for a single function. If there's interest, we could do that specific task as part of adding hashagg support for grouping sets (which would otherwise make it even longer), or as preparatory work for that. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Andrew Gierth and...@tao11.riddles.org.uk writes: Robert == Robert Haas robertmh...@gmail.com writes: Robert I think grouping_planner() is badly in need of some refactoring Robert just to make it shorter. It's over 1000 lines of code, which Robert IMHO is a fairly ridiculous length for a single function. If there's interest, we could do that specific task as part of adding hashagg support for grouping sets (which would otherwise make it even longer), or as preparatory work for that. I think that refactoring without changing anything about the way it works will be painful and probably ultimately a dead end. As an example, the current_pathkeys local variable is state that's needed throughout that process, so you'd need some messy notation to pass it around (unless you stuck it into PlannerRoot, which would be ugly too). But that would go away in a path-ified universe, because each Path would be marked as to its output sort order. More, a lot of the code that you'd be relocating is code that we should be trying to get rid of altogether, not just relocate (to wit all the stuff that does cost-based comparisons of alternatives). So I'm all for refactoring, but I think it will happen as a natural byproduct of path-ification, and otherwise would be rather forced. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Tom == Tom Lane t...@sss.pgh.pa.us writes: If there's interest, we could do that specific task as part of adding hashagg support for grouping sets (which would otherwise make it even longer), or as preparatory work for that. Tom I think that refactoring without changing anything about the way Tom it works will be painful and probably ultimately a dead end. As Tom an example, the current_pathkeys local variable is state that's Tom needed throughout that process, so you'd need some messy notation Tom to pass it around (unless you stuck it into PlannerRoot, which Tom would be ugly too). But that would go away in a path-ified Tom universe, because each Path would be marked as to its output sort Tom order. More, a lot of the code that you'd be relocating is code Tom that we should be trying to get rid of altogether, not just Tom relocate (to wit all the stuff that does cost-based comparisons of Tom alternatives). Tom So I'm all for refactoring, but I think it will happen as a natural Tom byproduct of path-ification, and otherwise would be rather forced. Hrm, ok. So for the near future, we should leave it more or less as-is? We don't have a timescale yet, but it's our intention to submit a hashagg support patch for grouping sets as soon as time permits. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Andrew Gierth and...@tao11.riddles.org.uk writes: Tom == Tom Lane t...@sss.pgh.pa.us writes: Tom So I'm all for refactoring, but I think it will happen as a natural Tom byproduct of path-ification, and otherwise would be rather forced. Hrm, ok. So for the near future, we should leave it more or less as-is? We don't have a timescale yet, but it's our intention to submit a hashagg support patch for grouping sets as soon as time permits. Well, mumble. I keep saying that I want to tackle path-ification in that area, and I keep not finding the time to actually do it. So I'm hesitant to tell you that you should wait on it. But certainly I think that it'll be a lot easier to get hashagg costing done in that framework than in what currently exists. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Robert Haas robertmh...@gmail.com writes: So, getting back to this part, what's the value of returning a list of Paths rather than a list of Plans? (1) less work, since we don't have to fill in details not needed for costing purposes; (2) paths carry info that the planner wants but the executor doesn't, notably sort-order annotations. target lists are normally computed when paths are converted to plans, but for the higher-level plan nodes adding by grouping_planner, the path list is typically just passed in. So would the new path types be expected to carry target lists of their own, or would they need to figure out the target list on the fly at plan generation time? Yeah, that is something I've been struggling with while thinking about this. I don't really want to add tlists as such to Paths, but it's not very clear how else to annotate a Path as to what it computes, and that seems like an annotation we have to have in some form in order to convert these planning steps into a Path universe. There are other cases where it would be useful to have some notion of this kind. An example is that right now, if you have an expression index on an expensive function and a query that wants the value of that function, the planner is able to extract the value from the index --- but there is nothing that gives any cost benefit to doing so, so it's just as likely to choose some other index and eat the cost of recalculating the function. It seems like the only way to fix that in a principled fashion is to have some concept that the index-scan Path can produce the function value, and then when we come to some appropriate costing step, penalize the other paths for having to compute the value that's available for free from this one. Rather than adding tlists per se to Paths, I've been vaguely toying with a notion of identifying all the interesting subexpressions in a query (expensive functions, aggregates, etc), giving them indexes 1..n, and then marking Paths with bitmapsets showing which interesting subexpressions they can produce values for. This would make things like does this Path compute all the needed aggregates much cheaper to deal with than a raw tlist representation would do. But maybe that's still not the best way. Another point is that a Path that computes aggregates is fundamentally different from a Path that doesn't, because it doesn't even produce the same number of rows. So I'm not at all sure how to visualize the idea of a Path that computes only some aggregates, or whether it's even a sensible thing to worry about supporting. One thing that seems like it might complicate things here is that a lot of planner functions take PlannerInfo *root as an argument. But if we generate only paths in grouping_planner() and path-ify them later, the subquery's root will not be available when we're trying to do the Path - Plan transformation. Ah, you're wrong there, because we hang onto the subquery's root already (I forget why exactly, but see PlannerGlobal.subroots for SubPlans, and RelOptInfo.subroot for subquery-in-FROM). So it would not be a fundamental problem to postpone create_plan() for a subquery. I think grouping_planner() is badly in need of some refactoring just to make it shorter. It's over 1000 lines of code, which IMHO is a fairly ridiculous length for a single function. Amen to that. But as I said to Andrew, I think this will be a side-effect of path-ification in this area, and is probably not something to set out to do first. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Wed, May 13, 2015 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. So, getting back to this part, what's the value of returning a list of Paths rather than a list of Plans? It seems that, in general terms, what we're trying to do is delay performing some of the work, so that we can generate several candidates relatively inexpensively and decide only later which one to turn into a Plan. What's not entirely clear to me is which parts of the work we're trying to delay. For example, target lists are normally computed when paths are converted to plans, but for the higher-level plan nodes adding by grouping_planner, the path list is typically just passed in. So would the new path types be expected to carry target lists of their own, or would they need to figure out the target list on the fly at plan generation time? One thing that seems like it might complicate things here is that a lot of planner functions take PlannerInfo *root as an argument. But if we generate only paths in grouping_planner() and path-ify them later, the subquery's root will not be available when we're trying to do the Path - Plan transformation. Presumably that means anything we need from the PlannerInfo has to be dug out and saved in the path. I think grouping_planner() is badly in need of some refactoring just to make it shorter. It's over 1000 lines of code, which IMHO is a fairly ridiculous length for a single function. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Robert Haas robertmh...@gmail.com writes: On Wed, May 13, 2015 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. The problem I'm really interested in solving is gaining the ability to add additional aggregation strategies, such as letting an FDW do it remotely, or doing it in parallel. It seems to me that your proposed zero-order version of that wouldn't really get us terribly far toward that goal - it would be more oriented towards solving the other problems you mention, specifically adding more intelligence to setops and allowing parameterization of subqueries. Those things certainly have some value, but I think supporting alternate aggregation strategies is a lot more interesting. Clearly we'd like to get to both goals. I don't see the zero order design as something we'd ship or even have in the tree for more than a short time. But it might still be a useful refactorization step. In any case, the key question if we're to have Paths representing higher-level computations is what do we hang our lists of such Paths off of?. If we have say both GROUP BY and LIMIT, it's important to distinguish Paths that purport to do only the grouping step from those that do both the grouping and the limit. For the scan/join part of planning, we do this by attaching the Paths to RelOptInfos that denote various levels of joining ... but how does that translate to the higher level processing steps? Perhaps we could make dummy RelOptInfos that correspond to having completed different steps of processing; but I've not got a clear idea about how many such RelOptInfos we'd need, and in particular not about whether we need to cater for completing those steps in different orders. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Thu, May 14, 2015 at 10:54 AM, Tom Lane t...@sss.pgh.pa.us wrote: In any case, the key question if we're to have Paths representing higher-level computations is what do we hang our lists of such Paths off of?. Yeah, I was wondering about that, too. If we have say both GROUP BY and LIMIT, it's important to distinguish Paths that purport to do only the grouping step from those that do both the grouping and the limit. For the scan/join part of planning, we do this by attaching the Paths to RelOptInfos that denote various levels of joining ... but how does that translate to the higher level processing steps? Perhaps we could make dummy RelOptInfos that correspond to having completed different steps of processing; but I've not got a clear idea about how many such RelOptInfos we'd need, and in particular not about whether we need to cater for completing those steps in different orders. Well, I'm just shooting from the hip here, but it seems to me that the basic pipeline as it exists today is Join - Aggregate - SetOp - Limit - LockRows. I don't think Limit or LockRows can be moved any earlier. SetOps have a lot in common with Aggregates, though, and might be able to commute. For instance, if you had an EXCEPT that was going to knock out most of the rows and also a DISTINCT, you might want to postpone the DISTINCT until after you do the EXCEPT, or you might just want to do both at once: rhaas=# explain select distinct * from (values (1), (1), (2), (3)) x except select 3; QUERY PLAN - HashSetOp Except (cost=0.06..0.17 rows=4 width=3) - Append (cost=0.06..0.16 rows=5 width=3) - Subquery Scan on *SELECT* 1 (cost=0.06..0.14 rows=4 width=4) - HashAggregate (cost=0.06..0.10 rows=4 width=4) Group Key: *VALUES*.column1 - Values Scan on *VALUES* (cost=0.00..0.05 rows=4 width=4) - Subquery Scan on *SELECT* 2 (cost=0.00..0.02 rows=1 width=0) - Result (cost=0.00..0.01 rows=1 width=0) (8 rows) In an ideal world, we'd only hash once. And both set operations and aggregates can sometimes commute with joins, which seems like the stickiest wicket, because there can't be more than a couple of levels of grouping or aggregation in the same subquery (GROUP BY, DISTINCT, UNION?) but there can be lots of joins, and if a particular aggregate can be implemented in lots of places, things start to get complicated. Like, if you've got a SELECT DISTINCT ON (a.x) ... FROM ...lots...-type query, I think you can pretty much slap the DISTINCT on there at any point in the join nest, probably ideally at the point where the number of rows is the lowest it's going to get, or maybe when the sortorder is convenient. For a GROUP BY query with ungrouped dependent columns, you can do the aggregation before or after those joins, but you must do it after the joins that are needed to provide all the values needed for the aggregated columns. If you model this with RelOptInfos, you're basically going to need a RelOptInfo to say i include these relids and these aggregation or setop operations. So in the worst case each agg/setop doubles the number of RelOptInfos, although probably in reality it doesn't because you won't have infinite flexibility to reorder things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Thu, May 14, 2015 at 12:19:44PM -0400, Robert Haas wrote: Well, I'm just shooting from the hip here, but it seems to me that the basic pipeline as it exists today is Join - Aggregate - SetOp - Limit - LockRows. I don't think Limit or LockRows can be moved any earlier. SetOps have a lot in common with Aggregates, though, and might be able to commute. For instance, if you had an EXCEPT that was going to knock out most of the rows and also a DISTINCT, you might want to postpone the DISTINCT until after you do the EXCEPT, or you might just want to do both at once: Also thinking a little from the side: an SQL query is a expression of some tree in relational algebra, and a Path is a representation of a way to acheive some sub-part of it. The planner attempts to try find alternative ways of generating path by reordering joins but AIUI doesn't do much about aggregations. What is essentially being discussed here is also allowing commuting aggregations and joins. DISTINCT and DISTINCT ON are just special kinds of aggregations so don't need to be considered especially. ISTM you should be able to for each aggregation note which joins it commutes with and which it doesn't, perhaps even with a simple bitmap. The backbone of the plan is the order of the aggregations which generally won't commute at all, and the joins which can float around as long as the dependancies (stuff that won't commute) are satisfied. And both set operations and aggregates can sometimes commute with joins, which seems like the stickiest wicket, because there can't be more than a couple of levels of grouping or aggregation in the same subquery (GROUP BY, DISTINCT, UNION?) but there can be lots of joins, and if a particular aggregate can be implemented in lots of places, things start to get complicated. I think this is basically the same idea. I'm not sure aggregates and set operations can commute in general, unless you could somehow (conceptually) describe them as a join/antijoin. UNION might be special here. Like, if you've got a SELECT DISTINCT ON (a.x) ... FROM ...lots...-type query, I think you can pretty much slap the DISTINCT on there at any point in the join nest, probably ideally at the point where the number of rows is the lowest it's going to get, or maybe when the sortorder is convenient. For a GROUP BY query with ungrouped dependent columns, you can do the aggregation before or after those joins, but you must do it after the joins that are needed to provide all the values needed for the aggregated columns. If you model this with RelOptInfos, you're basically going to need a RelOptInfo to say i include these relids and these aggregation or setop operations. So in the worst case each agg/setop doubles the number of RelOptInfos, although probably in reality it doesn't because you won't have infinite flexibility to reorder things. I think it's more like the number of possibilities doubles for each join that could commute with the aggregate. But as you say the number of actual possibilities doesn't actually grow that fast. I think it may be better to have each path track the relids and aggregates it covers, but then you need to have an efficient way to work out which rels/aggregates can be considered for each path. Either sounds it sounds like quite a planner overhaul. Hope this helps, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer signature.asc Description: Digital signature
Re: [HACKERS] upper planner path-ification
On Thu, May 14, 2015 at 7:09 AM, Robert Haas robertmh...@gmail.com wrote: Hi, I've been pulling over Tom's occasional remarks about redoing grouping_planner - and maybe further layers of the planner - to work with Paths instead of Plans. I've had difficulty locating all of the relevant threads. Here's everything I've found so far, which I'm pretty sure is not everything: http://www.postgresql.org/message-id/17400.1311716...@sss.pgh.pa.us http://www.postgresql.org/message-id/2614.1375730...@sss.pgh.pa.us http://www.postgresql.org/message-id/22721.1385048...@sss.pgh.pa.us http://www.postgresql.org/message-id/banlktindjjfhnozesg2j2u4gokqlu69...@mail.gmail.com http://www.postgresql.org/message-id/8479.1418420...@sss.pgh.pa.us I think there are two separate problems here. First, there's the problem that grouping_planner() is complicated. It's doing cost comparisons, but all in ad-hoc fashion rather than using a consistent mechanic the way add_path() does. Generally, we can plan an aggregate using either (1) a hashed aggregate, (2) a sorted aggregate, or (3) for min or max, an index scan that just grabs the highest or lowest value in lieu of a full table scan. Instead of generating a plan for each of these possibilities, we'd like to generate paths for each one, and then pick one to turn into a plan. AIUI, the hope is that this would simplify the cost calculations, and also make it easier to inject other paths, such as a path where an FDW performs the aggregation step on the remote server. Second, there's the problem that we might like to order aggregates with respect to joins. If we have something like SELECT DISTINCT ON (foo.x) foo.x, foo.y, bar.y FROM foo, bar WHERE foo.x = bar.x, then (a) if foo.x has many duplicates, it will be better to DISTINCT-ify foo and then join to bar afterwards but (b) if foo.x = bar.x is highly selective, it will be better to join to bar first and then DISTINCT-ify the result. Currently, aggregation is always last; that's not great. Hitoshi Harada's proposed strategy of essentially figuring out where the aggregation steps can go and then re-planning for each one is also not great, because each join problem will be a subtree of the one we use for the aggregate-last strategy, and thus we're wasting effort by planning the same subtrees multiple times. Instead, we might imagine letting grouping planner fish out the best paths for the joinrels that represent possible aggregation points, generate aggregation paths for each of those, and then work out what additional rels need to be joined afterwards. That sounds hard, but not as hard as doing something sensible with what we have today. Another futuristic avenue for optimization would be commuting Append and Join when joining two similarly partitioned tables. I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] upper planner path-ification
On Thu, May 14, 2015 at 8:24 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Wed, May 13, 2015 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. The problem I'm really interested in solving is gaining the ability to add additional aggregation strategies, such as letting an FDW do it remotely, or doing it in parallel. It seems to me that your proposed zero-order version of that wouldn't really get us terribly far toward that goal - it would be more oriented towards solving the other problems you mention, specifically adding more intelligence to setops and allowing parameterization of subqueries. Those things certainly have some value, but I think supporting alternate aggregation strategies is a lot more interesting. Clearly we'd like to get to both goals. I don't see the zero order design as something we'd ship or even have in the tree for more than a short time. But it might still be a useful refactorization step. In any case, the key question if we're to have Paths representing higher-level computations is what do we hang our lists of such Paths off of?. If we have say both GROUP BY and LIMIT, it's important to distinguish Paths that purport to do only the grouping step from those that do both the grouping and the limit. For the scan/join part of planning, we do this by attaching the Paths to RelOptInfos that denote various levels of joining ... but how does that translate to the higher level processing steps? Perhaps we could make dummy RelOptInfos that correspond to having completed different steps of processing; but I've not got a clear idea about how many such RelOptInfos we'd need, and in particular not about whether we need to cater for completing those steps in different orders. Given that the results of such computations are relations, having a RelOptInfo for each operation looks natural. Although, Sort/Order, which doesn't alter the result-set but just re-orders it, may be an exception here. It can be modeled as a property of some RelOptInfo rather than modelling as a RelOptInfo by itself. Same might be the case of row locking operation. As Robert has mentioned earlier, reordering does open opportunities for optimizations and hence can not be ignored. If we are restructuring the planner, we should restructure to enable such reordering. One possible solution would be to imitate make_one_rel(). make_one_rel() creates possible relations (and paths for each of them) which represent joining order for a given joinlist. Similarly, given a list of operations requested by user, a function would generate relations (and paths for each of them) which represent possible orders of those operations (operations also include JOINs). But that might explode the number of relations (and hence paths) we examine during planning. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [HACKERS] upper planner path-ification
Hello, this topic lured me on.. At Wed, 13 May 2015 23:43:57 -0400, Robert Haas robertmh...@gmail.com wrote in ca+tgmoaqa6bcasgcl8toxwmmoom-d7ebesadz4y58cb+tjq...@mail.gmail.com On Wed, May 13, 2015 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Both of those are problems all right, but there is more context here. Thanks for providing the context. I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. The problem I'm really interested in solving is gaining the ability to add additional aggregation strategies, such as letting an FDW do it remotely, or doing it in parallel. It seems to me that your proposed zero-order version of that wouldn't really get us terribly far toward that goal - it would be more oriented towards solving the other problems you mention, specifically adding more intelligence to setops and allowing parameterization of subqueries. Those things certainly have some value, but I think supporting alternate aggregation strategies is a lot more interesting. I don't know which you are planning the interface between the split layers to be path or plan, I guess simply splitting grouping_planner results in the latter. Although Tom's path representation up to higher layer does not seem to clash with, or have anything to do directly with your layer splitting, complicating the existing complication should make it more difficult to refactor the planner to have more intelligence. For example, Limit could be pushed down into bottom of a path tree ideally and would be also effective for remote node greately like remote aggregations would be. The zero-order version would eliminate subqery scan (or frozen plan) path or such nodes (if any) which obstruct possible global optimization. But simple splitting of the current grouping_planner looks make it more difficult. It is so frustrating not to be able to express my thought well but IMHO, shortly, I wish the planner not to grow to be in such form. Certainly it seems not so easy to get over the zero-order version, but.. regards, -- Kyotaro Horiguchi NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] upper planner path-ification
Hi, I've been pulling over Tom's occasional remarks about redoing grouping_planner - and maybe further layers of the planner - to work with Paths instead of Plans. I've had difficulty locating all of the relevant threads. Here's everything I've found so far, which I'm pretty sure is not everything: http://www.postgresql.org/message-id/17400.1311716...@sss.pgh.pa.us http://www.postgresql.org/message-id/2614.1375730...@sss.pgh.pa.us http://www.postgresql.org/message-id/22721.1385048...@sss.pgh.pa.us http://www.postgresql.org/message-id/banlktindjjfhnozesg2j2u4gokqlu69...@mail.gmail.com http://www.postgresql.org/message-id/8479.1418420...@sss.pgh.pa.us I think there are two separate problems here. First, there's the problem that grouping_planner() is complicated. It's doing cost comparisons, but all in ad-hoc fashion rather than using a consistent mechanic the way add_path() does. Generally, we can plan an aggregate using either (1) a hashed aggregate, (2) a sorted aggregate, or (3) for min or max, an index scan that just grabs the highest or lowest value in lieu of a full table scan. Instead of generating a plan for each of these possibilities, we'd like to generate paths for each one, and then pick one to turn into a plan. AIUI, the hope is that this would simplify the cost calculations, and also make it easier to inject other paths, such as a path where an FDW performs the aggregation step on the remote server. Second, there's the problem that we might like to order aggregates with respect to joins. If we have something like SELECT DISTINCT ON (foo.x) foo.x, foo.y, bar.y FROM foo, bar WHERE foo.x = bar.x, then (a) if foo.x has many duplicates, it will be better to DISTINCT-ify foo and then join to bar afterwards but (b) if foo.x = bar.x is highly selective, it will be better to join to bar first and then DISTINCT-ify the result. Currently, aggregation is always last; that's not great. Hitoshi Harada's proposed strategy of essentially figuring out where the aggregation steps can go and then re-planning for each one is also not great, because each join problem will be a subtree of the one we use for the aggregate-last strategy, and thus we're wasting effort by planning the same subtrees multiple times. Instead, we might imagine letting grouping planner fish out the best paths for the joinrels that represent possible aggregation points, generate aggregation paths for each of those, and then work out what additional rels need to be joined afterwards. That sounds hard, but not as hard as doing something sensible with what we have today. I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
On Wed, May 13, 2015 at 10:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Both of those are problems all right, but there is more context here. Thanks for providing the context. I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. The problem I'm really interested in solving is gaining the ability to add additional aggregation strategies, such as letting an FDW do it remotely, or doing it in parallel. It seems to me that your proposed zero-order version of that wouldn't really get us terribly far toward that goal - it would be more oriented towards solving the other problems you mention, specifically adding more intelligence to setops and allowing parameterization of subqueries. Those things certainly have some value, but I think supporting alternate aggregation strategies is a lot more interesting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] upper planner path-ification
Robert Haas robertmh...@gmail.com writes: I've been pulling over Tom's occasional remarks about redoing grouping_planner - and maybe further layers of the planner - to work with Paths instead of Plans. ... I think there are two separate problems here. First, there's the problem that grouping_planner() is complicated. Second, there's the problem that we might like to order aggregates with respect to joins. Both of those are problems all right, but there is more context here. * As some of the messages you cited mention, we would like to have Path representations for things like aggregation, because that's the only way we'll get to a sane API that lets FDWs propose remote aggregation. * We have also had requests for the planner to be smarter about UNION/INTERSECT/EXCEPT queries. Again, that requires cost comparisons, which would be better done if we had Path representations for the various ways we'd want to consider. Also, a big part of the issue there is wanting to be able to consider sorted versus unsorted plans for the leaf queries of the set-op (IOW, optionally pushing the sort requirements of the set-op down into the leaves). Right now, such comparisons are impossible because prepunion.c uses subquery_planner to handle the leaf queries, and what it gets back from that is one finished plan, not alternative Paths. * Likewise, subqueries-in-FROM are handled by recursing to subquery_planner, which gives us back just one frozen Plan for the subquery. Among other things this seems to make it too expensive to consider generating parameterized paths for the subquery. I'd like to keep subquery plans in Path form until much later as well. So these considerations motivate wishing that the result of subquery_planner could be a list of alternative Paths rather than a Plan, which means that every operation it knows how to tack onto the scan/join plan has to be representable by a Path of some sort. I don't know how granular that needs to be, though. For instance, one could certainly imagine that it might be sufficient initially to have a single WindowPath that represents do all the window functions, and then at create_plan time we'd generate multiple WindowAgg plan nodes in the same ad-hoc way as now. Breaking that down in the Path representation would only become interesting if it would affect higher-level decisions, and I'm not immediately seeing how it might do that. I'm inclined to think that it would be useful to solve the first problem even if we didn't solve the second one right away (but that might be wrong). As a preparatory step, I'm thinking it would be sensible to split grouping_planner() into an outer function that would handle the addition of Limit and LockRows nodes and maybe planning of set operations, and an inner function that would handle GROUP BY, DISTINCT, and possibly window function planning. For the reasons I mentioned, I'd like to get to a point where subquery_planner's output is Paths not Plans as soon as possible. But the idea of coarse representation of steps that we aren't trying to be smart about might be useful to save some labor in the short run. The zero-order version of that might be a single Path node type that represents do whatever grouping_planner would do, which we'd start to break down into multiple node types once we had the other APIs fixed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers