Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: On Fri, Sep 18, 2009 at 3:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: clause_sides_match_join? Yes, that's perfect. Going once ... going twice ... sold. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: On Thu, Sep 17, 2009 at 4:59 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: You're the committer; I'm not. But I completely disagree. There isn't any reason at all to duplicate this logic in two separate places, let alone three. I'd actually be in favor of merging the existing two cases even if we weren't adding join removal. No, I still think this was a bad idea. There are *parts* of those tests that are similar, but combining them all into one function is just a recipe for bugs. Having read your commit, it makes more sense to me. The fact that we're now looking at innerrel-baserestrictinfo also is a pretty powerful argument for your way. Looking at it some more, I think that there is some value in factoring out the tests to see if the clause has the right variable membership, so I did that. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Fri, Sep 18, 2009 at 1:26 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Thu, Sep 17, 2009 at 4:59 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: You're the committer; I'm not. But I completely disagree. There isn't any reason at all to duplicate this logic in two separate places, let alone three. I'd actually be in favor of merging the existing two cases even if we weren't adding join removal. No, I still think this was a bad idea. There are *parts* of those tests that are similar, but combining them all into one function is just a recipe for bugs. Having read your commit, it makes more sense to me. The fact that we're now looking at innerrel-baserestrictinfo also is a pretty powerful argument for your way. Looking at it some more, I think that there is some value in factoring out the tests to see if the clause has the right variable membership, so I did that. Mmm, I like that. Putting that bunch of hairy logic in a subroutine instead of repeating it in several places definitely seems better. I don't really like the name clause_matches_join, though. It's more like clause has well-defined sides, and mark which is which as a side-effect. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: Mmm, I like that. Putting that bunch of hairy logic in a subroutine instead of repeating it in several places definitely seems better. I don't really like the name clause_matches_join, though. It's more like clause has well-defined sides, and mark which is which as a side-effect. It was the first thing that came to mind ... got a better idea? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Fri, Sep 18, 2009 at 1:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Mmm, I like that. Putting that bunch of hairy logic in a subroutine instead of repeating it in several places definitely seems better. I don't really like the name clause_matches_join, though. It's more like clause has well-defined sides, and mark which is which as a side-effect. It was the first thing that came to mind ... got a better idea? clause_has_well_defined_sides()? ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: On Fri, Sep 18, 2009 at 1:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Mmm, I like that. Putting that bunch of hairy logic in a subroutine instead of repeating it in several places definitely seems better. I don't really like the name clause_matches_join, though. It was the first thing that came to mind ... got a better idea? clause_has_well_defined_sides()? Nah ... they're well defined in any case, they might just not be what we need for the current join. As an example, (a.f1 + b.f2) = c.f3 would be usable if joining {A B} to {C}, but not when joining {A} to {B C}. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Fri, Sep 18, 2009 at 3:06 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Sep 18, 2009 at 1:58 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Mmm, I like that. Putting that bunch of hairy logic in a subroutine instead of repeating it in several places definitely seems better. I don't really like the name clause_matches_join, though. It was the first thing that came to mind ... got a better idea? clause_has_well_defined_sides()? Nah ... they're well defined in any case, they might just not be what we need for the current join. As an example, (a.f1 + b.f2) = c.f3 would be usable if joining {A B} to {C}, but not when joining {A} to {B C}. The clauses are well-defined, but they don't have well-defined sides. I see now what you're going for with clause_matches_join, but matches is a pretty broad term, IMO. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: The clauses are well-defined, but they don't have well-defined sides. I see now what you're going for with clause_matches_join, but matches is a pretty broad term, IMO. clause_sides_match_join? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Fri, Sep 18, 2009 at 3:52 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: The clauses are well-defined, but they don't have well-defined sides. I see now what you're going for with clause_matches_join, but matches is a pretty broad term, IMO. clause_sides_match_join? Yes, that's perfect. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: On Tue, Sep 15, 2009 at 9:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Here we go again. Following Tom's advice to not insert crocks in add_path(), I've instead introduced a noopjoin path type which ignores its inner side. I've committed this with some revisions. As to the points discussed earlier: * I'm not really happy with the NoopJoinPath representation. ... A possible compromise is to use T_SubqueryScan as the pathtype, with the idea that we're pretending a SubqueryScan is to be inserted and then immediately optimized away. But I don't want the inner path in there in any case. I don't have strong feelings about it. I thought about making just a Noop path type, but I couldn't see any clear reason to prefer it one way or the other. I ended up using a NoOpPath struct with pathtype = T_Join, which is a dummy superclass nodetag that never really gets instantiated. The SubqueryScan idea looked too messy after I remembered that createplan.c likes to switch() on the pathtype, so having two different pathtype uses of T_SubqueryScan seemed pretty ugly. But T_Join isn't really used anywhere, despite being nominally a executable plan type. It's pretty much a judgment call whether that's really cleaner than using the path's own NodeTag... * I'm quite unimpressed with the refactoring you did to make a single function deal with merge clauses, hash clauses, *and* join removal clauses. I think that makes it close to unmaintainable and is buying little if anything speedwise. You're the committer; I'm not. But I completely disagree. There isn't any reason at all to duplicate this logic in two separate places, let alone three. I'd actually be in favor of merging the existing two cases even if we weren't adding join removal. No, I still think this was a bad idea. There are *parts* of those tests that are similar, but combining them all into one function is just a recipe for bugs. As for a GUC, I think it would be useful to have for debugging, or in case of bugs. It's really another join strategy, and we have enable_* flags for all the others. But if you don't want it for some reason, I'm not in a position to twist your arm. I left it out for the moment. If anyone can point to a case where the join removal logic slows planning noticeably without gaining anything, we can reconsider. * Not entirely sure where to put the code that does the hard work (relation_is_distinct_for). Yeah, it seemed a little weird to me. For a while I was reusing some of the support functions that query_is_distinct_for() calls, but the final version doesn't. I wonder if it should just be moved to joinpath.c; it seems to fit in better with what is going on there. I ended up putting the index-specific logic in indxpath.c, which aside from seeming reasonable made it easier to deal with expression indexes. If we ever add another test method that doesn't depend on indexes, we can put it in a reasonable place and have joinpath.c call that too. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Thu, Sep 17, 2009 at 4:59 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Tue, Sep 15, 2009 at 9:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Here we go again. Following Tom's advice to not insert crocks in add_path(), I've instead introduced a noopjoin path type which ignores its inner side. I've committed this with some revisions. As to the points discussed earlier: Thanks, it looks nice. I wonder if we shouldn't add some documentation or regression tests, though - any thoughts? * I'm not really happy with the NoopJoinPath representation. ... A possible compromise is to use T_SubqueryScan as the pathtype, with the idea that we're pretending a SubqueryScan is to be inserted and then immediately optimized away. But I don't want the inner path in there in any case. I don't have strong feelings about it. I thought about making just a Noop path type, but I couldn't see any clear reason to prefer it one way or the other. I ended up using a NoOpPath struct with pathtype = T_Join, which is a dummy superclass nodetag that never really gets instantiated. The SubqueryScan idea looked too messy after I remembered that createplan.c likes to switch() on the pathtype, so having two different pathtype uses of T_SubqueryScan seemed pretty ugly. But T_Join isn't really used anywhere, despite being nominally a executable plan type. It's pretty much a judgment call whether that's really cleaner than using the path's own NodeTag... Yeah. I haven't been able to wrap my head around why it's good to have two tags on some of these things. * I'm quite unimpressed with the refactoring you did to make a single function deal with merge clauses, hash clauses, *and* join removal clauses. I think that makes it close to unmaintainable and is buying little if anything speedwise. You're the committer; I'm not. But I completely disagree. There isn't any reason at all to duplicate this logic in two separate places, let alone three. I'd actually be in favor of merging the existing two cases even if we weren't adding join removal. No, I still think this was a bad idea. There are *parts* of those tests that are similar, but combining them all into one function is just a recipe for bugs. Having read your commit, it makes more sense to me. The fact that we're now looking at innerrel-baserestrictinfo also is a pretty powerful argument for your way. As for a GUC, I think it would be useful to have for debugging, or in case of bugs. It's really another join strategy, and we have enable_* flags for all the others. But if you don't want it for some reason, I'm not in a position to twist your arm. I left it out for the moment. If anyone can point to a case where the join removal logic slows planning noticeably without gaining anything, we can reconsider. Well, the purpose of such flags is for debugging, not production. The only argument I can see for not having one for join removal versus nestloop, mergejoin, etc. is that it's pretty easy to kill removal of a particular join by tweaking the output list, whereas it is not comparably easy to get the planner to pick a different join type otherwise. * Not entirely sure where to put the code that does the hard work (relation_is_distinct_for). Yeah, it seemed a little weird to me. For a while I was reusing some of the support functions that query_is_distinct_for() calls, but the final version doesn't. I wonder if it should just be moved to joinpath.c; it seems to fit in better with what is going on there. I ended up putting the index-specific logic in indxpath.c, which aside from seeming reasonable made it easier to deal with expression indexes. If we ever add another test method that doesn't depend on indexes, we can put it in a reasonable place and have joinpath.c call that too. Sounds fine to me. I think that the next project in this area should be to try to support removal of INNER joins. (Removal of SEMI, ANTI, and FULL joins seems unlikely ever to be interesting.) That will require teaching the planner about foreign key constraints, which interestingly opens up some other interesting optimization opportunities. An INNER join that can never match zero rows can alternatively be implemented as a LEFT join, which can be reordered in some situations where an inner join can't. e.g. A LJ (B IJ C ON Pbc) ON Pab can be implemented as (A LJ B ON Pab) LJ/IJ C ON Pbc if it is the case that for every row in B, there will be at least one row in C such that Pbc holds. Thanks again for fixing this up, and committing it. I have a TON of queries that will benefit from this, and it's one of the reasons why I started reading this mailing list on a regular basis and getting involved in development. For me, it's worth an upgrade by itself. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your
Re: [HACKERS] updated join removal patch
Robert Haas robertmh...@gmail.com writes: Here we go again. Following Tom's advice to not insert crocks in add_path(), I've instead introduced a noopjoin path type which ignores its inner side. This could possibly be simplified to just a noop path that doesn't even include an outer side, but the way I've done it here doesn't really cost anything and might allow debug pprints to print more useful information. If, in the future, we have some other use for a noop path, we might want to reconsider, though. I took a closer look at this version of the patch. Some comments: * I'm not really happy with the NoopJoinPath representation. In the first place, it isn't a join and should not be carrying a useless inner path. The other nasty thing about it is that it violates the rule that path.pathtype is supposed to be the nodetag for the executable plan node that corresponds to the path. My original thought about this was to just use the left input's paths as-is. I am not totally sure that that works --- up to now, all the members of a reloptinfo's path list have had that same reloptinfo as their parent --- but it seems less ugly than this. A possible compromise is to use T_SubqueryScan as the pathtype, with the idea that we're pretending a SubqueryScan is to be inserted and then immediately optimized away. But I don't want the inner path in there in any case. * Speaking of the left input, it's not good enough to copy up only the cheapest startup and cheapest total paths. If there are any other paths there at all, they're probably there because they have interesting sort orders. With the patch as-is, it'd be possible for join removal to be a net loss because it forces an extra sort at an upper level. Now it's possible that some of those sort orderings are only interesting to use for mergejoining in the same join that we're removing; in which case they could safely be discarded. But I'm not sure that's worth checking for, and I am sure that throwing everything away shouldn't be the base behavior. * I'm quite unimpressed with the refactoring you did to make a single function deal with merge clauses, hash clauses, *and* join removal clauses. I think that makes it close to unmaintainable and is buying little if anything speedwise. We can afford another iteration over the join clauses, especially if there's a GUC to let people turn it off. (BTW, do we really need that GUC, or is it only there for testing the patch? I'm leaning towards not having it.) * I'm not sure about this, because surely you would have tested it, but isn't it looking at the wrong side of the join clauses? I thought the idea is to prove the nullable (inner) side of the join unique. * Not entirely sure where to put the code that does the hard work (relation_is_distinct_for). I see you put it in pathnode.c beside query_is_distinct_for, but the *only* reason the latter is where it is is that it has only one caller, namely create_unique_path, which is (and belongs) in that module. Opening up pathnode's API to include a function unrelated to its purpose doesn't seem right. So I'm inclined to think they should both go someplace else, just not sure where. There doesn't seem to be any planner file whose charter is to export this type of knowledge; maybe we need to add one. * It shouldn't be that hard to support expression indexes. I think the only reason you can't do it right now is that you chose an intermediate data structure that presumes simple Vars ... but if you refactor to have bespoke code examining the indexes and joinclauses together, I think it would work all right. * I wonder whether all the relevant clauses are really to be found as join clauses. Consider tab1 left join tab2 on (tab1.a = tab2.x and tab2.y = 42) where tab2 has a unique index on (x,y). This would at least suggest examining the inner rel's baserestrictclauses along with the current join's clauses. * I wouldn't really bother with cost_noopjoin. It does not, and never will, encapsulate any interesting knowledge. In other places where we have similar issues (eg no-op UniquePaths in create_unique_path), we just copy up the child path's costs without any folderol. Do you want to have another go at this, or shall I? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Tue, Sep 15, 2009 at 9:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Here we go again. Following Tom's advice to not insert crocks in add_path(), I've instead introduced a noopjoin path type which ignores its inner side. This could possibly be simplified to just a noop path that doesn't even include an outer side, but the way I've done it here doesn't really cost anything and might allow debug pprints to print more useful information. If, in the future, we have some other use for a noop path, we might want to reconsider, though. I took a closer look at this version of the patch. Some comments: * I'm not really happy with the NoopJoinPath representation. In the first place, it isn't a join and should not be carrying a useless inner path. The other nasty thing about it is that it violates the rule that path.pathtype is supposed to be the nodetag for the executable plan node that corresponds to the path. My original thought about this was to just use the left input's paths as-is. I am not totally sure that that works --- up to now, all the members of a reloptinfo's path list have had that same reloptinfo as their parent --- but it seems less ugly than this. A possible compromise is to use T_SubqueryScan as the pathtype, with the idea that we're pretending a SubqueryScan is to be inserted and then immediately optimized away. But I don't want the inner path in there in any case. I don't have strong feelings about it. I thought about making just a Noop path type, but I couldn't see any clear reason to prefer it one way or the other. The only reason we need a path type at all is because you didn't like the crocks I inserted into add_path() to avoid pfree()-ing paths that might still be pointed to from elsewhere. But those crocks were VASTLY simpler than this, which feels completely Rube Goldberg-esque to me. If it were up to me, we'd be either reinserting those crocks, or looking for some less crock-y variant of them. * Speaking of the left input, it's not good enough to copy up only the cheapest startup and cheapest total paths. If there are any other paths there at all, they're probably there because they have interesting sort orders. With the patch as-is, it'd be possible for join removal to be a net loss because it forces an extra sort at an upper level. Now it's possible that some of those sort orderings are only interesting to use for mergejoining in the same join that we're removing; in which case they could safely be discarded. But I'm not sure that's worth checking for, and I am sure that throwing everything away shouldn't be the base behavior. Good point. * I'm quite unimpressed with the refactoring you did to make a single function deal with merge clauses, hash clauses, *and* join removal clauses. I think that makes it close to unmaintainable and is buying little if anything speedwise. We can afford another iteration over the join clauses, especially if there's a GUC to let people turn it off. (BTW, do we really need that GUC, or is it only there for testing the patch? I'm leaning towards not having it.) You're the committer; I'm not. But I completely disagree. There isn't any reason at all to duplicate this logic in two separate places, let alone three. I'd actually be in favor of merging the existing two cases even if we weren't adding join removal. The existing code does the identical tests in hash_inner_and_outer() in a slightly different order than select_mergejoin_clauses() for no reason at all. As for a GUC, I think it would be useful to have for debugging, or in case of bugs. It's really another join strategy, and we have enable_* flags for all the others. But if you don't want it for some reason, I'm not in a position to twist your arm. * I'm not sure about this, because surely you would have tested it, but isn't it looking at the wrong side of the join clauses? I thought the idea is to prove the nullable (inner) side of the join unique. Grr. I think it's more broken than that. Wow, this is really embarassing. * Not entirely sure where to put the code that does the hard work (relation_is_distinct_for). I see you put it in pathnode.c beside query_is_distinct_for, but the *only* reason the latter is where it is is that it has only one caller, namely create_unique_path, which is (and belongs) in that module. Opening up pathnode's API to include a function unrelated to its purpose doesn't seem right. So I'm inclined to think they should both go someplace else, just not sure where. There doesn't seem to be any planner file whose charter is to export this type of knowledge; maybe we need to add one. Yeah, it seemed a little weird to me. For a while I was reusing some of the support functions that query_is_distinct_for() calls, but the final version doesn't. I wonder if it should just be moved to joinpath.c; it seems to fit in better with what is going on there.
Re: [HACKERS] updated join removal patch
On Tue, Sep 15, 2009 at 10:10 PM, Robert Haas robertmh...@gmail.com wrote: * I'm not sure about this, because surely you would have tested it, but isn't it looking at the wrong side of the join clauses? I thought the idea is to prove the nullable (inner) side of the join unique. Grr. I think it's more broken than that. Wow, this is really embarassing. Well, you're definitely right that it's looking at the wrong side of the join clauses. Still trying to figure out if there is another bug, too. ...Robert -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] updated join removal patch
On Tue, Sep 15, 2009 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Sep 15, 2009 at 10:10 PM, Robert Haas robertmh...@gmail.com wrote: * I'm not sure about this, because surely you would have tested it, but isn't it looking at the wrong side of the join clauses? I thought the idea is to prove the nullable (inner) side of the join unique. Grr. I think it's more broken than that. Wow, this is really embarassing. Well, you're definitely right that it's looking at the wrong side of the join clauses. Still trying to figure out if there is another bug, too. It looks to me like relation_is_distinct_for() is also horribly broken in my previous version. I think the attached is how it is supposed to work. ...Robert join_removal.2009-09-15.fixes Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers