On Mon, Nov 20, 2023 at 03:51:33PM -0500, Robert Haas wrote: > On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp > <fujii.y...@df.mitsubishielectric.co.jp> wrote: > > In postgres_fdw.sql, I have corrected the output format for floating point > > numbers > > by extra_float_digits. > > Looking at this, I find that it's not at all clear to me how the > partial aggregate function is defined. Let's look at what we have for > documentation: > > + <para> > + Paraemter <literal>AGGPARTIALFUNC</literal> optionally defines a > + partial aggregate function used for partial aggregate pushdown; see > + <xref linkend="xaggr-partial-aggregates"/> for details. > + </para> > > + Partial aggregate function (zero if none). > + See <xref linkend="partial-aggregate-pushdown"/> for the definition > + of partial aggregate function. > > + Partial aggregate pushdown is an optimization for queries that contains > + aggregate expressions for a partitioned table across one or more remote > + servers. If multiple conditions are met, partial aggregate function > > + When partial aggregate pushdown is used for aggregate expressions, > + remote queries replace aggregate function calls with partial > + aggregate function calls. If the data type of the state value is not > > But there's no definition of what the behavior of the function is > anywhere that I can see, not even in <sect2 > id="partial-aggregate-pushdown">. Everywhere it only describes how the > partial aggregate function is used, not what it is supposed to do.
Yes, I had to figure that out myself, and I was wondering how much detail to have in our docs vs README files vs. C comments. I think we should put more details somewhere. > Looking at the changes in pg_aggregate.dat, it seems like the partial > aggregate function is a second aggregate defined in a way that mostly > matches the original, except that (1) if the original final function > would have returned a data type other than internal, then the final > function is removed; and (2) if the original final function would have > returned a value of internal type, then the final function is the > serialization function of the original aggregate. I think that's a > reasonable definition, but the documentation and code comments need to > be a lot clearer. Agreed. I wasn't sure enough about this to add it when I was reviewing the patch. > I do have a concern about this, though. It adds a lot of bloat. It > adds a whole lot of additional entries to pg_aggregate, and every new > aggregate we add in the future will require a bonus entry for this, > and it needs a bunch of new pg_proc entries as well. One idea that > I've had in the past is to instead introduce syntax that just does > this, without requiring a separate aggregate definition in each case. > For example, maybe instead of changing string_agg(whatever) to > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or > something. Then all aggregates could be treated in a generic way. I'm > not completely sure that's better, but I think it's worth considering. So use an SQL keyword to indicates a pushdown call? We could then automate the behavior rather than requiring special catalog functions? > I think that the control mechanism needs some thought. Right now, > there are two possible behaviors: either we assume that the local and > remote sides are the same unconditionally, or we assume that they're > the same if the remote side is a new enough version. I do like having > those behaviors available, but I wonder if we need to do something > better or different. What if somebody wants to push down a > non-built-in aggregate, for example? I realize that we don't have It does allow specification of extensions that can be pushed down. > great solutions to the problem of knowing which functions are > push-downable in general, and I don't know that partial aggregation > needs to be any better than anything else, but it's probably worth > comparing and contrasting the approach we take here with the > approaches we've taken in other, similar cases. From that point of > view, I think check_partial_aggregate_support is a novelty: we don't > do those kinds of checks in other cases, AFAIK. But on the other hand, > there is the 'extensions' argument to postgres_fdw. Right. I am not sure how to improve what the patch does. > I don't think the patch does a good job explaining why HAVING, > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > shouldn't really be a problem, because HAVING is basically a WHERE > clause that occurs after aggregation is complete, and whether or not > the aggregation is safe shouldn't depend on what we're going to do > with the value afterward. The HAVING clause can't necessarily be > pushed to the remote side, but I don't see how or why it could make > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > little trickier: if we pushed down DISTINCT, we'd still have to > re-DISTINCT-ify when combining locally, and if we pushed down ORDER > BY, we'd have to do a merge pass to combine the returned values unless > we could prove that the partitions were non-overlapping ranges that > would be visited in the correct order. Although that all sounds > doable, I think it's probably a good thing that the current patch > doesn't try to handle it -- this is complicated already. But it should > explain why it's not handling it and maybe even a bit about how it > could be handling in the future, rather than just saying "well, this > kind of thing is not safe." The trouble with that explanation is that > it does nothing to help the reader understand whether the thing in > question is *fundamentally* unsafe or whether we just don't have the > right code to make it work. Makes sense. > Typo: Paraemter > > I'm so sorry to keep complaining about comments, but I think the > comments in src/backend/optimizer are very far from being adequate. > They are strictly formulaic and don't really explain anything. For > example, I see that the patch adds a partial_target to > GroupPathExtraData, but how do I understand the reason why we now need > a second pathtarget beside the one that already exists? Certainly not > from the comments in setGroupClausePartial, because there basically > aren't any. True, there's a header comment, but it just says we > generate this thing, not WHY we generate this thing. There's nothing > meaningful to be found in src/include/nodes/pathnodes.h about why > we're doing this, either. > > And this problem really extends throughout the patch: comments are > mostly short and just describe what the code does, not WHY it does > that. And the WHY is really the important part. Otherwise we will not > be able to maintain this code going forward. Understood. I wish I knew enough to add them myself. I can help if someone can supply the details. -- Bruce Momjian <br...@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.