>>>>> "Andres" == Andres Freund <and...@anarazel.de> writes:
Andres> This is not a real review. I'm just scanning through the
Andres> patch, without reading the thread, to understand if I see
Andres> something "worthy" of controversy. While scanning I might have
Andres> a couple observations or questions.

>> + * A list of grouping sets which is structurally equivalent to a ROLLUP
>> + * clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass
>> + * over ordered data. We do this by keeping a separate set of transition
>> + * values for each grouping set being concurrently processed; for each
>> + * input tuple we update them all, and on group boundaries we reset some
>> + * initial subset of the states (the list of grouping sets is ordered
>> + * from most specific to least specific). One AGG_SORTED node thus
>> + * handles any number of grouping sets as long as they share a sort
>> + * order.

Andres> Found "initial subset" not very clear, even if I probably
Andres> guessed the right meaning.

How about:

 * [...], and on group boundaries we reset those states
 * (starting at the front of the list) whose grouping values have
 * changed (the list of grouping sets is ordered from most specific to
 * least specific). One AGG_SORTED node thus handles any number [...]
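To make that concrete, here is a toy, self-contained illustration of
the single-pass strategy (exposition only, not code from the patch:
the Row struct and the helper names are invented, and a real plan
would emit the columns outside each grouping set as NULL):

    /*
     * Grouping sets (a,b,c), (a,b), (a) computed in one pass over
     * input sorted by (a,b,c), keeping one running sum per set.
     */
    #include <stdio.h>

    typedef struct { int a, b, c, x; } Row;

    #define NUMSETS 3   /* set 0: (a,b,c), set 1: (a,b), set 2: (a) */

    /*
     * How many sets saw their grouping columns change? Because the
     * sets run from most to least specific, the changed ones always
     * form a prefix of the list: if (a,b) changed, so did (a,b,c).
     */
    static int
    count_changed_sets(const Row *prev, const Row *cur)
    {
        if (prev->a != cur->a) return 3;
        if (prev->b != cur->b) return 2;
        if (prev->c != cur->c) return 1;
        return 0;
    }

    static void
    emit(int setno, const Row *grp, long sum)
    {
        static const char *name[] = { "(a,b,c)", "(a,b)", "(a)" };

        printf("set %-7s a=%d b=%d c=%d sum=%ld\n",
               name[setno], grp->a, grp->b, grp->c, sum);
    }

    int
    main(void)
    {
        Row  rows[] = {
            {1, 1, 1, 10}, {1, 1, 2, 20}, {1, 2, 1, 30}, {2, 1, 1, 40}
        };
        int  nrows = sizeof(rows) / sizeof(rows[0]);
        long sum[NUMSETS] = {0, 0, 0};  /* one transition value per set */
        int  i, setno;

        for (i = 0; i < nrows; i++)
        {
            if (i > 0)
            {
                /* reset exactly the prefix of states whose groups ended */
                int nchanged = count_changed_sets(&rows[i - 1], &rows[i]);

                for (setno = 0; setno < nchanged; setno++)
                {
                    emit(setno, &rows[i - 1], sum[setno]);
                    sum[setno] = 0;
                }
            }

            /* every input row advances all of the states */
            for (setno = 0; setno < NUMSETS; setno++)
                sum[setno] += rows[i].x;
        }

        /* end of input is a group boundary for every set */
        for (setno = 0; setno < NUMSETS; setno++)
            emit(setno, &rows[nrows - 1], sum[setno]);

        return 0;
    }

Running it prints each group's sum as its boundary passes,
most-specific sets first.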
>> + * To handle multiple grouping sets that _don't_ share a sort order, we
>> + * use a different strategy. An AGG_CHAINED node receives rows in sorted
>> + * order and returns them unchanged, but computes transition values for
>> + * its own list of grouping sets. At group boundaries, rather than
>> + * returning the aggregated row (which is incompatible with the input
>> + * rows), it writes it to a side-channel in the form of a tuplestore.
>> + * Thus, a number of AGG_CHAINED nodes are associated with a single
>> + * AGG_SORTED node (the "chain head"), which creates the side channel
>> + * and, when it has returned all of its own data, returns the tuples
>> + * from the tuplestore to its own caller.

Andres> This paragraph deserves to be expanded imo.

OK, but what in particular needs clarifying?

>> + * In order to avoid excess memory consumption from a chain of
>> + * alternating Sort and AGG_CHAINED nodes, we reset each child Sort node
>> + * preemptively, allowing us to cap the memory usage for all the sorts
>> + * in the chain at twice the usage for a single node.

Andres> What does resetting 'preemptively' mean?

Plan nodes are normally not reset (in the sense of calling ExecReScan)
just because they finished; rather, it's done before a subsequent new
scan is started. Doing the rescan call after all the sorted output has
been read means we discard the data from each sort node as soon as it
is transferred to the next one. There is a more specific comment in
agg_retrieve_chained where this actually happens.

>> + * From the perspective of aggregate transition and final functions, the
>> + * only issue regarding grouping sets is this: a single call site
>> + * (flinfo) of an aggregate function may be used for updating several
>> + * different transition values in turn. So the function must not cache
>> + * in the flinfo anything which logically belongs as part of the
>> + * transition value (most importantly, the memory context in which the
>> + * transition value exists). The support API functions
>> + * (AggCheckCallContext, AggRegisterCallback) are sensitive to the
>> + * grouping set for which the aggregate function is currently being
>> + * called.

Andres> Hm. I've seen a bunch of aggregates do this.

Such as? This was discussed about a year ago in the context of WITHIN
GROUP:
http://www.postgresql.org/message-id/87r424i24w....@news-spur.riddles.org.uk
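For illustration, here's the pattern a grouping-sets-safe transition
function needs to follow (a hedged sketch, not code from the patch:
my_trans and my_state are invented, though AggCheckCallContext and the
fmgr macros are the real support API). The key point is to look up the
aggregate context on every call instead of caching it:

    #include "postgres.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    /* hypothetical transition state */
    typedef struct my_state
    {
        int64       count;
    } my_state;

    PG_FUNCTION_INFO_V1(my_trans);

    Datum
    my_trans(PG_FUNCTION_ARGS)
    {
        MemoryContext aggcontext;
        my_state   *state;

        /* Ask for the aggregate context on *every* call ... */
        if (!AggCheckCallContext(fcinfo, &aggcontext))
            elog(ERROR, "my_trans called in non-aggregate context");

        /*
         * ... rather than stashing it in flinfo->fn_extra on the first
         * call. Under grouping sets the same flinfo is used to advance
         * several transition values in turn, each living in its own
         * context, so anything cached in the flinfo that logically
         * belongs to one transition value is wrong for the others.
         */
        if (PG_ARGISNULL(0))
            state = (my_state *) MemoryContextAllocZero(aggcontext,
                                                        sizeof(my_state));
        else
            state = (my_state *) PG_GETARG_POINTER(0);

        state->count++;

        PG_RETURN_POINTER(state);
    }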
>> + * TODO: AGG_HASHED doesn't support multiple grouping sets yet.

Andres> Are you intending to resolve this before an eventual commit?

Original plan was to tackle AGG_HASHED after a working implementation
was committed; we figured that we'd have two commitfests to get the
basics right, and then have a chance to get AGG_HASHED done for the
third one. Also, there was talk of other people working on hashagg
memory usage issues, and we didn't want to conflict with that.

Naturally the extended delays rather put paid to that plan. Going
ahead and writing code for AGG_HASHED anyway wasn't really an option,
since with the overall structural questions unresolved there was too
much chance of it being wasted effort.

Andres> Possibly after the 'structural' issues are resolved? Or do you
Andres> think this can safely be put off for another release?

I think the feature is useful even without AGG_HASHED, even though
that means it can sometimes be beaten on performance by using UNION
ALL of many separate GROUP BYs; but I'd defer to the opinions of
others on that point.

Andres> Maybe it's just me, but I get twitchy if I see a default being
Andres> used like this. I'd much, much rather see the two remaining
Andres> AGG_* types and get a warning from the compiler if a new one
Andres> is added.

Meh. It also needs a bogus initialization of "result" to avoid
compiler warnings if done that way.

Andres> I'll bet this will look absolutely horrid after a pgindent
Andres> run :/

pgindent doesn't touch it, I just checked.

[making CUBE and ROLLUP work without being reserved]

Andres> This is somewhat icky. I've not really thought about this very
Andres> much, but it's imo something to pay attention to.

This one was discussed in December or so - all the arguments were
thrashed out then. The bottom line is that reserving "cube" is really
painful due to contrib/cube, and of the possible workarounds, using
precedence rules to resolve it in the grammar is already being used
for some other constructs.

Andres> So, having quickly scanned through the patch, do I understand
Andres> correctly that the contentious problems are:

Andres> * Arguably this converts the execution *tree* into a DAG. Tom
Andres> seems to be rather uncomfortable with that. I am wondering
Andres> whether this really is a big deal - essentially this only
Andres> happens in a relatively 'isolated' part of the tree, right?
Andres> I.e. if those chained together nodes were considered one node,
Andres> there would not be any loops? Additionally, the way
Andres> parametrized scans work already essentially "violates" the
Andres> tree paradigm somewhat.

The major downsides as I see them with the current approach are:

1. It makes plans (and hence explain output) nest very deeply if you
have complex grouping sets (especially cubes with high
dimensionality). This can make explain very slow in the most extreme
cases (explain seems to behave as about O(N^3) in the nesting depth of
the plan, I don't know why). But it's important not to exaggerate this
effect: if anyone actually has a real-world example of a 12-d cube
I'll eat the headgear of their choice, and for an 8-d cube the explain
overhead is only on the order of ~40ms.

(A 12-d cube generates more than 35 megs of explain output, nested
about 1850 levels deep, and takes about 45 seconds to explain, though
only about 200ms to plan.)

In more realistic cases the nesting isn't too bad (a 4-d cube gives a
12-deep plan: 6 sorts and 6 agg nodes), but it's still somewhat less
readable than a union-based plan would be. Honestly, though, I think
that explain output aesthetics should not be a major determining
factor for implementations.

2. A union-based approach would have a chance of including AGG_HASHED
support without any significant code changes, just by using one
HashAgg node per qualifying grouping set. However, this would be
potentially significantly slower than teaching HashAgg to do multiple
grouping sets, and memory usage would be an issue. (The current
approach is specifically intended to allow the use of an AGG_HASHED
node as the head of the chain, once it has been extended to support
multiple grouping sets.)

Andres> There still might be better representations, about which I
Andres> want to think, don't get me wrong. I'm "just" not seeing this
Andres> as a fundamental problem.

The obvious alternative is this:

  -> CTE x
     -> entire input subplan here
  -> Append
     -> GroupAggregate
        -> Sort
           -> CTE Scan x
     -> GroupAggregate
        -> Sort
           -> CTE Scan x
     -> HashAggregate
        -> CTE Scan x
     [...]

which was basically what we expected to do originally. But all of the
existing code to deal with CTE / CTEScan is based on the assumption
that each CTE has a rangetable entry before planning starts, and it is
completely non-obvious how to arrange to generate such CTEs on the fly
while planning. Tom offered in December to look into that aspect for
us, and of course we've heard nothing about it since.

Andres> * The whole grammar/keyword issue. To me this seems to be a
Andres> problem of swallowing one out of several similarly coloured
Andres> poisonous pills.

Right. Which is why this issue was thrashed out months ago and the
current approach decided on. I consider this question closed.

Andres> Are those the two bigger controversial areas? Or are there
Andres> others in your respective views?

Another controversial item was the introduction of GroupedVar. The
need for this can be avoided by explicitly setting to NULL the
relevant columns of the representative group tuple when evaluating
result rows, but (a) I don't think that's an especially clean approach
(though I'm not pushing back very hard on it) and (b) the logic needed
in its absence is different between the current chaining
implementation and a possible union implementation, so I decided
against making any changes on wasted-effort grounds.

-- 
Andrew (irc:RhodiumToad)