Hi,

This is not a real review.  I'm just scanning through the patch, without
reading the thread, to understand if I see something "worthy" of
controversy. While scanning I might have a couple of observations or
questions.


On 2015-03-13 15:46:15 +0000, Andrew Gierth wrote:

> + *     A list of grouping sets which is structurally equivalent to a ROLLUP
> + *     clause (e.g. (a,b,c), (a,b), (a)) can be processed in a single pass over
> + *     ordered data.  We do this by keeping a separate set of transition values
> + *     for each grouping set being concurrently processed; for each input tuple
> + *     we update them all, and on group boundaries we reset some initial subset
> + *     of the states (the list of grouping sets is ordered from most specific to
> + *     least specific).  One AGG_SORTED node thus handles any number of grouping
> + *     sets as long as they share a sort order.

I found "initial subset" not very clear, even though I probably guessed
the right meaning.
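
For illustration, here is a stand-alone toy sketch of the single-pass scheme described in that comment (hypothetical names, plain int sums as transition values; this is not the patch's code). The "initial subset" is always a *prefix* of the per-set state array, because the sets are ordered most specific first:

```c
#include <assert.h>

/* Grouping sets (a,b,c), (a,b), (a), ordered most specific first.  On a
 * group boundary, if key column j is the first one that changed, exactly
 * the sets that include column j -- sets 0 .. 2-j here -- have ended, so
 * we emit and reset that prefix of the state array and keep going. */

#define NSETS 3                         /* (a,b,c), (a,b), (a) */

typedef struct { int key[3]; int v; } Row;

static int sums[NSETS];                 /* one transition value per set */
static int nout;
static int out_set[32], out_sum[32];    /* (grouping set, sum) results */

static void emit_and_reset(int set)
{
    out_set[nout] = set;
    out_sum[nout] = sums[set];
    nout++;
    sums[set] = 0;
}

static void rollup_pass(const Row *rows, int n)     /* rows sorted by (a,b,c) */
{
    nout = 0;
    for (int s = 0; s < NSETS; s++)
        sums[s] = 0;
    for (int i = 0; i < n; i++)
    {
        if (i > 0)
        {
            int j = 0;
            while (j < 3 && rows[i].key[j] == rows[i - 1].key[j])
                j++;
            /* column j changed: the prefix of sets 0 .. 2-j is complete */
            for (int s = 0; j < 3 && s <= 2 - j; s++)
                emit_and_reset(s);
        }
        for (int s = 0; s < NSETS; s++)
            sums[s] += rows[i].v;       /* advance every set's state */
    }
    for (int s = 0; s < NSETS; s++)     /* end of input finishes all sets */
        emit_and_reset(s);
}
```

A plain GROUP BY then falls out as the degenerate case of a single grouping set.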

> + *     To handle multiple grouping sets that _don't_ share a sort order, we use
> + *     a different strategy.  An AGG_CHAINED node receives rows in sorted order
> + *     and returns them unchanged, but computes transition values for its own
> + *     list of grouping sets.  At group boundaries, rather than returning the
> + *     aggregated row (which is incompatible with the input rows), it writes it
> + *     to a side-channel in the form of a tuplestore.  Thus, a number of
> + *     AGG_CHAINED nodes are associated with a single AGG_SORTED node (the
> + *     "chain head"), which creates the side channel and, when it has returned
> + *     all of its own data, returns the tuples from the tuplestore to its own
> + *     caller.

This paragraph deserves to be expanded imo.

> + *     In order to avoid excess memory consumption from a chain of alternating
> + *     Sort and AGG_CHAINED nodes, we reset each child Sort node preemptively,
> + *     allowing us to cap the memory usage for all the sorts in the chain at
> + *     twice the usage for a single node.

What does resetting 'preemptively' mean?
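
If I read the comment right, it means each Sort is released as soon as the next node in the chain has consumed its output, rather than at normal end-of-plan cleanup. A toy model of the memory claim (hypothetical names, abstract memory units; not the patch's code):

```c
#include <assert.h>

/* In a chain of Sort -> AGG_CHAINED -> Sort -> ... each Sort holds up
 * to work_mem.  Once a Sort's output has been fully read by its parent,
 * that data can never be needed again, so resetting it "preemptively"
 * (before the whole chain finishes) leaves at most two sorts holding
 * memory at once: the one being drained and the one being filled. */

#define WORK_MEM  100
#define MAX_SORTS 8

static int live, peak;

static void sort_fill(int *mem)
{
    *mem = WORK_MEM;
    live += *mem;
    if (live > peak)
        peak = live;
}

static void sort_reset(int *mem)
{
    live -= *mem;
    *mem = 0;
}

static int peak_memory(int nsorts, int preemptive)
{
    int mem[MAX_SORTS] = {0};

    live = peak = 0;
    for (int i = 0; i < nsorts; i++)
    {
        sort_fill(&mem[i]);           /* sort i drains sort i-1 while filling */
        if (preemptive && i > 0)
            sort_reset(&mem[i - 1]);  /* child fully consumed: release it now */
    }
    for (int i = 0; i < nsorts; i++)  /* normal end-of-plan cleanup */
        if (mem[i] > 0)
            sort_reset(&mem[i]);
    return peak;
}
```

With preemptive resets the peak stays at twice one node's usage no matter how long the chain is; without them it grows linearly with the chain length.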

> + *     From the perspective of aggregate transition and final functions, the
> + *     only issue regarding grouping sets is this: a single call site (flinfo)
> + *     of an aggregate function may be used for updating several different
> + *     transition values in turn. So the function must not cache in the flinfo
> + *     anything which logically belongs as part of the transition value (most
> + *     importantly, the memory context in which the transition value exists).
> + *     The support API functions (AggCheckCallContext, AggRegisterCallback) are
> + *     sensitive to the grouping set for which the aggregate function is
> + *     currently being called.

Hm. I've seen a bunch of aggregates do this.

> + *     TODO: AGG_HASHED doesn't support multiple grouping sets yet.

Are you intending to resolve this before an eventual commit? Possibly
after the 'structural' issues are resolved? Or do you think this can
safely be put off for another release?

> @@ -534,11 +603,13 @@ static void
>  advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
>  {
>       int                     aggno;
> +     int         setno = 0;
> +     int         numGroupingSets = Max(aggstate->numsets, 1);
> +     int         numAggs = aggstate->numaggs;
>
> -     for (aggno = 0; aggno < aggstate->numaggs; aggno++)
> +     for (aggno = 0; aggno < numAggs; aggno++)
>       {
>               AggStatePerAgg peraggstate = &aggstate->peragg[aggno];
> -             AggStatePerGroup pergroupstate = &pergroup[aggno];
>               ExprState  *filter = peraggstate->aggrefstate->aggfilter;
>               int                     numTransInputs = peraggstate->numTransInputs;
>               int                     i;
> @@ -582,13 +653,16 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
>                                       continue;
>                       }
>
> -                     /* OK, put the tuple into the tuplesort object */
> -                     if (peraggstate->numInputs == 1)
> -                             tuplesort_putdatum(peraggstate->sortstate,
> -                                                                slot->tts_values[0],
> -                                                                slot->tts_isnull[0]);
> -                     else
> -                             tuplesort_puttupleslot(peraggstate->sortstate, slot);
> +                     for (setno = 0; setno < numGroupingSets; setno++)
> +                     {
> +                             /* OK, put the tuple into the tuplesort object */
> +                             if (peraggstate->numInputs == 1)
> +                                     tuplesort_putdatum(peraggstate->sortstates[setno],
> +                                                                        slot->tts_values[0],
> +                                                                        slot->tts_isnull[0]);
> +                             else
> +                                     tuplesort_puttupleslot(peraggstate->sortstates[setno], slot);
> +                     }
>               }

Hm. So a normal GROUP BY is just a subcase of grouping sets. Seems to
make sense, but worthwhile to mention somewhere in the intro.

> +     if (!node->agg_done)
> +     {
> +             /* Dispatch based on strategy */
> +             switch (((Agg *) node->ss.ps.plan)->aggstrategy)
> +             {
> +                     case AGG_HASHED:
> +                             if (!node->table_filled)
> +                                     agg_fill_hash_table(node);
> +                             result = agg_retrieve_hash_table(node);
> +                             break;
> +                     case AGG_CHAINED:
> +                             result = agg_retrieve_chained(node);
> +                             break;
> +                     default:
> +                             result = agg_retrieve_direct(node);
> +                             break;
> +             }
> +
> +             if (!TupIsNull(result))
> +                     return result;
> +     }

Maybe it's just me, but I get twitchy if I see a default being used like
this. I'd much, much rather see the two remaining AGG_* types listed
explicitly and get a warning from the compiler if a new one is added.
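
To make the suggestion concrete, here is a sketch (toy types and names, not PostgreSQL code) of enumerating every strategy instead of hiding AGG_PLAIN and AGG_SORTED behind "default":

```c
#include <assert.h>
#include <string.h>

/* With no default label, gcc/clang's -Wswitch (part of -Wall) warns the
 * moment a new enumerator is added and some switch forgets to handle it;
 * a default label silently swallows the new case instead. */

typedef enum { AGG_PLAIN, AGG_SORTED, AGG_HASHED, AGG_CHAINED } AggStrategy;

static const char *dispatch(AggStrategy strategy)
{
    switch (strategy)
    {
        case AGG_HASHED:
            return "agg_retrieve_hash_table";
        case AGG_CHAINED:
            return "agg_retrieve_chained";
        case AGG_PLAIN:
        case AGG_SORTED:
            return "agg_retrieve_direct";
    }
    return "unreachable";   /* silences "control reaches end" warnings */
}
```
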

> +             /*-
> +              * If a subgroup for the current grouping set is present, project it.
> +              *
> +              * We have a new group if:
> +              *  - we're out of input but haven't projected all grouping sets
> +              *    (checked above)
> +              * OR
> +              *    - we already projected a row that wasn't from the last grouping
> +              *      set
> +              *    AND
> +              *    - the next grouping set has at least one grouping column (since
> +              *      empty grouping sets project only once input is exhausted)
> +              *    AND
> +              *    - the previous and pending rows differ on the grouping columns
> +              *      of the next grouping set
> +              */
> +             if (aggstate->input_done
> +                     || (node->aggstrategy == AGG_SORTED
> +                             && aggstate->projected_set != -1
> +                             && aggstate->projected_set < (numGroupingSets - 1)
> +                             && nextSetSize > 0
> +                             && !execTuplesMatch(econtext->ecxt_outertuple,
> +                                                                     tmpcontext->ecxt_outertuple,
> +                                                                     nextSetSize,
> +                                                                     node->grpColIdx,
> +                                                                     aggstate->eqfunctions,
> +                                                                     tmpcontext->ecxt_per_tuple_memory)))

I'll bet this will look absolutely horrid after a pgindent run :/

> +/*
> + * We want to produce the absolute minimum possible number of lists here to
> + * avoid excess sorts. Fortunately, there is an algorithm for this; the problem
> + * of finding the minimal partition of a poset into chains (which is what we
> + * need, taking the list of grouping sets as a poset ordered by set inclusion)
> + * can be mapped to the problem of finding the maximum cardinality matching on
> + * a bipartite graph, which is solvable in polynomial time with a worst case of
> + * no worse than O(n^2.5) and usually much better. Since our N is at most 4096,
> + * we don't need to consider fallbacks to heuristic or approximate methods.
> + * (Planning time for a 12-d cube is under half a second on my modest system
> + * even with optimization off and assertions on.)

I think spelling out the long form of poset ("partially ordered set")
once would be a good thing.
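
For readers following along, here is an illustrative sketch of the mapping the comment describes (toy code, not the patch's Hopcroft-Karp implementation): the minimum chain partition of the grouping-set poset equals n minus the maximum matching in the bipartite graph that has an edge u -> v whenever set v is strictly contained in set u. For brevity this uses Kuhn's augmenting-path algorithm, which finds the same matching as Hopcroft-Karp, just in O(V*E) rather than O(E*sqrt(V)):

```c
#include <assert.h>

/* Grouping sets are represented as bitmasks of column numbers.  We build
 * the inclusion DAG with transitive edges, so minimum path cover equals
 * minimum chain partition, and compute it as n - max bipartite matching. */

#define MAXN 32

static int      nsets;
static unsigned masks[MAXN];
static int      match_v[MAXN];          /* u matched to v, or -1 */
static int      seen[MAXN];

static int subset_edge(int u, int v)    /* is v strictly contained in u? */
{
    return masks[u] != masks[v] && (masks[v] & ~masks[u]) == 0;
}

static int try_augment(int u)           /* Kuhn's augmenting-path step */
{
    for (int v = 0; v < nsets; v++)
    {
        if (!subset_edge(u, v) || seen[v])
            continue;
        seen[v] = 1;
        if (match_v[v] < 0 || try_augment(match_v[v]))
        {
            match_v[v] = u;
            return 1;
        }
    }
    return 0;
}

/* Minimum number of sort chains covering all grouping sets. */
static int min_chain_partition(const unsigned *sets, int n)
{
    int matching = 0;

    nsets = n;
    for (int i = 0; i < n; i++)
    {
        masks[i] = sets[i];
        match_v[i] = -1;
    }
    for (int u = 0; u < n; u++)
    {
        for (int v = 0; v < n; v++)
            seen[v] = 0;
        matching += try_augment(u);
    }
    return n - matching;
}
```

For example, a rollup (a,b,c), (a,b), (a) forms one chain, while a 2-column cube {a,b}, {a}, {b}, {} needs two: (a,b) > (a) > () and (b).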

> + * We use the Hopcroft-Karp algorithm for the graph matching; it seems to work
> + * well enough for our purposes.  This implementation is based on pseudocode
> + * found at:
> + *
> + * http://en.wikipedia.org/w/index.php?title=Hopcroft%E2%80%93Karp_algorithm&oldid=593898016
> + *
> + * This implementation uses the same indices for elements of U and V (the two
> + * halves of the graph) because in our case they are always the same size, and
> + * we always know whether an index represents a u or a v. Index 0 is reserved
> + * for the NIL node.
> + */
> +
> +struct hk_state
> +{
> +     int                     graph_size;             /* size of half the graph plus NIL node */
> +     int                     matching;
> +     short     **adjacency;          /* adjacency[u] = [n, v1,v2,v3,...,vn] */
> +     short      *pair_uv;            /* pair_uv[u] -> v */
> +     short      *pair_vu;            /* pair_vu[v] -> u */
> +     float      *distance;           /* distance[u], float so we can have +inf */
> +     short      *queue;                      /* queue storage for breadth search */
> +};

I wonder if it'd not be better to put this in a separate file. Most
readers just won't care about this bit and the file is long enough.

> -     if (!parse->hasAggs && !parse->groupClause && !root->hasHavingQual &&
> +     if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets && !root->hasHavingQual &&
>               !parse->hasWindowFuncs)
>       {
>       {

Looks like well above 80 characters.

>  %nonassoc    UNBOUNDED               /* ideally should have same precedence as IDENT */
> -%nonassoc    IDENT NULL_P PARTITION RANGE ROWS PRECEDING FOLLOWING
> +%nonassoc    IDENT NULL_P PARTITION RANGE ROWS PRECEDING FOLLOWING CUBE ROLLUP
>  %left                Op OPERATOR             /* multi-character ops and user-defined operators */

> +/*
> + * These hacks rely on setting precedence of CUBE and ROLLUP below that of '(',
> + * so that they shift in these rules rather than reducing the conflicting
> + * unreserved_keyword rule.
> + */
> +
> +rollup_clause:
> +                     ROLLUP '(' expr_list ')'
> +                             {
> +                                     $$ = (Node *) makeGroupingSet(GROUPING_SET_ROLLUP, $3, @1);
> +                             }
> +             ;
> +
> +cube_clause:
> +                     CUBE '(' expr_list ')'
> +                             {
> +                                     $$ = (Node *) makeGroupingSet(GROUPING_SET_CUBE, $3, @1);
> +                             }
> +             ;
> +
> +grouping_sets_clause:
> +                     GROUPING SETS '(' group_by_list ')'
> +                             {
> +                                     $$ = (Node *) makeGroupingSet(GROUPING_SET_SETS, $4, @1);
> +                             }
> +             ;
> +

This is somewhat icky. I've not really thought about this very much, but
it's imo something to pay attention to.


So, having quickly scanned through the patch, do I understand correctly
that the contentious problems are:

* Arguably this converts the execution *tree* into a DAG. Tom seems to
  be rather uncomfortable with that. I am wondering whether this really
  is a big deal - essentially this only happens in a relatively
  'isolated' part of the tree, right? I.e. if those chained-together
  nodes were considered one node, there would not be any loops?
  Additionally, the way parametrized scans work already essentially
  "violates" the tree paradigm somewhat.
  There still might be better representations, about which I want to
  think, don't get me wrong. I'm "just" not seeing this as a fundamental
  problem.
* The whole grammar/keyword issue. To me this seems to be a problem of
  swallowing one out of several similarly coloured poisonous
  pills, which we can't really avoid; i.e. it isn't really this
  patch's fault that we have to swallow one of them.

Are those the two bigger controversial areas? Or are there others in
your respective views?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
