Re: hashagg slowdown due to spill changes

Tomas Vondra Thu, 25 Jun 2020 02:11:10 -0700

On Wed, Jun 24, 2020 at 05:26:07PM -0700, Melanie Plageman wrote:

On Tue, Jun 23, 2020 at 10:06 AM Andres Freund <and...@anarazel.de> wrote:

Hi,

On 2020-06-23 09:23:57 -0700, Melanie Plageman wrote:
> On Mon, Jun 22, 2020 at 9:02 PM Andres Freund <and...@anarazel.de>
wrote:
> > It's not this patch's fault, but none, really none, of this stuff
should
> > be in the executor.
> >
> >
> Were you thinking it could be done in grouping_planner() and then the
> bitmaps could be saved in the PlannedStmt?
> Or would you have to wait until query_planner()? Or are you imagining
> somewhere else entirely?

I haven't thought about it in too much detail, but I would say
create_agg_plan() et al. I guess there's some argument to be made to do
it in setrefs.c, because we already do convert_combining_aggrefs() there
(but I don't like that much).

There's no reason to do it before we actually decided on one specific
path, so doing it earlier than create_plan() seems unnecessary. And
having it in agg specific code seems better than putting it into global
routines.

There's probably an argument for having a bit more shared code between
create_agg_plan(), create_group_plan() and
create_groupingsets_plan(). But even just adding a new extract_*_cols()
call to each of those would probably be ok.

So, my summary of this point in the context of the other discussion
upthread is:

Planner should extract the columns that hashagg will need later during
planning. Planner should not have HashAgg/MixedAgg nodes request smaller
targetlists from their children with CP_SMALL_TLIST to avoid unneeded
projection overhead.
Also, even this extraction should only be done when the number of groups
is large enough to suspect a spill.


IMO we should extract the columns irrespectedly of the estimates,
otherwise we won't be able to handle underestimates efficiently.


Not to stir the pot, but I did notice that hashjoin uses CP_SMALL_TLIST
in create_hashjoin_plan() for the inner side sub-tree and the outer side
one if there are multiple batches. I wondered what was different about
that vs hashagg (i.e. why it is okay to do that there).


Yeah. That means that if we have to start batching during execution, we
may need to spill much more datai. I'd say that's a hashjoin issue that
we should fix too (in v14).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: hashagg slowdown due to spill changes

Reply via email to