On Wed, Jul 10, 2019 at 04:51:02PM -0700, Melanie Plageman wrote:
> Okay, so, while I do have specific, actual code review/commitfest-y
> feedback for the patch in this thread registered for this commitfest,
> I wanted to defer that for a later email and use this one to cover
> a few higher-level issues.
>
> 1) How this patch's approach fits into the wider set of problems with
> hybrid hashjoin.
>
> 2) Parallel HashJoin implementation of this patch's approach
>
> I think implementing support for parallel hashjoin or explicitly
> disabling it would be the bare minimum for this patch, which is why I
> made 2 its own item. I've marked it as returned to author for this
> reason.


OK. I'm a bit confused / unsure what exactly our solution to the various
hashjoin issues is. I have not been paying attention to all the various
threads, but I thought we kinda pivoted to the BNL approach, no? I'm not
against pushing this patch (the slicing one) forward and then maybe adding
BNL on top.

> I do think that accounting for BufFile overhead when estimating the
> size of the hashtable during ExecChooseHashTableSize() so it can be
> used during planning is a worthwhile patch by itself (though I know it
> is not even part of this patch).


+1 to that

> I'll start with 2 since I have less to say there.
>
> From comments upthread, I take it this would not work with parallel
> hashjoin as expected. Is this because each worker operates on batches
> independently and now batches are lumped into slices?
>
> Thinking through a parallel-aware implementation, it seems like you
> would use slice-based barriers for the build phase but batch-based
> barriers for the probe phase to avoid getting out of sync (workers
> with outer tuples from one batch should not try to join those with
> tuples from another batch, even if in the same slice).
>
> You would, of course, need to add code to make slices work with
> SharedTuplestore -- caveat here is I still haven't tried to understand
> how parallel-aware hashjoin works/uses SharedTuplestore.


I don't know. I haven't thought about the parallel version very much. I
wonder if Thomas Munro has some thoughts about it ...

> Now, addressing 1, how this patch fits into the wider set of problems
> with current hybrid hashjoin:
>
> Thomas Munro nicely summarized roughly what I'm about to lay out like
> this (upthread) -- he called them "three separate but related
> problems":
>
> A. Broken escape valve: sometimes we generate a huge number of
> batches while trying to split up many duplicates, because of the
> presence of other more uniformly distributed keys. We could fix that
> with (say) a 95% rule.
> B. Lack of good alternative execution strategy when the escape valve
> is triggered. A batch cannot be split effectively, but cannot fit in
> work_mem, so for now we decide to ignore work_mem.
> C. Unmetered explosion of batches and thus BufFiles, probably usually
> caused by problem A, but theoretically also due to a real need for
> partitions.
>
> However, I would like to lay out the problem space a little bit
> differently (using the end of the alphabet to differentiate).
>
> The following scenarios are how you could end up running out of
> memory:
>
> Y. Plan-time underestimation of the number of required batches with
> relatively uniform data distribution
>
> In this case, the best join execution strategy is a plain hashjoin
> with spilling as needed. nbatches should be increased as needed,
> because the data is ~evenly distributed. Slicing should be employed
> when BufFile overhead exceeds some threshold for the ratio of
> work_mem to be used for BufFile overhead.


OK, makes sense. But at some point we get so many slices the overflow
files alone use more than work_mem. Of course, to hit that the
underestimate needs to be sufficiently serious. My understanding was that
we'll roll until that point and then switch to BNL.
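To make the BufFile overhead concern concrete, here is a back-of-the-envelope sketch. It assumes each open BufFile pins one BLCKSZ (8 kB) buffer and each batch needs two spill files (inner + outer); the function names and the per-batch accounting are illustrative, not the patch's actual code.

```python
# Back-of-the-envelope model of BufFile overhead.  Assumes one BLCKSZ
# (8 kB) buffer per open BufFile and two spill files per batch
# (inner + outer); illustrative numbers, not PostgreSQL's exact
# accounting.

BLCKSZ = 8192          # bytes buffered per open BufFile
FILES_PER_BATCH = 2    # inner-side + outer-side spill files

def buffile_overhead(nbatch):
    """Memory pinned by per-batch spill-file buffers, in bytes."""
    return nbatch * FILES_PER_BATCH * BLCKSZ

def max_unsliced_batches(work_mem_bytes, max_fraction=2/3):
    """Largest power-of-two batch count whose file buffers stay under
    the given fraction of work_mem (the 2*work_mem/3 rule discussed
    here); beyond this point batches would be lumped into slices."""
    nbatch = 1
    while buffile_overhead(nbatch * 2) <= work_mem_bytes * max_fraction:
        nbatch *= 2
    return nbatch

work_mem = 4 * 1024 * 1024            # 4 MB
print(max_unsliced_batches(work_mem))  # -> 128 with these assumptions
```

So with a 4 MB work_mem, the overflow-file buffers alone would blow through the 2/3 budget somewhere past a hundred batches, which is exactly when slicing (or, failing that, BNL) has to take over.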

> Z. Plan- and/or execution-time underestimation of the number of
> required batches with skewed data
>
> If you knew this at planning time, you could have picked another
> join type, though there might be cases where it would actually be
> less costly to use plain hashjoin for all batches except the bad batch
> and fall back to hash block nested loop join just for the duplicate
> values.
>
> If you could not have known this at planning time, the best join
> execution strategy is a hybrid hashjoin/hash block nested loop join.
>
> To do this, preview whether increasing nbatches would move tuples,
> and, if it would, do so (also employing slicing if BufFile overhead
> exceeds the threshold).
>
> If increasing nbatches wouldn't move tuples, process this batch with
> hash block nested loop join.


OK.
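The "preview" walk could be as cheap as counting tuples whose next-higher hash bit is set. A toy sketch, modeling batch assignment as hash & (nbatch - 1) -- a simplification of what ExecHashGetBucketAndBatch does, not actual PostgreSQL code:

```python
# Sketch of "previewing" an nbatch increase before committing to it.
# Batch assignment is modeled as (hash & (nbatch - 1)); after doubling
# nbatch, a tuple moves to a new batch iff the newly consulted hash
# bit is set.  Simplified model, not PostgreSQL's actual code.

def would_move(hashes, nbatch):
    """Fraction of this batch's tuples that would relocate if nbatch
    were doubled."""
    if not hashes:
        return 0.0
    moved = sum(1 for h in hashes if h & nbatch)   # test the new bit
    return moved / len(hashes)

# Batch 2 of 4, fed by uniformly distributed keys: half would move,
# so splitting helps.
uniform = [h for h in range(1000) if h & 3 == 2]
print(would_move(uniform, 4))          # -> 0.5

# The same batch filled by one duplicated key: nothing moves, so
# doubling nbatch is pointless and a fallback (BNL, or a bigger
# threshold) is needed instead.
skewed = [42] * 1000                   # 42 & 3 == 2, 42 & 4 == 0
print(would_move(skewed, 4))           # -> 0.0
```

The walk is O(tuples in the batch) and needs only the stored hash values, which is why it should be fairly cheap relative to actually splitting.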

> Essentially, what we want is logical units of tuples which are
> work_mem-sized. In some cases, each unit may contain multiple batches
> (a slice in Tomas' patch) and in other cases, each unit may contain
> only part of a batch (a chunk is the term I used in my hash block
> nested loop join patch [1]).


OK, although with slicing the work_mem-sized unit is still one batch. The
slice just ensures the metadata we need to keep in memory does not grow as
O(N) with the number of batches (instead it's O(log(N)) I think).
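A toy model of that metadata growth (hypothetical bookkeeping, just to show the O(log(N)) shape): once nbatch exceeds the number of batch files we can afford, each further doubling is absorbed by a single new slice file covering twice as many batches as the previous one.

```python
# Toy model of why slicing keeps spill-file metadata at O(log(nbatch))
# rather than O(nbatch): past max_files, each nbatch doubling adds
# just one slice file, covering twice as many batches as the previous
# slice.  Hypothetical bookkeeping, not the patch's actual accounting.

import math

def open_files(nbatch, max_files=64):
    if nbatch <= max_files:
        return nbatch                    # one spill file per batch
    doublings = int(math.log2(nbatch // max_files))
    return max_files + doublings         # plus one file per slice

for nbatch in (64, 128, 1024, 1 << 20):
    print(nbatch, open_files(nbatch))
# -> 64:64, 128:65, 1024:68, 1048576:78
```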

> For slicing, each unit, a slice, has multiple batches but one spill
> file.
> For hbnlj, each unit, a chunk, is one of multiple chunks in a single
> batch, all of which are in the same spill file (1 batch = 1 spill
> file).


Yep.
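Phrased as mappings between batches and spill files, the two units look like this (hypothetical helper names, illustration only):

```python
# The two work_mem-sized units, phrased as mappings between batches
# and spill files.  Hypothetical helpers, for illustration only.

# Slicing: many batches -> one spill file.
def slice_of(batchno, batches_per_slice):
    return batchno // batches_per_slice

# Hash block nested loop: one batch's spill file -> several
# work_mem-sized passes.
def chunks_in(batch_bytes, work_mem):
    return -(-batch_bytes // work_mem)       # ceiling division

print(slice_of(13, 4))                            # -> 3 (batches 12..15)
print(chunks_in(10 * (1 << 20), 4 * (1 << 20)))   # -> 3 (10 MB / 4 MB)
```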

> Thinking through it, it seems to make the most sense to split the work
> into ~3 independent pieces:
>
> patch1 - "preview" a batch increase (not yet written [I think])
> patch2 - slicing (Tomas' patch [2], but add in a threshold for the
> portion of work_mem that BufFile overhead is using)
> patch3 - hash block nested loop join (my patch [1])
>
> patch1 allows us to re-enable growth and was mentioned upthread, but I
> will quote it here for simplicity:
>
> > I think we can fix A by relaxing the escape valve condition, and then
> > rechecking it once in a while. So we fill work_mem, realize it didn't
> > actually reduce the batch size significantly and disable nbatch growth.
> > But at the same time we increase the threshold to 2x work_mem, and after
> > reaching it we "consider" a nbatch increase. That is, we walk the batch
> > and see how many tuples would move if we increased nbatch (that should be
> > fairly cheap) - if it helps, great, enable growth and split the batch. If
> > not, double the threshold again. Rinse and repeat.
>
> We don't want to fill up work_mem with BufFile overhead after
> increasing nbatches many times just to move a few tuples for one batch
> and end up disabling growth, thus making it so that later we can't
> increase nbatches and repartition for a batch that would nicely
> partition (like Hubert's case, I believe [3]).


Yes, this seems like a very reasonable plan. Also, I now see it actually
explains what the plan with BNL vs. slicing is.
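For what it's worth, the control flow of that "rinse and repeat" loop might look roughly like this; the names and the 5% cutoff are illustrative, not from any patch.

```python
# Sketch of the relaxed escape valve quoted above: when a batch
# overflows the current threshold, preview whether a split would move
# a useful share of its tuples; if yes, repartition, if no, double the
# threshold and keep filling.  move_fraction stands in for a preview
# walk over the batch; min_useful=0.05 is an arbitrary cutoff.

def escape_valve(batch_bytes, work_mem, move_fraction, min_useful=0.05):
    threshold = work_mem
    while batch_bytes > threshold:
        if move_fraction >= min_useful:
            return ("repartition", threshold)   # growth re-enabled
        threshold *= 2                          # rinse and repeat
    # The batch never became splittable: it simply lives under the
    # doubled threshold (or, with patch3, gets handed to hash BNL).
    return ("keep", threshold)

MB = 1 << 20
print(escape_valve(10 * MB, 4 * MB, 0.5))   # -> ('repartition', 4194304)
print(escape_valve(10 * MB, 4 * MB, 0.0))   # -> ('keep', 16777216)
```

The second call shows the pathological-duplicates case: the threshold doubles from 4 MB to 16 MB without ever splitting, which is exactly where BNL (patch3) would take over instead of just eating the memory.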

> We want to identify when re-partitioning would help and only do it
> then, and, for times when it wouldn't help, use a fallback strategy
> that still allows progress on the hashjoin. And, for some spiky data,
> where we have re-partitioned for the right reasons but there are
> still a lot of batches that are small enough that they could all fit
> in memory at once, we want to track them with as little overhead as
> possible -- lump them into slices.
>
> We should probably consider deciding to use slices based on some
> threshold for the portion of work_mem which is allowed to be occupied
> by BufFile overhead, instead of waiting until the BufFile overhead is
> literally taking up most of work_mem.


But that heuristic is already there, no? That's the "don't use more than
2*work_mem/3 for batch BufFiles" rule, at which point we start adding slices.

> The above summary is to address the concern in this thread about a
> holistic solution.
>
> I think the slicing patch is independent of both the hash block nested
> loop join patch and the "preview" mode for batch increasing.
>
> If slicing is made to work for parallel-aware hashjoin and the code is
> in a committable state (and probably has the threshold I mentioned
> above), then I think that this patch should go in.


Yes, I think this seems like a good plan.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


