On Thu, Jun 25, 2020 at 03:09:44PM -0700, Melanie Plageman wrote:
On Tue, Jun 23, 2020 at 3:24 PM Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

I started looking at the patch to refresh my knowledge both of this
patch and parallel hash join, but I think it needs a rebase. The
changes in 7897e3bb90 apparently touched some of the code.


Thanks so much for the review, Tomas!

I've attached a rebased patch which also contains updates discussed
below.


Thanks.


I assume
you're working on a patch addressing the remaining TODOs, right?


I wanted to get some feedback on the patch before working through the
TODOs to make sure I was on the right track.

Now that you are reviewing this, I will focus all my attention
on addressing your feedback. If there are any TODOs that you feel are
most important, let me know, so I can start with those.

Otherwise, I will prioritize parallel batch 0 spilling.

Feel free to work on the batch 0 spilling, please. I still need to get
familiar with various parts of the parallel hash join etc., so I don't
have any immediate feedback on which TODOs to work on first.

David Kimura plans to do a bit of work on parallel hash join batch 0
spilling tomorrow. Whatever is left after that, I will pick up next
week. Parallel hash join batch 0 spilling is the last large TODO that I
had.

My plan was then to focus on the feedback I get from you and anyone
else who reviews this (either about which TODOs are most important or
about issues outside the TODOs I've identified).


OK.


I see you've switched to "stripe" naming - I find that a bit confusing,
because when I hear stripe I think about RAID, where it means pieces of
data interleaved and stored on different devices. But maybe that's just
me and it's a good name. Maybe it'd be better to keep the naming and
only tweak it at the end, not to disrupt reviews unnecessarily.


I hear you about "stripe". I still quite like it, especially as compared
to its predecessor (originally, I called them chunks -- which is
impossible given that SharedTuplestoreChunks are a thing).


I don't think using chunks in one place means we can't use it elsewhere
in a different context. I'm sure we have "chunks" in other places. But
let's not bikeshed on this too much.

For ease of review, as you mentioned, I will keep the name for now. I am
open to changing it later, though.

I've been soliciting ideas for alternatives and, so far, folks have
suggested "stride", "step", "flock", "herd", "cohort", and "school". I'm
still on team "stripe" though, as it stands.


;-)



nodeHash.c
----------


1) MultiExecPrivateHash says this

   /*
    * Not subject to skew optimization, so either insert normally
    * or save to batch file if it belongs to another stripe
    */

I wonder what it means to "belong to another stripe". I understand what
that means for batches, which are identified by batchno computed from
the hash value. But I thought "stripes" are just work_mem-sized pieces
of a batch, so I don't quite understand this. Especially when the code
does not actually check "which stripe" the row belongs to.


I agree this was confusing.

"belongs to another stripe" meant here that if batch 0 falls back and we
are still loading it, once we've filled up work_mem, we need to start
saving those tuples to a spill file for batch 0. I've changed the
comment to this:

-        * or save to batch file if it belongs to another stripe
+       * or save to batch file if batch 0 falls back and we have
+       * already filled the hashtable up to space_allowed.
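
Conceptually, the decision for batch 0 once it has fallen back is
something like this (a simplified sketch, not the literal patch code --
tupleSize and the spill target are illustrative, while spaceUsed,
spaceAllowed, ExecHashTableInsert and ExecHashJoinSaveTuple are the
existing fields/functions):

    if (hashtable->spaceUsed + tupleSize <= hashtable->spaceAllowed)
    {
        /* still room up to space_allowed: insert into the in-memory table */
        ExecHashTableInsert(hashtable, slot, hashvalue);
    }
    else
    {
        /* batch 0 fell back and memory is full: spill to batch 0's file */
        bool        shouldFree;
        MinimalTuple tuple = ExecFetchSlotMinimalTuple(slot, &shouldFree);

        ExecHashJoinSaveTuple(tuple, hashvalue,
                              &hashtable->innerBatchFile[0]);
        if (shouldFree)
            heap_free_minimal_tuple(tuple);
    }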


OK. Silly question - what does "batch 0 falls back" mean? Does it mean
that we realized the hash table for batch 0 would not fit into work_mem,
so we switched to the "hashloop" strategy?


2) I find the hashloop_fallback fields rather confusing. We have one in
HashJoinTable (where it's an array of BufFile items) and another one in
ParallelHashJoinBatch (this time just a bool).

I think the one in HashJoinTable should be renamed to hashloopBatchFile
(similarly to the other BufFile arrays).


I think you are right about the name. I've changed the name in
HashJoinTableData to hashloopBatchFile.

The array of BufFiles hashloop_fallback was only used by serial
hashjoin. The boolean hashloop_fallback variable is used only by
parallel hashjoin.

The reason I had them named the same thing is that I thought it would be
nice to have a variable with the same name indicating whether a batch
"fell back" in both parallel and serial hashjoin--especially since we
check it in the main hashjoin state machine used by both.

In serial hashjoin, the BufFiles aren't identified by name, so I kept
them in that array. In parallel hashjoin, each ParallelHashJoinBatch has
the status saved in the struct. So, both represented the fallback status
of a batch.

However, I agree with you, so I've renamed the serial one to
hashloopBatchFile.
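
For reference, the two places now look roughly like this (just a sketch
of the relevant fields -- the comments are mine to summarize the above,
not the actual struct definitions from the patch):

    typedef struct HashJoinTableData
    {
        ...
        /* per-batch files, only created for batches that fell back (serial) */
        BufFile   **hashloopBatchFile;
        ...
    } HashJoinTableData;

    typedef struct ParallelHashJoinBatch
    {
        ...
        /* has this batch fallen back to the hashloop strategy? (parallel) */
        bool        hashloop_fallback;
        ...
    } ParallelHashJoinBatch;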


OK


Although I'm not sure why we even need
this file when we have innerBatchFile? BufFiles are not exactly free;
in fact, they're one of the problems for hashjoins with many batches.


Interesting -- it didn't even occur to me to combine the bitmap with the
inner side batch file data.
It definitely seems like a good idea to save the BufFile given that so
little data will likely go in it and that it has a 1-1 relationship with
inner side batches.

How might it work? Would you reserve some space at the beginning of the
file? When would you reserve the bytes? (Before adding tuples, you won't
know how many bytes you need, so it might be hard to make sure there is
enough space.) Would all inner side files have space reserved, or just
those of fallback batches?


Oh! So the hashloopBatchFile is only used for the bitmap? I hadn't
realized that. In that case it probably makes sense to keep it separate
from the files with spilled tuples; interleaving them somehow would be
way too complex, I think.

However, do we need an array of those files? I thought we only need the
bitmap until we've processed all rows from each "stripe", after which we
can throw it away, right? That would also mean we don't need to worry
about the memory usage too much, because the 8kB buffer goes away when
we call BufFileClose.
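
I.e. I'd expect the lifecycle to be something like this (just sketching
what I have in mind, not actual patch code; assuming the file only has
to live while we process a single fallen-back batch):

    /* when we start processing a fallen-back batch */
    BufFile    *bitmapFile = BufFileCreateTemp(false);

    /* ... work through the stripes, recording match status in the file ... */

    /* once the batch is done, emit unmatched outer tuples, then */
    BufFileClose(bitmapFile);   /* frees the file and its 8kB buffer */

so there would only ever be one of these open at a time, rather than an
array of them.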


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

