On Wed, Mar 8, 2017 at 1:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@enterprisedb.com> writes:
>> I have been wondering about a couple of different worst case execution
>> strategies that would be better than throwing our hands up and
>> potentially exploding memory once we detect that further partitioning
>> is not going to help, if we still manage to reach that case despite
>> adding stats-based defences like this due to statistics being missing,
>> bad or confused by joins below us.
>
> Yeah, it would definitely be nice if we could constrain the runtime
> space consumption better.
>
>> 1. We could probe the fraction of the hash table that we have managed
>> to load into work_mem so far and then rewind the outer batch and do it
>> again for the next work_mem-sized fraction of the inner batch and so
>> on. For outer joins we'd need to scan for unmatched tuples after each
>> hash table refill.
>
> I do not understand how that works for a left join? You'd need to track
> whether a given outer tuple has been matched in any one of the fractions
> of the inner batch, so that when you're done with the batch you could know
> which outer tuples need to be emitted null-extended. Right now we only
> need to track that state for the current outer tuple, but in a rescan
> situation we'd have to remember it for each outer tuple in the batch.
>
> Perhaps it could be done by treating the outer batch file as read/write
> and incorporating a state flag in each tuple; or to reduce write volume,
> maintaining a separate outer batch file parallel to the main one with just
> a bool or even just a bit per outer tuple. Seems messy though.
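[For concreteness, here is a minimal sketch of the "bit per outer tuple" bookkeeping described in the last quoted paragraph. All names and structures are hypothetical, not actual PostgreSQL executor code, and error handling is omitted.]

/*
 * Hypothetical sketch only: one matched bit per outer tuple in the
 * batch, kept separately from the outer batch file, so that after
 * probing the inner batch in several work_mem-sized fragments we
 * still know which outer tuples need to be emitted null-extended.
 */
#include <stdbool.h>
#include <stdlib.h>

typedef struct OuterMatchBits
{
    size_t          ntuples;    /* number of outer tuples in the batch */
    unsigned char  *bits;       /* one bit per outer tuple, initially 0 */
} OuterMatchBits;

static OuterMatchBits *
outer_match_bits_create(size_t ntuples)
{
    OuterMatchBits *omb = malloc(sizeof(OuterMatchBits));

    omb->ntuples = ntuples;
    omb->bits = calloc((ntuples + 7) / 8, 1);
    return omb;
}

static void
outer_match_bits_set(OuterMatchBits *omb, size_t tupno)
{
    omb->bits[tupno / 8] |= (unsigned char) (1 << (tupno % 8));
}

static bool
outer_match_bits_test(const OuterMatchBits *omb, size_t tupno)
{
    return (omb->bits[tupno / 8] & (1 << (tupno % 8))) != 0;
}

/*
 * Intended usage, following the description above:
 *
 *   for each work_mem-sized fragment of the inner batch:
 *       load the fragment into the hash table;
 *       rewind the outer batch and probe it, calling
 *       outer_match_bits_set(omb, tupno) on every match;
 *   then rewind the outer batch one last time and null-extend every
 *   tuple for which outer_match_bits_test(omb, tupno) is false.
 *
 * For a batch too large to keep the bitmap in memory, the bits would
 * themselves have to live in a file parallel to the outer batch file,
 * which is where it gets messy.
 */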
Right. Messy. I think what I described may fall under the category of
"block nested loop". It looks doable but not very appealing for left
joins, and performance seems not great, since it multiplies the probing
scans by the number of fragments. I'm not entirely sure whether we
actually care about performance at all once we've reached this emergency
state and are primarily concerned with completing the query.

Another idea would be to identify the offending bucket (how?) and spill
it to disk in its own file, and track it by pushing a special control
object with a distinguishing header flag into the hash table (or a new
overflow table, or extend the duties of the skew table, or...). We'd
have to deal with the matched flags of spilled inner tuples for
right/full joins. Matching is really per-key, not per-tuple, so if there
is a control object in memory for each of these "overflow" buckets then
perhaps it could hold the matched flag that covers all tuples with each
distinct key.

What I like about this is that it doesn't change the join algorithm at
all; it just bolts on a per-bucket escape valve. The changes might be
quite localised, though I know someone who probably wouldn't like an
extra branch in ExecScanHashBucket().

--
Thomas Munro
http://www.enterprisedb.com
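[For illustration, a rough sketch of the per-bucket "escape valve" idea described above: a control object left in the bucket chain, pointing at a spill file and holding one matched flag per distinct spilled key. All struct and function names are hypothetical, not actual PostgreSQL code; a real implementation would use BufFiles and the executor's tuple representation rather than stdio.]

/*
 * Hypothetical sketch only: a control object replaces an oversized
 * bucket's spilled tuples in the bucket chain; a header flag
 * distinguishes it from ordinary entries, and it carries one matched
 * flag per distinct spilled key so that right/full joins can still
 * emit unmatched inner tuples later.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRY_IS_OVERFLOW 0x01      /* header flag marking a control object */

typedef struct BucketEntry
{
    struct BucketEntry *next;       /* next entry in this bucket's chain */
    uint32_t    hashvalue;
    uint8_t     flags;              /* ENTRY_IS_OVERFLOW for control objects */
} BucketEntry;

typedef struct OverflowKeyState
{
    uint32_t    hashvalue;          /* hash of one distinct spilled key */
    bool        matched;            /* covers all spilled tuples with that key */
    struct OverflowKeyState *next;
} OverflowKeyState;

typedef struct OverflowControl
{
    BucketEntry entry;              /* sits in the chain like a normal entry */
    FILE       *spill_file;         /* this bucket's spilled inner tuples */
    OverflowKeyState *keys;         /* per-distinct-key matched flags */
} OverflowControl;

/*
 * The "extra branch" during probing: ordinary entries are matched in
 * memory as usual; a control object just tells the caller that this
 * probe must also be checked against the bucket's spill file.
 */
static bool
scan_bucket_with_overflow(BucketEntry *head, uint32_t probe_hash,
                          bool *check_spill_file)
{
    BucketEntry *entry;

    *check_spill_file = false;
    for (entry = head; entry != NULL; entry = entry->next)
    {
        if (entry->flags & ENTRY_IS_OVERFLOW)
        {
            *check_spill_file = true;   /* part of this bucket lives on disk */
            continue;
        }
        if (entry->hashvalue == probe_hash)
            return true;                /* real code would recheck join quals */
    }
    return false;
}

[The spilled tuples would then be joined in a later pass over the bucket's spill file, with matched flags updated per distinct key rather than per tuple, as described in the message above.]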