On Tue, Sep 19, 2017 at 8:06 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 10:01 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> 3. Gather Merge and Parallel Hash Join may have a deadlock problem.
>
> [...]
>
> Thomas and I spent about an hour and a half brainstorming about this
> just now.
>
> [...]
>
> First, as long as nbatches == 1, we can use a hash table of up to size
> (participants * work_mem); if we have to switch to multiple batches, then
> just increase the number of batches enough that the current memory usage
> drops below work_mem.  Second, following an idea originally by Ashutosh
> Bapat whose relevance to this issue Thomas Munro realized during our
> discussion, we can make all the batches small enough to fit in work_mem
> (rather than participants * work_mem as the current patch does) and spread
> them across the workers (in the style of Parallel Append, including
> potentially deploying multiple workers against the same batch if there are
> fewer batches than workers).
Here is an updated patch set that does that ^.  Assorted details:

1.  To avoid deadlocks, we can only wait at barriers when we know that all other attached backends have either arrived already or are actively executing the code preceding the barrier wait, so that progress is guaranteed.  The rule is that executor nodes can remain attached to a barrier after they've emitted a tuple, which is useful for resource management (ie it avoids inventing a separate reference counting scheme), but must never again wait for it.  With that programming rule there can be no deadlock between executor nodes.  (There is a small sketch of the rule after point 8 below.)

2.  Multiple batches are processed at the same time in parallel, rather than being processed serially.  Participants try to spread themselves out over different batches to reduce contention.  (Also sketched below.)

3.  I no longer try to handle outer joins.  I have an idea for how to do that while adhering to the above deadlock-avoidance programming rule, but I would like to consider that for a later patch.

4.  I moved most of the parallel-aware code into ExecParallelHash*() functions that exist alongside the private hash table versions.  This avoids uglifying the existing hash join code and introducing conditional branches all over the place, as requested by Andres.  This made some of the earlier refactoring patches unnecessary.

5.  Inner batch repartitioning, if required, is now completed up front for Parallel Hash.  Rather than waiting until we try to load hash tables back into memory to discover that they don't fit, this version tracks the size of hash table contents while writing the batches out.  That change has various pros and cons, but its main purpose is to remove dependencies between batches.

6.  Outer batch repartitioning is now done up front too, if it's necessary.  This removes the dependencies that exist between batch 0 and later batches, but means that outer batch 0 is now written to disk for multi-batch joins.  I don't see any way to avoid that while adhering to the deadlock-avoidance rule stated above.  If we've already started emitting tuples for batch 0 (by returning control to higher nodes) then we have no deadlock-free way to wait for the scan of the outer relation to finish, so we can't safely process later batches.  Therefore we must buffer batch 0's tuples.  This renders the skew optimisation useless.

7.  There is now some book-keeping state for each batch.  For typical cases the total space used is negligible, but obviously you can contrive degenerate cases where it eats a lot of memory (say by creating a million batches, which is unlikely to work well for other reasons).  I have some ideas on reducing its size, but on the other hand I also have some ideas on increasing it profitably... (this is the perfect place to put the Bloom filters discussed elsewhere that would make up for the loss of the skew optimisation, for selective joins; a subject for another day).

8.  Barrier API extended slightly: (1) BarrierWait() is renamed to BarrierArriveAndWait(); (2) BarrierArriveAndDetach() is new.  The new function is the mechanism required to get from PHJ_BATCH_PROBING to PHJ_BATCH_DONE without waiting, and corresponds to the operation known as Phaser.arriveAndDeregister() in Java (and maybe also barrier::arrive_and_drop() in the C++ concurrency TS and Clock.drop() in X10, not sure).
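To make the rule in point 1 and the renamed/new calls in point 8 concrete, here's a minimal sketch of the intended usage pattern.  This is not code from the patch: the helpers and the wait-event constant are invented for illustration, and only the two Barrier calls are the real (renamed/new) API.

    /*
     * Sketch only, not from the patch.  A participant may wait at the
     * barrier while every attached backend is still known to be running
     * this same code; after it has emitted tuples it may only detach.
     */
    static void
    participant(Barrier *barrier)
    {
        do_build_work();        /* hypothetical: contribute to the shared hash table */

        /*
         * Safe to wait: nobody has emitted a tuple yet, so all attached
         * backends are guaranteed to arrive here eventually.
         */
        BarrierArriveAndWait(barrier, WAIT_EVENT_SOMETHING);    /* was BarrierWait(); wait event name invented */

        /*
         * From here on tuples are returned to the parent node.  We may stay
         * attached (so shared resources stay alive without a separate
         * reference counting scheme), but we must never wait on this barrier
         * again: a peer could be stuck on a full tuple queue and never arrive.
         */
        emit_tuples();          /* hypothetical: probe and return matches */

        /* Finished: move on for ourselves without waiting for the others. */
        BarrierArriveAndDetach(barrier);
    }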
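And a rough sketch of the batch-choosing idea in point 2, again with made-up helpers rather than the patch's actual functions: each participant starts its search at a different batch, so participants tend to fan out instead of piling onto the same batch.

    /*
     * Sketch only: batch_done() and participant_number are stand-ins,
     * not the patch's API.
     */
    static int
    choose_next_batch(int participant_number, int nbatches)
    {
        int     i;

        for (i = 0; i < nbatches; i++)
        {
            int     batchno = (participant_number + i) % nbatches;

            if (!batch_done(batchno))
                return batchno;     /* help with this one */
        }
        return -1;                  /* no batch needs more work */
    }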
9.  I got rid of PARALLEL_KEY_EXECUTOR_NODE_NTH().  Previously I wanted more than one reserved shm_toc key per executor node.  Robert didn't like that.

10.  I got rid of "LeaderGate".  That earlier attempt at deadlock avoidance clearly missed the mark.  (I think it probably defended against the Gather + full TupleQueue deadlock but not the GatherMerge-induced deadlock, so it was a useless non-general solution.)

11.  The previous incarnation of SharedTuplestore had a built-in concept of partitions, which allowed the number of partitions to be expanded at any time but only allowed one partition to be read back in at a time.  That worked for the earlier kind of sequential partition processing but doesn't work for parallel partition processing.  I chose to get rid of the built-in partitioning concept and create a separate SharedTuplestore for each batch, since I now have a place to store per-batch information in memory.

12.  The previous incarnation of BufFileSet had a concept of "stripe" used for distributing temporary files over tablespaces; this is now gone from the API.  Files are distributed over your configured temp_tablespaces without help from client code.

13.  For now the rescan optimisation (ie reusing the hash table) is not enabled for Parallel Hash.  I had to provide ExecHashJoinReInitializeDSM (for 41b0dd98), but haven't figured out how to reach it yet and didn't want to wait any longer before posting a new patch, so that function is effectively blind code and will probably require adjustments.

I've also attached a test that shows a worker starting up and attaching at every phase.

--
Thomas Munro
http://www.enterprisedb.com
parallel-hash-v21.patchset.tgz
Description: GNU Zip compressed data
parallel-hash-all-arrival-phases.sql
Description: Binary data