On Sat, Jan 7, 2017 at 9:01 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> Stepping back a bit, I am aware of the following approaches to hash
> join parallelism:
>
> 1.  Run the inner plan and build a private hash table in each
> participant [...].
>
> 2.  Run a partition-wise hash join[1].  [...]
>
> 3.  Repartition the data on the fly, and then run a partition-wise
> hash join.  [...]
>
> 4.  Scatter both the inner and outer plans arbitrarily across
> participants [...], and build a shared hash
> table.  [...]
>
> [...]  I suspect that 4 is probably a better
> fit than 3 for Postgres today, because the communication overhead of
> shovelling nearly all tuples through extra tuple queues to route them
> to the right hash table would surely be very high, though I can see
> that it's very attractive to have a reusable tuple repartitioning
> operator and then run k disjoint communication-free joins (again,
> without code change to the join operator, and to the benefit of all
> join operators).
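Since 4 is the approach I suspect fits best, here is roughly the shape
the "no-partitioning" literature has in mind for the build phase: one
chained table shared by all workers, with entries pushed onto the
bucket heads by compare-and-swap.  This is just a toy sketch with C11
atomics and made-up names, not executor code:

/*
 * Toy sketch of a "no-partitioning" shared hash table build: every
 * worker inserts into one chained table, pushing entries onto the
 * bucket heads with compare-and-swap.  Hypothetical names only.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HashEntry
{
	struct HashEntry *next;		/* chain link within a bucket */
	uint32_t	hash;			/* cached hash of the join key */
	int			key;			/* toy join key */
} HashEntry;

typedef struct SharedHashTable
{
	_Atomic(HashEntry *) *buckets;	/* would live in a DSM segment */
	size_t		nbuckets;			/* power of two */
} SharedHashTable;

/* Called concurrently by any number of participants. */
static void
shared_hash_insert(SharedHashTable *ht, HashEntry *entry)
{
	_Atomic(HashEntry *) *head =
		&ht->buckets[entry->hash & (ht->nbuckets - 1)];
	HashEntry  *old = atomic_load(head);

	/* A failed CAS reloads 'old', so just keep retrying the push. */
	do
	{
		entry->next = old;
	} while (!atomic_compare_exchange_weak(head, &old, entry));
}

/* Probing is plain chain-walking once all workers agree the build
 * phase is finished (some barrier would enforce that). */
static HashEntry *
shared_hash_probe(SharedHashTable *ht, uint32_t hash, int key)
{
	HashEntry  *e = atomic_load(&ht->buckets[hash & (ht->nbuckets - 1)]);

	for (; e != NULL; e = e->next)
		if (e->hash == hash && e->key == key)
			return e;
	return NULL;
}

The nice property is that the build phase never blocks: contention on
a hot bucket shows up only as occasional CAS retries, not lock waits.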
On this topic, I recently stumbled on the 2011 paper "Design and
Evaluation of Main Memory Hash Join Algorithms for Multi-core CPUs"[1]
and found it reassuring.  It compares simple shared hash tables to
some state-of-the-art repartitioning approaches (including the radix
join algorithm, which performs the amazing feat of building a lot of
cacheline-sized hash tables and then running with minimal cache
misses).

From the introduction:

"Second, we show that an algorithm that does not do any partitioning,
but simply constructs a single shared hash table on the build relation
often outperforms more complex algorithms.  This simple
“no-partitioning” hash join algorithm is robust to sub-optimal
parameter choices by the optimizer, and does not require any knowledge
of the characteristics of the input to work well.  To the best of our
knowledge, this simple hash join technique differs from what is
currently implemented in existing DBMSs for multi-core hash join
processing, and offers a tantalizingly simple, efficient, and robust
technique for implementing the hash join operation."

"Finally, we show that the simple “no-partitioning” hash join
algorithm takes advantage of intrinsic hardware optimizations to
handle skew.  As a result, this simple hash join technique often
benefits from skew and its relative performance increases as the skew
increases!  This property is a big advancement over the
state-of-the-art methods, as it is important to have methods that can
gracefully handle skew in practice [8]."

(That relates to SMT pipelining compensating for the extra cache
misses during probing: the core can do thread A's work while thread
B's memory fetch is in flight.)

From the conclusion:

"... Our results show that a simple hash join technique that does not
do any partitioning of the input relations often outperforms the other
more complex partitioning-based join alternatives.  In addition, the
relative performance of this simple hash join technique rapidly
improves with increasing skew, and it outperforms every other
algorithm in the presence of even small amounts of skew."

For balance, the authors of a 2013 paper "Main-Memory Hash Joins on
Multi-Core CPUs: Tuning to the Underlying Hardware"[2] are less keen
on the simple "hardware-oblivious" "no-partitioning" approach and
don't buy the other paper's ideas about SMT.  Incidentally, their
results on the benefits of large (huge) pages are interesting (Table
VI) and suggest that huge page support for DSM segments could be good
here; a rough sketch of that follows at the end of this mail.

[1] https://pdfs.semanticscholar.org/9de4/b32f2c7b630a4f6aae6994a362a46c7c49e9.pdf
[2] https://www.inf.ethz.ch/personal/cagri.balkesen/publications/parallel-joins-icde13.pdf

--
Thomas Munro
http://www.enterprisedb.com
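PS.  For the huge pages aside above: on Linux that could look like the
try-huge-pages-then-fall-back dance that (IIRC) we already do for the
main shared memory segment, i.e. mmap with MAP_HUGETLB.  An anonymous
MAP_SHARED mapping is only shared across fork() though, so the real
DSM implementations would need more than this; it's a sketch of the
idea only, with made-up names:

/*
 * Sketch: try to back a shared anonymous mapping with huge pages,
 * falling back to normal pages.  Linux-specific (MAP_HUGETLB), and
 * 'size' must be a multiple of the huge page size for the first
 * mmap to succeed; huge pages must be reserved via vm.nr_hugepages.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *
create_shared_segment(size_t size)
{
	void	   *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
						 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED)		/* no huge pages available: plain pages */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	return (p == MAP_FAILED) ? NULL : p;
}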