From: Thomas Munro <thomas.mu...@enterprisedb.com>
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.
Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that it
  makes sense to size the number of parallel workers solely based on the
  parallel scan node's size. I don't think it's this patch's job to change
  that, but to me it seriously amplifies the problem - I'd bet there are a
  lot of cases with nontrivial joins where the benefit from parallelism at
  the join level is bigger than at the scan level itself. And the number of
  rows in the upper nodes might also be bigger than at the scan node level,
  making it more important to have a higher number of workers.

- If I understand the code in initial_cost_hashjoin() correctly, we count
  the synchronization overhead once, independent of the number of workers.
  But on the other hand we calculate the throughput by dividing by the
  number of workers. Do you think that's right?

- I haven't really grokked the deadlock issue you address. Could you expand
  the comments on that? Possibly somewhere central, referenced by the
  various parts.

- Maybe I'm overly paranoid, but it might not be bad to add some extra
  checks ensuring that ExecReScanHashJoin doesn't get called while workers
  are still doing something.

- Seems like you're dereferencing tuple unnecessarily here:

+	/*
+	 * If we detached a chain of tuples, transfer them to the main hash table
+	 * or batch storage.
+	 */
+	if (regainable_space > 0)
+	{
+		HashJoinTuple tuple;
+
+		tuple = (HashJoinTuple)
+			dsa_get_address(hashtable->area, detached_chain_shared);
+		ExecHashTransferSkewTuples(hashtable, detached_chain,
+								   detached_chain_shared);
+
+		/* Remove from the total space used. */
+		LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+		Assert(hashtable->shared->size >= regainable_space);
+		hashtable->shared->size -= regainable_space;
+		LWLockRelease(&hashtable->shared->chunk_lock);
+
+		/*
+		 * If the bucket we removed is the same as the bucket the caller just
+		 * overflowed, then we can forget about the overflowing part of the
+		 * tuple.
+		 * It's been moved out of the skew hash table. Otherwise, the
+		 * caller will call again; eventually we'll either succeed in
+		 * allocating space for the overflow or reach this case.
+		 */
+		if (bucket_to_remove == bucketno)
+		{
+			hashtable->spaceUsedSkew = 0;
+			hashtable->spaceAllowedSkew = 0;
+		}
+	}

- The names here could probably be improved some:

+			case WAIT_EVENT_HASH_SHRINKING1:
+				event_name = "Hash/Shrinking1";
+				break;
+			case WAIT_EVENT_HASH_SHRINKING2:
+				event_name = "Hash/Shrinking2";
+				break;
+			case WAIT_EVENT_HASH_SHRINKING3:
+				event_name = "Hash/Shrinking3";
+				break;
+			case WAIT_EVENT_HASH_SHRINKING4:
+				event_name = "Hash/Shrinking4";

- Why are we restricting the rows_total bit to parallel-aware?

+	/*
+	 * If parallel-aware, the executor will also need an estimate of the total
+	 * number of rows expected from all participants so that it can size the
+	 * shared hash table.
+	 */
+	if (best_path->jpath.path.parallel_aware)
+	{
+		hash_plan->plan.parallel_aware = true;
+		hash_plan->rows_total = best_path->inner_rows_total;
+	}
+

- Seems we need a few more tests - I don't think the existing tests are
  properly going to exercise the skew stuff, multiple batches, etc. This is
  nontrivial code; I'd really like to see high test coverage of the new
  code.

- Might not hurt to reindent before the final submission.

- Unsurprisingly, please implement the FIXME ;)

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers