From: Thomas Munro <thomas.mu...@enterprisedb.com>
Date: Wed 26 Jul 2017 19:58:20 NZST
Subject: [PATCH] Add support for parallel-aware hash joins.

Hi,

WRT the main patch:

- Echoing concerns from other threads (Robert: ping): I'm doubtful that
  it makes sense to size the number of parallel workers solely based on
  the parallel scan node's size.  I don't think it's this patch's job to
  change that, but to me it seriously amplifies the problem - I'd bet
  there are a lot of cases with nontrivial joins where the benefit from
  parallelism at the join level is bigger than at the scan level itself.
  And the number of rows in the upper nodes might also be bigger than at
  the scan node level, making it more important to have a higher number
  of workers.

- If I understand the code in initial_cost_hashjoin() correctly, we
  count the synchronization overhead once, independent of the number of
  workers.  But on the other hand we calculate the throughput by
  dividing by the number of workers.  Do you think that's right?
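
  To spell out what I mean - this is purely schematic, not the actual
  costing code, and all the names below are made up:

    /*
     * Schematic sketch: if the synchronization overhead is charged once per
     * join while the per-tuple work is divided by the number of
     * participants, adding workers shrinks one term but leaves the other
     * untouched.
     */
    static double
    sketch_parallel_hash_cost(double inner_rows, double per_tuple_cost,
                              double sync_overhead, double parallel_divisor)
    {
        double  startup = sync_overhead;    /* charged once, flat */
        double  run = (inner_rows * per_tuple_cost) / parallel_divisor;

        return startup + run;
    }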

- I haven't really grokked the deadlock issue you address. Could you
  expand the comments on that? Possibly somewhere central referenced by
  the various parts.

- maybe I'm overly paranoid, but it might not be bad to add some extra
  checks for ExecReScanHashJoin ensuring that it doesn't get called when
  workers are still doing something.
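
  Something along these lines, maybe - purely illustrative; the helper
  below is a name I invented, standing in for whatever the real "no
  workers are still attached" condition turns out to be:

    #include "postgres.h"
    #include "executor/nodeHashjoin.h"

    /* hypothetical helper - does not exist in the patch, name made up */
    extern bool ExecHashTableNoActiveParticipants(HashJoinTable hashtable);

    void
    ExecReScanHashJoin(HashJoinState *node)
    {
        /*
         * Hypothetical check: rescanning while workers are still using the
         * shared hash table would be bogus, so assert that can't happen.
         */
        Assert(node->hj_HashTable == NULL ||
               ExecHashTableNoActiveParticipants(node->hj_HashTable));

        /* ... existing rescan logic ... */
    }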

- seems like you're dereferencing tuple unnecessarily here - the result
  of the dsa_get_address() call is never used:

+       /*
+        * If we detached a chain of tuples, transfer them to the main hash table
+        * or batch storage.
+        */
+       if (regainable_space > 0)
+       {
+               HashJoinTuple tuple;
+
+               tuple = (HashJoinTuple)
+                       dsa_get_address(hashtable->area, detached_chain_shared);
+               ExecHashTransferSkewTuples(hashtable, detached_chain,
+                                          detached_chain_shared);
+
+               /* Remove from the total space used. */
+               LWLockAcquire(&hashtable->shared->chunk_lock, LW_EXCLUSIVE);
+               Assert(hashtable->shared->size >= regainable_space);
+               hashtable->shared->size -= regainable_space;
+               LWLockRelease(&hashtable->shared->chunk_lock);
+
+               /*
+                * If the bucket we removed is the same as the bucket the caller just
+                * overflowed, then we can forget about the overflowing part of the
+                * tuple.  It's been moved out of the skew hash table.  Otherwise, the
+                * caller will call again; eventually we'll either succeed in
+                * allocating space for the overflow or reach this case.
+                */
+               if (bucket_to_remove == bucketno)
+               {
+                       hashtable->spaceUsedSkew = 0;
+                       hashtable->spaceAllowedSkew = 0;
+               }
+       }
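
  I.e. the local variable and the dsa_get_address() call can presumably
  just go - untested, but something like:

    /*
     * If we detached a chain of tuples, transfer them to the main hash table
     * or batch storage.
     */
    if (regainable_space > 0)
    {
        /* no need for dsa_get_address() here - the result was never used */
        ExecHashTransferSkewTuples(hashtable, detached_chain,
                                   detached_chain_shared);

        /* ... rest as in the patch ... */
    }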


- The names here could probably be improved some:
+               case WAIT_EVENT_HASH_SHRINKING1:
+                       event_name = "Hash/Shrinking1";
+                       break;
+               case WAIT_EVENT_HASH_SHRINKING2:
+                       event_name = "Hash/Shrinking2";
+                       break;
+               case WAIT_EVENT_HASH_SHRINKING3:
+                       event_name = "Hash/Shrinking3";
+                       break;
+               case WAIT_EVENT_HASH_SHRINKING4:
+                       event_name = "Hash/Shrinking4";
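
  E.g. names that describe the phase rather than numbering it - the ones
  below are just placeholders, I don't know offhand what each of the four
  phases actually corresponds to:

                case WAIT_EVENT_HASH_SHRINK_ELECTING:
                        event_name = "Hash/Shrink/Electing";
                        break;
                case WAIT_EVENT_HASH_SHRINK_REPARTITIONING:
                        event_name = "Hash/Shrink/Repartitioning";
                        break;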

- why are we restricting setting rows_total to the parallel-aware case?

+       /*
+        * If parallel-aware, the executor will also need an estimate of the total
+        * number of rows expected from all participants so that it can size the
+        * shared hash table.
+        */
+       if (best_path->jpath.path.parallel_aware)
+       {
+               hash_plan->plan.parallel_aware = true;
+               hash_plan->rows_total = best_path->inner_rows_total;
+       }
+
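
  If there's no reason for the restriction, couldn't the estimate simply
  be provided unconditionally?  Hypothetical, assuming inner_rows_total
  is also populated in the non-parallel-aware case:

        /* always provide the total-row estimate to the executor */
        hash_plan->rows_total = best_path->inner_rows_total;
        if (best_path->jpath.path.parallel_aware)
                hash_plan->plan.parallel_aware = true;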

- seems we need a few more tests - I don't think the existing tests are
  going to properly exercise the skew optimization, multiple batches,
  etc.  This is nontrivial code, I'd really like to see high test
  coverage of the new code.

- might not hurt to reindent before the final submission

- Unsurprisingly, please implement the FIXME ;)


Regards,

Andres


